PARQUET-3479: Add configuration to disable early dictionary compression check #3556

Open
yadavay-amzn wants to merge 1 commit into apache:master from yadavay-amzn:fix/parquet_3479
@yadavay-amzn
Contributor

yadavay-amzn commented May 11, 2026

Problem

FallbackValuesWriter calls isCompressionSatisfying() after the first page to decide whether dictionary encoding is worthwhile. With modern page-index defaults (~20k rows per page), this check fires too early for moderate-cardinality columns: the first page can contain almost as many distinct values as rows, so the dictionary has not yet paid for itself. Dictionary encoding is then abandoned before enough data has accumulated to show its benefit, and the resulting files are significantly larger.

As reported in #3479, a column of 1M int64 values taken mod 32768 produces 8.4 MB with the premature fallback vs. 2.2 MB when dictionary encoding is preserved.
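
The reported sizes can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below is only an illustration, not Parquet code: the class and method names are hypothetical, "1M" is assumed to mean 2^20 values, the first 20,000-row page is assumed to contain 20,000 distinct values (as a sequential counter mod 32768 would), and page headers, RLE runs, and any compression codec are ignored. The form of the check (encoded size plus dictionary size must be smaller than raw size) is likewise an assumption about what isCompressionSatisfying() compares.

```java
/** Rough size model for plain vs. dictionary encoding; not Parquet's writer code. */
final class DictionarySizeEstimate {

    /** Bytes needed to store every value with plain encoding. */
    static long plainBytes(long valueCount, int bytesPerValue) {
        return valueCount * bytesPerValue;
    }

    /** Minimum number of bits needed to index distinctCount dictionary entries. */
    static int indexBitWidth(long distinctCount) {
        if (distinctCount <= 1) {
            return 0;
        }
        return 64 - Long.numberOfLeadingZeros(distinctCount - 1);
    }

    /** Bytes for the bit-packed dictionary indices only (no headers, no RLE). */
    static long indexBytes(long valueCount, long distinctCount) {
        long bits = valueCount * indexBitWidth(distinctCount);
        return (bits + 7) / 8;
    }

    /** Dictionary page plus bit-packed indices. */
    static long dictionaryBytes(long valueCount, long distinctCount, int bytesPerValue) {
        return distinctCount * bytesPerValue + indexBytes(valueCount, distinctCount);
    }

    /** Assumed shape of the first-page check: dictionary output must beat the raw size. */
    static boolean compressionSatisfying(long rawBytes, long encodedBytes, long dictionaryPageBytes) {
        return encodedBytes + dictionaryPageBytes < rawBytes;
    }

    public static void main(String[] args) {
        long values = 1L << 20;   // assumption: "1M" means 2^20
        long distinct = 32768;
        System.out.println("plain bytes:      " + plainBytes(values, 8));
        System.out.println("dictionary bytes: " + dictionaryBytes(values, distinct, 8));

        // First page only: 20,000 rows, assumed to be all distinct.
        long pageRows = 20_000;
        long raw = plainBytes(pageRows, 8);
        long encoded = indexBytes(pageRows, pageRows);
        System.out.println("first page passes check: "
                + compressionSatisfying(raw, encoded, pageRows * 8));
    }
}
```

Under these assumptions the whole column comes to 8,388,608 bytes plain and 2,228,224 bytes dictionary-encoded, in line with the 8.4 MB and 2.2 MB figures above, while the first page alone fails the check (197,500 bytes of dictionary output against 160,000 raw bytes).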

Fix

Add a configurable property ParquetProperties.isDictionaryEarlyCheckEnabled() (default: true for backward compatibility) that controls whether the first-page compression check is performed in FallbackValuesWriter.getBytes().

When disabled, dictionary encoding is only abandoned when the dictionary itself exceeds size limits (shouldFallBack()), not based on the first-page compression ratio.
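
The intended decision can be sketched as a small truth table. This is a simplified stand-in and not the actual patch: the class, enum, and method below are hypothetical, and the three booleans merely stand in for isDictionaryEarlyCheckEnabled(), shouldFallBack(), and the result of isCompressionSatisfying() on the first page.

```java
/** Simplified model of the first-page decision described above; names are illustrative. */
final class FallbackDecisionSketch {

    enum Encoding { DICTIONARY, FALLBACK }

    /**
     * @param earlyCheckEnabled     stands in for ParquetProperties.isDictionaryEarlyCheckEnabled()
     * @param dictionaryTooLarge    stands in for shouldFallBack()
     * @param compressionSatisfying stands in for the first-page isCompressionSatisfying() result
     */
    static Encoding chooseAfterFirstPage(boolean earlyCheckEnabled,
                                         boolean dictionaryTooLarge,
                                         boolean compressionSatisfying) {
        // The size limit always applies, whether or not the early check is enabled.
        if (dictionaryTooLarge) {
            return Encoding.FALLBACK;
        }
        // The compression-ratio check is only consulted when the flag is on.
        if (earlyCheckEnabled && !compressionSatisfying) {
            return Encoding.FALLBACK;
        }
        return Encoding.DICTIONARY;
    }

    public static void main(String[] args) {
        // Poor first-page ratio, dictionary within limits:
        System.out.println(chooseAfterFirstPage(true, false, false));
        System.out.println(chooseAfterFirstPage(false, false, false));
    }
}
```

With the flag left at its default of true, a poor first-page ratio still triggers the fallback; with the flag set to false, only the size limit does.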

Changes

  • ParquetProperties: added dictionaryEarlyCheckEnabled field, getter, and builder method
  • FallbackValuesWriter: added an overloaded of() factory and a constructor accepting the flag; guarded the isCompressionSatisfying() call
  • DefaultValuesWriterFactory: passes the config through to FallbackValuesWriter.of()
  • New test TestFallbackValuesWriter: verifies dictionary encoding is preserved when the check is disabled

Testing

  • New unit tests pass (2/2)
  • Existing parquet-column tests unaffected (default true preserves existing behavior)

@yadavay-amzn
Contributor Author

@Fokko Could you take a look? This adds a config (parquet.dictionary.early.check.enabled) to disable the first-page compression check in FallbackValuesWriter. With modern page-index defaults (~20k rows/page), the check fires too early for moderate-cardinality columns, abandoning dictionary encoding prematurely. Includes unit test + E2E integration test writing real Parquet files. Thanks!

@wgtmac
Member

wgtmac commented Jun 3, 2026

Thanks for looking into this!

While the problem is real, introducing a boolean flag dictionaryEarlyCheckEnabled feels like a band-aid fix that pushes the burden to users. Most users won't know when to manually toggle this to prevent storage inflation.

Instead of a new config, could we make the heuristic more adaptive? For example, we could delay the compression check until we've accumulated a certain amount of raw data (e.g., 1MB), or evaluate it over the first N pages rather than just the first one.

This would solve the issue out of the box without hurting usability. Thoughts?
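
One possible shape for the adaptive heuristic suggested above, offered only as a sketch: the class, its fields, and the onPage() method are hypothetical and do not exist in Parquet, and the 1 MiB threshold is just the example figure from the comment. The idea is to accumulate raw and encoded sizes across pages and to run the compression comparison once, after enough raw data has been seen.

```java
/** Hypothetical deferred compression check: decide only after minRawBytes of input. */
final class DeferredCompressionCheck {
    private final long minRawBytes;
    private long rawBytes;
    private long encodedBytes;
    private boolean decided;
    private boolean keepDictionary = true;

    DeferredCompressionCheck(long minRawBytes) {
        this.minRawBytes = minRawBytes;
    }

    /**
     * Records one finished page and returns true while dictionary encoding should be kept.
     *
     * @param pageRawBytes     plain-encoded size of the values in this page
     * @param pageEncodedBytes size of the dictionary indices written for this page
     * @param dictionaryBytes  current total size of the dictionary
     */
    boolean onPage(long pageRawBytes, long pageEncodedBytes, long dictionaryBytes) {
        if (decided) {
            return keepDictionary;
        }
        rawBytes += pageRawBytes;
        encodedBytes += pageEncodedBytes;
        if (rawBytes < minRawBytes) {
            return true; // not enough evidence yet; keep dictionary encoding
        }
        decided = true;
        keepDictionary = encodedBytes + dictionaryBytes < rawBytes;
        return keepDictionary;
    }

    boolean isDecided() {
        return decided;
    }

    public static void main(String[] args) {
        DeferredCompressionCheck check = new DeferredCompressionCheck(1L << 20);
        // Pages of 20,000 int64 values cycling through 32768 distinct values,
        // with 15-bit indices (37,500 bytes per page); all figures are assumptions.
        for (int page = 1; page <= 7; page++) {
            long distinct = Math.min(32768L, page * 20_000L);
            boolean keep = check.onPage(160_000L, 37_500L, distinct * 8);
            System.out.println("page " + page + ": keep=" + keep + " decided=" + check.isDecided());
        }
    }
}
```

In this model the decision is postponed through the first six pages (960,000 raw bytes) and made on the seventh, by which point the dictionary has stopped growing and the accumulated ratio favors keeping it. A threshold of 0 reproduces the current first-page behavior.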
