PARQUET-3479: Add configuration to disable early dictionary compression check #3556
yadavay-amzn wants to merge 1 commit into
@Fokko Could you take a look? This adds a config ( |
Thanks for looking into this! While the problem is real, introducing a boolean flag Instead of a new config, could we make the heuristic more adaptive? For example, we could delay the compression check until we've accumulated a certain amount of raw data (e.g., This would solve the issue out of the box without hurting usability. Thoughts? |
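For illustration, the adaptive idea in this comment could look roughly like the sketch below. It is only a sketch under stated assumptions: the class, method, and parameter names are hypothetical and not part of parquet-java, and the threshold is left to the caller because no concrete value is settled here.

```java
// Hypothetical sketch of a deferred compression check; not parquet-java API.
final class DeferredCompressionCheckSketch {
    // Assumed knob: minimum raw bytes to accumulate before judging the dictionary.
    private final long minRawBytesBeforeCheck;

    DeferredCompressionCheckSketch(long minRawBytesBeforeCheck) {
        this.minRawBytesBeforeCheck = minRawBytesBeforeCheck;
    }

    /**
     * Returns true when dictionary encoding should be abandoned.
     * The decision is deferred until enough raw data has been seen.
     */
    boolean shouldAbandonDictionary(long rawBytes, long encodedBytes, long dictionaryBytes) {
        if (rawBytes < minRawBytesBeforeCheck) {
            return false; // too little data to judge; keep dictionary encoding for now
        }
        // Abandon only if dictionary plus encoded ids are no smaller than the raw values.
        return encodedBytes + dictionaryBytes >= rawBytes;
    }
}
```

With a placeholder threshold of 1 MiB, chosen arbitrarily for this example, a first page holding 160,000 raw bytes would not be judged at all, so a column whose dictionary only pays off later keeps its dictionary encoding.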
Problem
`FallbackValuesWriter` calls `isCompressionSatisfying()` after the first page to decide whether dictionary encoding is worthwhile. With modern page-index defaults (~20k rows per page), this check fires too early for moderate-cardinality columns: dictionary encoding gets abandoned before enough data has accumulated to show its benefit, resulting in significantly larger files.

As reported in #3479, a column with 1M int64 values mod 32768 produces 8.4MB with the premature fallback vs 2.2MB when dictionary encoding is preserved.
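The arithmetic behind those numbers can be approximated with a small model. This is a simplified sketch, not the parquet-java implementation: it assumes the check compares dictionary size plus bit-packed ids against the raw size, that the column is the sequence `i % 32768` (so the first 20,000 values are all distinct), and it ignores page headers and the RLE/bit-packing hybrid details. All names are hypothetical.

```java
// Simplified model of the first-page check; names and formula are assumptions.
final class EarlyDictionaryCheckModel {
    static final int BYTES_PER_INT64 = 8;

    /** Bits needed to store dictionary ids in the range [0, distinctValues). */
    static int bitWidth(long distinctValues) {
        return distinctValues <= 1 ? 0 : 64 - Long.numberOfLeadingZeros(distinctValues - 1);
    }

    static long rawBytes(long valueCount) {
        return valueCount * BYTES_PER_INT64;
    }

    static long dictionaryBytes(long distinctValues) {
        return distinctValues * BYTES_PER_INT64;
    }

    /** Size of the bit-packed dictionary ids, rounded up to whole bytes. */
    static long encodedBytes(long valueCount, long distinctValues) {
        return (valueCount * bitWidth(distinctValues) + 7) / 8;
    }

    /** Assumed form of the check: dictionary plus ids must beat the raw size. */
    static boolean isCompressionSatisfying(long valueCount, long distinctValues) {
        return encodedBytes(valueCount, distinctValues) + dictionaryBytes(distinctValues)
                < rawBytes(valueCount);
    }

    public static void main(String[] args) {
        // First page: 20,000 values, all distinct so far.
        System.out.println(isCompressionSatisfying(20_000, 20_000));
        // Whole column: 1,000,000 values, 32,768 distinct.
        System.out.println(isCompressionSatisfying(1_000_000, 32_768));
    }
}
```

Under these assumptions the first page needs 160,000 bytes of dictionary plus 37,500 bytes of ids against 160,000 raw bytes, so the check fails. Over the whole column the same model gives about 2.1MB against 8MB raw, which is in line with the reported 2.2MB vs 8.4MB.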
Fix
Add a configurable property `ParquetProperties.isDictionaryEarlyCheckEnabled()` (default: `true` for backward compatibility) that controls whether the first-page compression check is performed in `FallbackValuesWriter.getBytes()`.

When disabled, dictionary encoding is only abandoned when the dictionary itself exceeds size limits (`shouldFallBack()`), not based on the first-page compression ratio.

Changes
- `ParquetProperties`: added `dictionaryEarlyCheckEnabled` field, getter, and builder method
- `FallbackValuesWriter`: added overloaded `of()` factory and constructor accepting the flag; guarded the `isCompressionSatisfying` call
- `DefaultValuesWriterFactory`: passes the config through to `FallbackValuesWriter.of()`
- `TestFallbackValuesWriter`: verifies dictionary encoding is preserved when the check is disabled

Testing
- `parquet-column` tests unaffected (default `true` preserves existing behavior)
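The guarded decision described in this PR can be sketched as a small state machine. This is a stand-alone illustration, not the code in the diff: the class and method names are hypothetical, and the two boolean inputs stand in for the results of `isCompressionSatisfying` and `shouldFallBack()`.

```java
// Hypothetical model of the fallback decision; not the actual FallbackValuesWriter.
final class FallbackDecisionSketch {
    private final boolean earlyCheckEnabled;
    private boolean firstPage = true;
    private boolean fellBack = false;

    FallbackDecisionSketch(boolean earlyCheckEnabled) {
        this.earlyCheckEnabled = earlyCheckEnabled;
    }

    /**
     * Called once per page flush. Returns true if the writer has switched
     * to the fallback encoding, either now or on an earlier page.
     */
    boolean onPageFlush(boolean compressionSatisfying, boolean dictionaryTooLarge) {
        if (!fellBack) {
            // The compression-ratio check applies to the first page only, and only when enabled.
            boolean failsEarlyCheck = earlyCheckEnabled && firstPage && !compressionSatisfying;
            if (failsEarlyCheck || dictionaryTooLarge) {
                fellBack = true;
            }
        }
        firstPage = false;
        return fellBack;
    }
}
```

With the flag set to `false`, a poor first-page ratio no longer triggers the fallback, while an oversized dictionary still does. With the default `true`, the model behaves like the existing first-page check.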