Skip to content

fix: Improve consistency of per-column stats on FilterExec output#22718

Open
neilconway wants to merge 1 commit into
apache:mainfrom
neilconway:neilc/fix-stats-filter-caps-null-byte-size
Open

fix: Improve consistency of per-column stats on FilterExec output#22718
neilconway wants to merge 1 commit into
apache:mainfrom
neilconway:neilc/fix-stats-filter-caps-null-byte-size

Conversation

@neilconway
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

#21081 capped the NDV at the row count when computing statistics for several operators. This PR extends that work and ensures that per-column statistics for filter operators are consistent with the estimated output row count. In particular:

  • Null count is also capped at the row count
  • Byte size is scaled down by the estimated selectivity

We also extend the analysis to consider null-rejecting predicates; for example, the clause a = 10 as a top-level conjunct implies that the null-count of the surviving rows is exactly 0.

What changes are included in this PR?

  • Ensure per-column statistics (null count, byte size) are consistent with filtered row count
  • Check for null-rejecting predicates to estimate a more accurate null count of 0
  • Update SLT expected plans
  • Add unit tests for new behavior
  • Various refactoring and comment improvements

Are these changes tested?

Yes; new tests added.

Are there any user-facing changes?

No.

@github-actions github-actions Bot added sqllogictest SQL Logic Tests (.slt) physical-plan Changes to the physical-plan crate labels Jun 2, 2026
@neilconway
Copy link
Copy Markdown
Contributor Author

cc @asolimando @gene-bordegaray @xudong963

Copy link
Copy Markdown
Member

@asolimando asolimando left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @neilconway for the PR, I left a few comments but nothing major/blocking

///
/// Equality predicates (`col = literal`) set NDV to `Exact(1)`, or
/// `Exact(0)` when the predicate is contradictory (e.g. `a = 1 AND a = 2`).
/// (either default or estimated) to input statistics.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function is quite interesting and IMO the docstring could summarize a little more what it does, especially as this is public

/// input, rows where that column is NULL cannot survive.
///
/// This analysis is conservative; for example, OR clauses are not considered
/// null-rejecting; neither are indirect operands like `a + 1 < 10`.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this is incomplete and it's fine, but we might add support at least for IS NOT NULL as it should be both easy and what people would intuitively expect. wdyt?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, that's pretty easy to add.

It's a shame that there is extract_null_rejecting_columns in the logical planner, but we can't easily reuse it...

) -> Precision<usize> {
match filtered_num_rows {
Precision::Exact(0) | Precision::Inexact(0) => filtered_num_rows,
_ => Precision::Exact(1),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: what about returning Inexact(1) when matching Absent?

/// filtered row estimate, since a column cannot have more nulls or distinct
/// values than it has rows. Known counts are demoted to inexact because the
/// filtered row count is itself an estimate.
fn cap_at_rows(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a must-have, but you might be interested in checking what is implemented in Apache Calcite for the same case: https://github.com/apache/calcite/blob/main/core/src/main/java/org/apache/calcite/rel/metadata/RelMdUtil.java#L317

This is a more refined estimator than the proposed one, but we can postpone to a follow-up PR (this was on my radar already, I will get to it at some point).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool -- yeah, I agree we could do better here. Makes sense to defer to a followup PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-plan Changes to the physical-plan crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Filter stats should ensure column-level stats are consistent with filter

2 participants