Skip to content

Align grouped first_value/last_value FILTER handling with shared Some(true) semantics#22681

Open
kosiew wants to merge 4 commits into
apache:mainfrom
kosiew:filtering-logic-01-22665
Open

Align grouped first_value/last_value FILTER handling with shared Some(true) semantics#22681
kosiew wants to merge 4 commits into
apache:mainfrom
kosiew:filtering-logic-01-22665

Conversation

@kosiew
Copy link
Copy Markdown
Contributor

@kosiew kosiew commented Jun 1, 2026

Which issue does this PR close?

Rationale for this change

Grouped first_value / last_value used a local filter check based on BooleanArray::value(idx). This reads only the value bit and does not enforce aggregate FILTER semantics, where a row should pass only when the predicate is Some(true).

DataFusion already has a shared helper, filter_to_validity, that converts a nullable boolean filter into a validity bitmap implementing the correct semantics (Some(true) passes; Some(false) and None do not). Other grouped aggregate helper paths already use this helper.

This change aligns grouped first_value / last_value with the existing shared implementation and avoids semantic drift in nullable FILTER handling.

What changes are included in this PR?

  • Make filter_to_validity publicly accessible from datafusion-functions-aggregate-common.

  • Update FirstLastGroupsAccumulator::get_filtered_extreme_of_each_group to:

    • precompute filter validity once per batch using filter_to_validity
    • evaluate aggregate FILTER pass/fail decisions from the resulting validity bitmap rather than directly reading BooleanArray::value(idx)
  • Minor cleanup in the grouped accumulator loop by destructuring group_idx directly.

  • Add targeted grouped accumulator regression tests covering nullable filters where the underlying value bit is true but the filter validity is NULL:

    • grouped first_value
    • grouped last_value
    • grouped merge_batch
  • Add SQLLogicTest coverage for grouped first_value and last_value with nullable FILTER predicates, including:

    • groups where all rows are rejected and the result is NULL
    • groups where only rows with TRUE predicates participate in the aggregate

Are these changes tested?

Yes.

Added Rust unit tests in datafusion/functions-aggregate/src/first_last.rs:

  • test_first_group_acc_rejects_null_filter_with_true_value_bit
  • test_last_group_acc_rejects_null_filter_with_true_value_bit
  • test_first_group_acc_merge_rejects_null_filter_with_true_value_bit

Added SQLLogicTest coverage in:

  • datafusion/sqllogictest/test_files/aggregate.slt

Validation performed:

cargo test -p datafusion-functions-aggregate first_last --lib
cargo test -p datafusion-functions-aggregate-common test_accumulate_indices_with_null_filter --lib
cargo test -p datafusion-sqllogictest --test sqllogictests -- aggregate.slt

Are there any user-facing changes?

Yes.

This fixes aggregate FILTER behavior for grouped first_value and last_value when nullable filter predicates are involved. Rows with a NULL filter predicate are now consistently treated as filtered out, matching aggregate FILTER semantics and the behavior already implemented by the shared helper.

No SQL syntax or public user-facing APIs are changed. The only API change is that filter_to_validity is made public within the aggregate-common crate to enable reuse by grouped aggregate implementations.

LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

kosiew added 3 commits June 1, 2026 11:44
…on functionality

- Updated `filter_to_validity` to public in `groups_accumulator/nulls.rs`.
- Modified `first_last.rs` to use shared `filter_to_validity` for grouped first/last.
- Enhanced handling to reject NULL filter rows even if the value bit is true.
- Added tests for:
- First update path
- Last update path
- First merge path
- Destructured group_indices loop by removing redundant variable assignment.
- Added test helper: new_int64_first_last_group_acc(...) for improved test setup.
- Added test helper: nullable_bool_filter(...) to facilitate testing with nullable booleans.
- Replaced repeated test setup across tests with the newly created helpers for better maintainability.
- Changed valid to validity in nullable_bool_filter.
- Added assert_group_acc_int64_result function.
- Replaced repeated evaluate/downcast/assert blocks in three tests for better maintainability.
@github-actions github-actions Bot added the functions Changes to functions implementation label Jun 1, 2026
@kosiew kosiew marked this pull request as ready for review June 1, 2026 04:45
@Nagato-Yuzuru
Copy link
Copy Markdown

The same bug in #22666. Looks like this PR covers that one too. Might be worth a sqllogictest on top since the repro was SQL. Here's mine if you want it (without the fix, g=1 returns 10 instead of NULL):

# Grouped first_value/last_value must apply aggregate FILTER with Some(true)
# semantics: a row passes only when the predicate is TRUE. Rows where the
# predicate evaluates to NULL or FALSE must be excluded.
#
# Rows per group (predicate is b < 1):
#   g=1: (a=10, b=NULL -> NULL), (a=20, b=2 -> FALSE)        => no rows pass
#   g=2: (a=30, b=0 -> TRUE), (a=40, b=NULL -> NULL),
#        (a=50, b=-5 -> TRUE)                                => a=30 and a=50 pass
#   g=3: (a=60, b=NULL -> NULL)                              => no rows pass
statement ok
CREATE TABLE first_last_filter_null_tests(g INT, a INT, b INT) AS VALUES
(1, 10, CAST(NULL AS INT)),
(1, 20, 2),
(2, 30, 0),
(2, 40, CAST(NULL AS INT)),
(2, 50, -5),
(3, 60, CAST(NULL AS INT));

# Groups 1 and 3 have no rows passing the filter -> NULL.
# Group 2 has a=30 and a=50 passing -> first_value ORDER BY a = 30.
query II
SELECT g, first_value(a ORDER BY a) FILTER (WHERE b < 1) AS fv
FROM first_last_filter_null_tests
GROUP BY g
ORDER BY g;
----
1 NULL
2 30
3 NULL

# Same groups via last_value: group 2 picks the largest passing a = 50.
query II
SELECT g, last_value(a ORDER BY a) FILTER (WHERE b < 1) AS lv
FROM first_last_filter_null_tests
GROUP BY g
ORDER BY g;
----
1 NULL
2 50
3 NULL

statement ok
DROP TABLE first_last_filter_null_tests;

@github-actions github-actions Bot added the sqllogictest SQL Logic Tests (.slt) label Jun 2, 2026
@kosiew
Copy link
Copy Markdown
Contributor Author

kosiew commented Jun 2, 2026

@Nagato-Yuzuru
Thanks, I added your slt.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants