Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify and clean up data skipping logic a bit #415

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

scovich
Copy link
Collaborator

@scovich scovich commented Oct 22, 2024

Several cleanups to the data skipping logic (partly inspired by recently added parquet stats skipping logic):

  • Reorder Expression match arms to match declaration order
  • Add an inverted flag to the data skipping expression logic, which allows to more easily push NOT down
    • Remove as_inverted_data_skipping_predicate (no longer needed)
    • Collapse the four {tight, wide} x {null, not null} helper methods into just tight and wide helpers
  • Add support for boolean literals and columns
    • They don't necessarily have good skipping power, but defaulting them to NULL can disable skipping
    • i.e. FALSE OR <expr> and <col> OR <expr> no longer blocks data skipping on <expr>
  • No longer create new expressions that need to be passed to as_data_skipping_predicate (e.g. Equal comparison)
  • Make NotEqual data skipping logic a bit easier to grok

@github-actions github-actions bot added the breaking-change Change that will require a version bump label Oct 22, 2024
Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks! One small nit, but lgtm

@@ -319,7 +319,8 @@ fn evaluate_expression(
Equal => |l, r| eq(l, r).map(wrap_comparison_result),
NotEqual => |l, r| neq(l, r).map(wrap_comparison_result),
Distinct => |l, r| distinct(l, r).map(wrap_comparison_result),
_ => return Err(Error::generic("Invalid expression given")),
// NOTE: [Not]In was already covered above
In | NotIn => return Err(Error::generic("Invalid expression given")),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This isn't really an invalid operation, it's just that we expected to have matched above, so maybe:

Suggested change
In | NotIn => return Err(Error::generic("Invalid expression given")),
In | NotIn => return Err(Error::internal("Encountered In/NotIn in general BinaryOperation match")),

Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think there is one issue.

Literal(Scalar::Boolean(value)) => Some(Expr::literal(*value != inverted)),
Literal(_) => None, // Non-boolean literals unsupported
// The expression <col> is equivalent to <col> != FALSE. Resembling NotEqual case below.
Column(col) => Some(Expr::or(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This causes an error if the column isn't a boolean column, even if it could be coerced to a boolean.

So a query that might work:

SELECT * FROM foo WHERE int_col

will cause kernel to error.

Not sure what the best course of action is wrt. to doing coercion in our expression eval, checking if the col is bool here, or just giving up on col expressions for now.

This is actually why the test for expression:

            Expression::column("number").and(empty_struct.clone().is_null()),

is failing.

Thoughts?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expression framework docs are very clear that kernel always expects exact type matches, and that engine is responsible to apply any casting or coercion that should happen. Granted, we don't currently provide a cast expression, so that could be difficult to do in practice.

I guess the real question is -- is kernel responsible to prevent/swallow all query errors that might result from a data skipping predicate that was derived from a syntactically incorrect WHERE clause the engine gave it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we would expect that if the engine sees the user has written WHERE int_col it would rewrite that as WHERE (cast int_col as boolean) and then give that to kernel (which as you mention is hard given we don't have a cast)?

@scovich scovich added the merge hold Don't allow the PR to merge label Oct 23, 2024
@scovich
Copy link
Collaborator Author

scovich commented Oct 23, 2024

Potentially superseded by #420

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-change Change that will require a version bump merge hold Don't allow the PR to merge
Development

Successfully merging this pull request may close these issues.

2 participants