-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplify and clean up data skipping logic a bit #415
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, thanks! One small nit, but lgtm
@@ -319,7 +319,8 @@ fn evaluate_expression( | |||
Equal => |l, r| eq(l, r).map(wrap_comparison_result), | |||
NotEqual => |l, r| neq(l, r).map(wrap_comparison_result), | |||
Distinct => |l, r| distinct(l, r).map(wrap_comparison_result), | |||
_ => return Err(Error::generic("Invalid expression given")), | |||
// NOTE: [Not]In was already covered above | |||
In | NotIn => return Err(Error::generic("Invalid expression given")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: This isn't really an invalid operation, it's just that we expected to have matched above, so maybe:
In | NotIn => return Err(Error::generic("Invalid expression given")), | |
In | NotIn => return Err(Error::internal("Encountered In/NotIn in general BinaryOperation match")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I think there is one issue.
Literal(Scalar::Boolean(value)) => Some(Expr::literal(*value != inverted)), | ||
Literal(_) => None, // Non-boolean literals unsupported | ||
// The expression <col> is equivalent to <col> != FALSE. Resembling NotEqual case below. | ||
Column(col) => Some(Expr::or( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This causes an error if the column isn't a boolean column, even if it could be coerced to a boolean.
So a query that might work:
SELECT * FROM foo WHERE int_col
will cause kernel to error.
Not sure what the best course of action is wrt. to doing coercion in our expression eval, checking if the col is bool here, or just giving up on col expressions for now.
This is actually why the test for expression:
Expression::column("number").and(empty_struct.clone().is_null()),
is failing.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Expression framework docs are very clear that kernel always expects exact type matches, and that engine is responsible to apply any casting or coercion that should happen. Granted, we don't currently provide a cast expression, so that could be difficult to do in practice.
I guess the real question is -- is kernel responsible to prevent/swallow all query errors that might result from a data skipping predicate that was derived from a syntactically incorrect WHERE clause the engine gave it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we would expect that if the engine sees the user has written WHERE int_col
it would rewrite that as WHERE (cast int_col as boolean)
and then give that to kernel (which as you mention is hard given we don't have a cast)?
Potentially superseded by #420 |
Several cleanups to the data skipping logic (partly inspired by recently added parquet stats skipping logic):
Expression
match arms to match declaration orderinverted
flag to the data skipping expression logic, which allows to more easily push NOT downas_inverted_data_skipping_predicate
(no longer needed)FALSE OR <expr>
and<col> OR <expr>
no longer blocks data skipping on<expr>
as_data_skipping_predicate
(e.g.Equal
comparison)NotEqual
data skipping logic a bit easier to grok