-
Notifications
You must be signed in to change notification settings - Fork 651
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEAT-#4605: Add native query compiler #7259
Conversation
f80e353
to
41bab97
Compare
modin/experimental/core/storage_formats/pandas/small_query_compiler.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great start on solving this problem! Is it possible to avoid so many of the test changes?
modin/config/envvars.py
Outdated
@@ -851,4 +851,11 @@ def _check_vars() -> None: | |||
) | |||
|
|||
|
|||
class UsePlainPandasQueryCompiler(EnvironmentVariable, type=bool): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This name is probably a little confusing for users. I suggest something like SmallDataframeMode
. This can be set to None
by default, and users can set it to "pandas"
or some other option in the future (we may have some other single node options coming).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@devin-petersohn, do you think VanillaPandasMode
is a good option? Also, why do you think we should make this config of string
type to have choices None
/pandas
/etc.
? Wouldn't it be sufficient to have this config boolean - enable/disable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the future we may add polars
mode. If this happens, we might also want to have an option for that. Making it a string keeps it open to other options. If we have pandas in the name, we can only use that mode for pandas execution. I'm open to other names, but I think we don't want to keep adding more and more configs if we have more options later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this sound like we may have multiple storage formats for a single execution? Do we really want to support this in future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potentially, yes I think this is something we could support in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@devin-petersohn, do you think we could support automatic initialization with small qc depending on a data size threshold in future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I propose to rename UsePlainPandasQueryCompiler
to NativeDataframeMode
and SmallQueryCompiler
to NativeQueryCompiler
by sort of analogy with HdkOnNative
we had previously.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At a minimum, a more complete definition of this class in the docstring is required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will update the name to UsePlainPandasQueryCompiler to NativeDataframeMode and SmallQueryCompiler to NativeQueryCompiler.
The most changes in tests are disabling few checks as it wont be supported without partitions, and as the current changes dont yet support IO like pd.read_csv(), Is there something specific that should be avoided? |
Nothing specific, I was just trying to understand context. Thanks! |
e6b035f
to
d406414
Compare
Co-authored-by: Iaroslav Igoshev <Poolliver868@mail.ru> Signed-off-by: arunjose696 <arunjose696@gmail.com>
Co-authored-by: Iaroslav Igoshev <Poolliver868@mail.ru>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arunjose696, please fix ci-required / lint (pydocstyle) (pull_request)
job. Other than that, LGTM!
3da8c6e
to
17b12aa
Compare
Signed-off-by: arunjose696 <arunjose696@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
What do these changes do?
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
docs/development/architecture.rst
is up-to-date