Native Polars support #258
Conversation
Should be ready for review now!
Adding in some tweaks to help with cleaning up the code base using an ABC. Ref: #260
First round of review. At a high level I think the code is good (but a lot to go through at once). As we discussed offline, I think there's an opportunity to use Polars' lazy query optimization throughout but it might be easier for me to PR into the branch because it would be a lot of little changes throughout.
values don't match.
"""
compare: pl.Series
try:
For this function, should we check the type first before trying the matching logic? E.g., something like this check to see if the first column is numeric?
Good call. I might tweak that a bit. I just took the existing pandas version and ported it.
@ak-gupta something like this?

if col_1.is_numeric() and col_2.is_numeric():
    compare = pl.Series(
        np.isclose(col_1, col_2, rtol=rel_tol, atol=abs_tol, equal_nan=True)
    )
else:
    try:
Yeah! I think doing those checks might make the code a bit more efficient. Similarly, we could use the is_temporal check for datetime columns as well.
Maybe something like

if col_1.dtype.is_numeric() and col_2.dtype.is_numeric():
    # Numeric check with float cast
    return ...

either_temporal: bool = col_1.dtype.is_temporal() or col_2.dtype.is_temporal()
for col in [col_1, col_2]:
    if str(col.dtype) in STRING_TYPE:
        # Ignore case and spaces
        ...
    if either_temporal:
        # Cast to datetime
        ...

if either_temporal:
    # Datetime comparison
    ...
else:
    # eq_missing comparison
    ...

Given that your current logic is a lift-and-shift implementation from Pandas, we can mark this update as optional.
compare = pl.Series(
    np.isclose(col_1, col_2, rtol=rel_tol, atol=abs_tol, equal_nan=True)
)
Should we use Polars logic exclusively for a function like this? Maybe something along the lines of

(
    df
    .lazy()
    .with_columns(
        [
            pl.col(col_1).is_null().alias("__DATACOMPY_NULL_COL_1"),
            pl.col(col_2).is_null().alias("__DATACOMPY_NULL_COL_2"),
            (pl.col(col_1) - pl.col(col_2)).abs().alias("__DATACOMPY_ABS_DIFF"),
            # np.isclose semantics: |a - b| <= atol + rtol * |b|
            (abs_tol + rel_tol * pl.col(col_2).abs()).alias("__DATACOMPY_TOL"),
        ]
    )
    .select(
        [
            (
                pl.when(
                    pl.col("__DATACOMPY_NULL_COL_1") & pl.col("__DATACOMPY_NULL_COL_2")
                )
                .then(True)
                .when(
                    pl.col("__DATACOMPY_NULL_COL_1") != pl.col("__DATACOMPY_NULL_COL_2")
                )
                .then(False)
                .when(pl.col("__DATACOMPY_ABS_DIFF") <= pl.col("__DATACOMPY_TOL"))
                .then(True)
                .otherwise(False)
            ).alias("__DATACOMPY_COMPARE")
        ]
    )
    .collect()
    .to_series()
)
I did some very rough benchmarking of this function against what's in the PR already. At 100 000 rows I get
Polars Execution time (in s): 0.003081 +/- 0.000107
Numpy Execution time (in s): 0.005354 +/- 0.000219
at 1 000 000,
Polars Execution time (in s): 0.022880 +/- 0.000342
Numpy Execution time (in s): 0.048174 +/- 0.001574
Of course my logic would need to be further tested -- I made sure to assert the outputs at the end of the benchmarking but it's very much POC code.
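For reference, rough mean-plus-or-minus-stdev timings like those quoted above can be produced with a small timeit harness along these lines (the helper name is illustrative; this is not the script used for the numbers quoted):

```python
import statistics
import timeit


def bench(fn, repeat: int = 5, number: int = 10) -> tuple[float, float]:
    # Run fn repeat batches of number calls each; report the mean and
    # standard deviation of the per-call time in seconds
    per_call = [t / number for t in timeit.repeat(fn, repeat=repeat, number=number)]
    return statistics.mean(per_call), statistics.stdev(per_call)
```

Asserting that both implementations return the same result at the end of each run, as described above, keeps the benchmark honest.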
re: the next-step optimization
Let's tackle this point in a separate PR. Honestly, contributing an isclose-style expression to Polars might be a better idea anyway.
See this issue in the Polars repository
Feel free to do that, if it is easier!
Just noting an offline discussion with @ak-gupta that this PR is getting a bit big, so it might make sense to iterate on the Polars performance in quick follow-up PRs, as this is still experimental.
Marked an additional optional change. Based on my review I don't see functional problems so I think we can move ahead with the "merge first, optimize second" approach.
Yup, I'm aligned with the merge-and-then-optimize plan. It would be good to just have a starting base to work off. Going to merge.
* adding in first commit of Polars port
* small fixes to reporting
* renaming from core_polars to polars
* updating usage docs for polars
* updating docs and some clean up of README
* adding in pytest.importorskip
* adding in pytest.importorskip
* fixing imports
* fixing mypy and minor bugs
* fixing polars test bugs
* adding in abc class
* using base and cleanup
Native Polars Support
This PR adds native Polars support to datacompy. It is more or less a direct port of the Pandas version using Polars. There are some nuances (like the absence of an index), but for the most part there is a lot of parity between the two. This complements, and is not the same as, the Fugue version, which runs the underlying logic using Pandas when passed a Polars DataFrame.

Some performance metrics to consider (taken from the issue below):
Some initial benchmarking with Polars vs Pandas.
The 100M run caused the kernel to die, so it is omitted from the testing.
Closes #257