
Native Polars support #258

Merged: 12 commits merged into develop on Jan 18, 2024

Conversation

@fdosani commented Jan 12, 2024

Native Polars Support

This PR adds native Polars support to datacompy. It is more or less a direct port of the Pandas version, but using Polars. There are some nuances (such as the lack of an index), but for the most part there is a lot of parity between the two. This complements, and is not the same as, the Fugue version, which runs the underlying logic using Pandas when passed a Polars DataFrame. A minimal usage sketch follows the notes below.

  • Also added some docs and did a general clean-up of the README, which was a bit long and redundant with the detailed docs.
  • Bumped the version to 0.11.0.
  • Note that this new functionality is experimental and might change in the future.
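For orientation, here is a minimal usage sketch of the new Polars path. The class name PolarsCompare and the exact keyword arguments are assumptions based on the pandas Compare API described in the docs, not verbatim from this diff.

import polars as pl

import datacompy

# Hypothetical example data; joins are on columns since Polars has no index
df1 = pl.DataFrame({"acct_id": [1, 2, 3], "amount": [10.00, 20.00, 30.00]})
df2 = pl.DataFrame({"acct_id": [1, 2, 4], "amount": [10.00, 20.05, 40.00]})

# Assumed class name: PolarsCompare (mirroring the pandas Compare signature)
compare = datacompy.PolarsCompare(
    df1,
    df2,
    join_columns="acct_id",  # column(s) to join on
    abs_tol=0.01,            # absolute tolerance for numeric comparisons
    rel_tol=0,               # relative tolerance for numeric comparisons
)
print(compare.report())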

Some performance metrics to consider, taken from the linked issue: initial benchmarking of Polars vs. Pandas on the following hardware (a sketch of a pytest-benchmark harness in this style appears after the two tables).

  • 16 CPUs
  • 64 GB RAM

Polars

----------------------------------------------------------------------------------------------- benchmark: 5 tests ----------------------------------------------------------------------------------------------
Name (time in ms)              Min                     Max                    Mean                StdDev                  Median                   IQR            Outliers      OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_1000                  22.0263 (1.0)           40.8150 (1.0)           26.6252 (1.0)          4.4042 (2.00)          25.4531 (1.0)          5.5076 (1.81)          2;1  37.5583 (1.0)          20           1
test_100_000              125.3079 (5.69)         132.8317 (3.25)         128.6364 (4.83)         2.2016 (1.0)          128.5384 (5.05)         3.0454 (1.0)           8;0   7.7738 (0.21)         20           1
test_10_000_000        13,111.6890 (595.28)    13,571.3354 (332.51)    13,313.8029 (500.04)     143.4687 (65.16)     13,288.9818 (522.10)     230.0904 (75.55)         8;0   0.0751 (0.00)         20           1
test_50_000_000        73,678.5405 (>1000.0)   83,990.9534 (>1000.0)   81,052.3603 (>1000.0)  2,857.3016 (>1000.0)   82,544.7946 (>1000.0)  3,770.1957 (>1000.0)       5;1   0.0123 (0.00)         20           1
test_100_000_000      154,867.3404 (>1000.0)  172,853.2556 (>1000.0)  166,977.0828 (>1000.0)  6,251.3802 (>1000.0)  170,382.6524 (>1000.0)  8,664.7339 (>1000.0)       5;0   0.0060 (0.00)         20           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Pandas

The 100M run caused the kernel to die, so it is omitted from the testing.

--------------------------------------------------------------------------------------------- benchmark: 4 tests --------------------------------------------------------------------------------------------
Name (time in ms)              Min                     Max                    Mean              StdDev                  Median                 IQR            Outliers      OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_1000                  34.0919 (1.0)           89.0380 (1.0)           39.3168 (1.0)       13.3184 (1.10)          35.2189 (1.0)        1.2976 (1.0)           2;2  25.4344 (1.0)          20           1
test_100_000              239.1103 (7.01)         282.2044 (3.17)         247.6475 (6.30)      12.1546 (1.0)          243.1044 (6.90)       7.0099 (5.40)          3;3   4.0380 (0.16)         20           1
test_10_000_000        25,839.3843 (757.93)    26,444.5088 (297.00)    26,066.9272 (663.00)   191.5533 (15.76)     25,998.1581 (738.19)   315.2639 (242.96)        7;0   0.0384 (0.00)         20           1
test_50_000_000       135,741.7186 (>1000.0)  138,597.8106 (>1000.0)  136,727.2552 (>1000.0)  775.6511 (63.82)    136,540.1143 (>1000.0)  752.4071 (579.85)        4;2   0.0073 (0.00)         20           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
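For context, output in the format above is what pytest-benchmark produces; a harness for one of the rows might look roughly like the following. This is a hypothetical sketch (data layout, perturbation, and the use of the pandas Compare class are assumptions), not the script used for the numbers above.

import numpy as np
import pandas as pd
import datacompy

def _make_frames(n: int):
    # Build two frames that differ slightly in roughly 1% of rows
    rng = np.random.default_rng(42)
    df1 = pd.DataFrame({"id": np.arange(n), "value": rng.random(n)})
    df2 = df1.copy()
    bump = df2.sample(frac=0.01, random_state=0).index
    df2.loc[bump, "value"] += 1.0
    return df1, df2

def test_100_000(benchmark):
    df1, df2 = _make_frames(100_000)
    # pytest-benchmark runs the callable repeatedly and reports min/max/mean/stddev
    benchmark(lambda: datacompy.Compare(df1, df2, join_columns="id").report())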

Closes #257

@fdosani added the "enhancement" (New feature or request) label on Jan 12, 2024
@fdosani commented Jan 13, 2024

Should be ready for review now!

@fdosani commented Jan 15, 2024

Adding in some tweaks to help with cleaning up the code base using an ABC. Ref: #260

@ak-gupta left a comment


First round of review. At a high level I think the code is good (though it is a lot to go through at once). As we discussed offline, I think there's an opportunity to use Polars' lazy query optimization throughout, but it might be easier for me to PR into the branch because it would be a lot of little changes scattered around.

values don't match.
"""
compare: pl.Series
try:


For this function, should we check the type first before trying the matching logic? I.e., something like this check to see if the first column is numeric?

Member Author


Good call. I might tweak that a bit. I just took the existing pandas version and ported it.

Member Author


@ak-gupta something like this?

    if col_1.is_numeric() and col_2.is_numeric():
        compare = pl.Series(
            np.isclose(col_1, col_2, rtol=rel_tol, atol=abs_tol, equal_nan=True)
        )
    else:
        try:


Yeah! I think doing those checks might make the code a bit more efficient. Similarly, we could use the is_temporal check for datetime columns as well.


Maybe something like

if col_1.dtype.is_numeric() and col_2.dtype.is_numeric():
    # Numeric check with float cast
    return ...

either_temporal: bool = col_1.dtype.is_temporal() or col_2.dtype.is_temporal()
for col in [col_1, col_2]:
    if str(col.dtype) in STRING_TYPE:
        # Ignore case and spaces
        if either_temporal:
            # Cast to datetime

if either_temporal:
    # Datetime comparison
else:
    # eq_missing comparison

Given that your current logic is a lift-and-shift implementation from Pandas, we can mark the update as optional.
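For illustration only, a fuller version of that skeleton might read as follows. STRING_TYPE is assumed to be the set of string dtype names already used in the module, and the datetime parsing is left to Polars' defaults; treat this as a sketch of the structure, not the final implementation.

import numpy as np
import polars as pl

STRING_TYPE = {"Utf8", "String"}  # assumed module-level constant

def columns_equal_sketch(
    col_1: pl.Series,
    col_2: pl.Series,
    rel_tol: float = 0,
    abs_tol: float = 0,
    ignore_spaces: bool = False,
    ignore_case: bool = False,
) -> pl.Series:
    # Numeric branch first: tolerance-aware comparison, NaN == NaN treated as a match
    if col_1.dtype.is_numeric() and col_2.dtype.is_numeric():
        return pl.Series(
            np.isclose(col_1, col_2, rtol=rel_tol, atol=abs_tol, equal_nan=True)
        )

    either_temporal = col_1.dtype.is_temporal() or col_2.dtype.is_temporal()

    # Normalise string columns (case/spaces), casting to datetime when the
    # other column is temporal so the two sides are comparable
    normalised = []
    for col in (col_1, col_2):
        if str(col.dtype) in STRING_TYPE:
            if ignore_spaces:
                col = col.str.strip_chars()
            if ignore_case:
                col = col.str.to_uppercase()
            if either_temporal:
                col = col.str.to_datetime(strict=False)
        normalised.append(col)
    col_1, col_2 = normalised

    # eq_missing treats null == null as True, matching the "both missing" case
    return col_1.eq_missing(col_2)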

Comment on lines +786 to +788
compare = pl.Series(
    np.isclose(col_1, col_2, rtol=rel_tol, atol=abs_tol, equal_nan=True)
)


Should we use Polars logic exclusively for a function like this? Maybe something along the lines of

(
        df
            .lazy()
            .with_columns(
                [
                    pl.col(col_1).is_null().alias("__DATACOMPY_NULL_COL_1"),
                    pl.col(col_2).is_null().alias("__DATACOMPY_NULL_COL_2"),
                    (pl.col(col_1) - pl.col(col_2)).abs().alias("__DATACOMPY_ABS_DIFF")
                ]
            )
            .select(
                [
                    (
                        pl.when(
                            (pl.col("__DATACOMPY_NULL_COL_1") == True) & (pl.col("__DATACOMPY_NULL_COL_2") == True)
                        )
                        .then(True)
                        .when(
                            pl.col("__DATACOMPY_NULL_COL_1") != pl.col("__DATACOMPY_NULL_COL_2")
                        )
                        .then(False)
                        .when(
                            (pl.col("__DATACOMPY_ABS_DIFF") <= rel_tol) & (pl.col("__DATACOMPY_ABS_DIFF") <= abs_tol)
                        )
                        .then(True)
                        .otherwise(False)
                    ).alias("__DATACOMPY_COMPARE")
                ]
            )
            .collect()
            .to_series()
    )

I did some very rough benchmarking of this function against what's in the PR already. At 100 000 rows I get

Polars Execution time (in s): 0.003081 +/- 0.000107
Numpy Execution time (in s): 0.005354 +/- 0.000219

at 1 000 000,

Polars Execution time (in s): 0.022880 +/- 0.000342
Numpy Execution time (in s): 0.048174 +/- 0.001574

Of course my logic would need to be further tested -- I made sure to assert the outputs at the end of the benchmarking but it's very much POC code.
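The mean +/- spread figures above could come from a simple repeat-and-measure loop; a hypothetical harness (not necessarily the one used here) might be:

import statistics
import time

def time_it(fn, rounds: int = 20):
    # Run fn repeatedly and report mean and standard deviation in seconds
    samples = []
    for _ in range(rounds):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)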


Re: the next-step optimization, let's tackle this point in a separate PR. Honestly, contributing an isclose-style expression to Polars might be a better idea anyway.
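For reference, an isclose-style helper built purely from Polars expressions could mirror numpy's rule |a - b| <= atol + rtol * |b|, with two nulls counting as a match. This is a sketch of what such a contribution might look like; it is not an existing Polars API, and the column names in the usage line are hypothetical.

import polars as pl

def isclose_expr(a: str, b: str, rel_tol: float = 1e-05, abs_tol: float = 1e-08) -> pl.Expr:
    lhs, rhs = pl.col(a), pl.col(b)
    # numpy-style tolerance check: |a - b| <= atol + rtol * |b|
    within_tol = (lhs - rhs).abs() <= (abs_tol + rel_tol * rhs.abs())
    return (
        pl.when(lhs.is_null() & rhs.is_null())
        .then(True)                      # both null -> equal
        .when(lhs.is_null() | rhs.is_null())
        .then(False)                     # only one null -> not equal
        .otherwise(within_tol)
        .alias("__DATACOMPY_COMPARE")
    )

# Usage (hypothetical columns):
# df.lazy().select(isclose_expr("col_1", "col_2", rel_tol=0, abs_tol=1e-6)).collect()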


See this issue in the Polars repository

@fdosani commented Jan 16, 2024

lazy query optimization throughout but it might be easier for me to PR into the branch because it would be a lot of little changes throughout.

Feel free to do that, if it is easier.

@fdosani commented Jan 17, 2024

Just noting an offline discussion with @ak-gupta: this PR is getting a bit big, so it might make sense to iterate on the Polars performance in quick follow-up PRs, since this is still experimental.

@ak-gupta left a comment


Marked an additional optional change. Based on my review I don't see functional problems so I think we can move ahead with the "merge first, optimize second" approach.


@fdosani commented Jan 18, 2024

Marked an additional optional change. Based on my review I don't see functional problems so I think we can move ahead with the "merge first, optimize second" approach.

Yup, I'm aligned with the merge-first, optimize-second approach. It would be good to just have a starting base to work off. Going to merge.

@fdosani fdosani merged commit 01d2d8f into develop Jan 18, 2024
24 checks passed
@fdosani fdosani deleted the core-polars branch February 2, 2024 15:06
@fdosani fdosani mentioned this pull request Feb 21, 2024
rhaffar pushed a commit to rhaffar/datacompy that referenced this pull request Sep 12, 2024
* adding in first commit of Polars port

* small fixes to reporting

* renaming from core_polars to polars

* updating usage docs for polars

* updating docs and some clean up of README

* adding in pytest.importorskip

* adding in pytest.importorskip

* fixing imports

* fixing mypy and minor bugs

* fixing polars test bugs

* adding in abc class

* using base and cleanup
Successfully merging this pull request may close these issues:
Look into porting Compare to a polars backend for performance testing.