
Native Polars support #258

Merged: 12 commits merged into develop on Jan 18, 2024

Conversation

@fdosani commented Jan 12, 2024

Native Polars Support

This PR adds native Polars support to datacompy. It is more or less a direct port of the Pandas version, but using Polars. There are some nuances (such as the lack of an index), but for the most part there is a lot of parity between the two. This complements, and is not the same as, the Fugue version, which runs the underlying logic using Pandas when passed a Polars DataFrame. A minimal usage sketch follows the notes below.

  • Also added some docs and did a general clean-up of the README, which was a bit long and redundant with the detailed docs.
  • Bumped the version to 0.11.0.
  • Note that this new functionality is experimental and might change in the future.
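For orientation, here is a minimal usage sketch of the new Polars path. The class name PolarsCompare and the exact keyword arguments are assumptions based on the pandas Compare API described in the docs, not verbatim from this diff.

import polars as pl

import datacompy

# Hypothetical example data; joins are on columns since Polars has no index
df1 = pl.DataFrame({"acct_id": [1, 2, 3], "amount": [10.00, 20.00, 30.00]})
df2 = pl.DataFrame({"acct_id": [1, 2, 4], "amount": [10.00, 20.05, 40.00]})

# Assumed class name: PolarsCompare (mirroring the pandas Compare signature)
compare = datacompy.PolarsCompare(
    df1,
    df2,
    join_columns="acct_id",  # column(s) to join on
    abs_tol=0.01,            # absolute tolerance for numeric comparisons
    rel_tol=0,               # relative tolerance for numeric comparisons
)
print(compare.report())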

Some performance metrics to consider, taken from the linked issue: initial benchmarking of Polars vs. Pandas on the following hardware (a sketch of a pytest-benchmark harness in this style appears after the two tables).

  • 16 CPUs
  • 64 GB RAM

Polars

----------------------------------------------------------------------------------------------- benchmark: 5 tests ----------------------------------------------------------------------------------------------
Name (time in ms)              Min                     Max                    Mean                StdDev                  Median                   IQR            Outliers      OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_1000                  22.0263 (1.0)           40.8150 (1.0)           26.6252 (1.0)          4.4042 (2.00)          25.4531 (1.0)          5.5076 (1.81)          2;1  37.5583 (1.0)          20           1
test_100_000              125.3079 (5.69)         132.8317 (3.25)         128.6364 (4.83)         2.2016 (1.0)          128.5384 (5.05)         3.0454 (1.0)           8;0   7.7738 (0.21)         20           1
test_10_000_000        13,111.6890 (595.28)    13,571.3354 (332.51)    13,313.8029 (500.04)     143.4687 (65.16)     13,288.9818 (522.10)     230.0904 (75.55)         8;0   0.0751 (0.00)         20           1
test_50_000_000        73,678.5405 (>1000.0)   83,990.9534 (>1000.0)   81,052.3603 (>1000.0)  2,857.3016 (>1000.0)   82,544.7946 (>1000.0)  3,770.1957 (>1000.0)       5;1   0.0123 (0.00)         20           1
test_100_000_000      154,867.3404 (>1000.0)  172,853.2556 (>1000.0)  166,977.0828 (>1000.0)  6,251.3802 (>1000.0)  170,382.6524 (>1000.0)  8,664.7339 (>1000.0)       5;0   0.0060 (0.00)         20           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Pandas

The 100M run caused the kernel to die, so it is omitted from the testing.

--------------------------------------------------------------------------------------------- benchmark: 4 tests --------------------------------------------------------------------------------------------
Name (time in ms)              Min                     Max                    Mean              StdDev                  Median                 IQR            Outliers      OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_1000                  34.0919 (1.0)           89.0380 (1.0)           39.3168 (1.0)       13.3184 (1.10)          35.2189 (1.0)        1.2976 (1.0)           2;2  25.4344 (1.0)          20           1
test_100_000              239.1103 (7.01)         282.2044 (3.17)         247.6475 (6.30)      12.1546 (1.0)          243.1044 (6.90)       7.0099 (5.40)          3;3   4.0380 (0.16)         20           1
test_10_000_000        25,839.3843 (757.93)    26,444.5088 (297.00)    26,066.9272 (663.00)   191.5533 (15.76)     25,998.1581 (738.19)   315.2639 (242.96)        7;0   0.0384 (0.00)         20           1
test_50_000_000       135,741.7186 (>1000.0)  138,597.8106 (>1000.0)  136,727.2552 (>1000.0)  775.6511 (63.82)    136,540.1143 (>1000.0)  752.4071 (579.85)        4;2   0.0073 (0.00)         20           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
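For context, output in the format above is what pytest-benchmark produces; a harness for one of the rows might look roughly like the following. This is a hypothetical sketch (data layout, perturbation, and the use of the pandas Compare class are assumptions), not the script used for the numbers above.

import numpy as np
import pandas as pd
import datacompy

def _make_frames(n: int):
    # Build two frames that differ slightly in roughly 1% of rows
    rng = np.random.default_rng(42)
    df1 = pd.DataFrame({"id": np.arange(n), "value": rng.random(n)})
    df2 = df1.copy()
    bump = df2.sample(frac=0.01, random_state=0).index
    df2.loc[bump, "value"] += 1.0
    return df1, df2

def test_100_000(benchmark):
    df1, df2 = _make_frames(100_000)
    # pytest-benchmark runs the callable repeatedly and reports min/max/mean/stddev
    benchmark(lambda: datacompy.Compare(df1, df2, join_columns="id").report())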

Closes #257

@fdosani added the "enhancement" (New feature or request) label on Jan 12, 2024
@fdosani commented Jan 13, 2024

Should be ready for review now!

@fdosani commented Jan 15, 2024

Adding in some tweaks to help with cleaning up the code base using an ABC. Ref: #260

@ak-gupta left a comment


First round of review. At a high level I think the code is good (though it is a lot to go through at once). As we discussed offline, I think there's an opportunity to use Polars' lazy query optimization throughout, but it might be easier for me to PR into the branch because it would be a lot of little changes scattered around.

values don't match.
"""
compare: pl.Series
try:


For this function, should we check the type first before trying the matching logic? I.e., something like this check to see if the first column is numeric?

Member Author


Good call. I might tweak that a bit. I just took the existing pandas version and ported it.

Member Author


@ak-gupta something like this?

    if col_1.is_numeric() and col_2.is_numeric():
        compare = pl.Series(
            np.isclose(col_1, col_2, rtol=rel_tol, atol=abs_tol, equal_nan=True)
        )
    else:
        try:


Yeah! I think doing those checks might make the code a bit more efficient. Similarly, we could use the is_temporal check for datetime columns as well.


Maybe something like

if col_1.dtype.is_numeric() and col_2.dtype.is_numeric():
    # Numeric check with float cast
    return ...

either_temporal: bool = col_1.dtype.is_temporal() or col_2.dtype.is_temporal()
for col in [col_1, col_2]:
    if str(col.dtype) in STRING_TYPE:
        # Ignore case and spaces
        if either_temporal:
            # Cast to datetime

if either_temporal:
    # Datetime comparison
else:
    # eq_missing comparison

Given that your current logic is a lift-and-shift implementation from Pandas, we can mark the update as optional.
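For illustration only, a fuller version of that skeleton might read as follows. STRING_TYPE is assumed to be the set of string dtype names already used in the module, and the datetime parsing is left to Polars' defaults; treat this as a sketch of the structure, not the final implementation.

import numpy as np
import polars as pl

STRING_TYPE = {"Utf8", "String"}  # assumed module-level constant

def columns_equal_sketch(
    col_1: pl.Series,
    col_2: pl.Series,
    rel_tol: float = 0,
    abs_tol: float = 0,
    ignore_spaces: bool = False,
    ignore_case: bool = False,
) -> pl.Series:
    # Numeric branch first: tolerance-aware comparison, NaN == NaN treated as a match
    if col_1.dtype.is_numeric() and col_2.dtype.is_numeric():
        return pl.Series(
            np.isclose(col_1, col_2, rtol=rel_tol, atol=abs_tol, equal_nan=True)
        )

    either_temporal = col_1.dtype.is_temporal() or col_2.dtype.is_temporal()

    # Normalise string columns (case/spaces), casting to datetime when the
    # other column is temporal so the two sides are comparable
    normalised = []
    for col in (col_1, col_2):
        if str(col.dtype) in STRING_TYPE:
            if ignore_spaces:
                col = col.str.strip_chars()
            if ignore_case:
                col = col.str.to_uppercase()
            if either_temporal:
                col = col.str.to_datetime(strict=False)
        normalised.append(col)
    col_1, col_2 = normalised

    # eq_missing treats null == null as True, matching the "both missing" case
    return col_1.eq_missing(col_2)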

Comment on lines +786 to +788
compare = pl.Series(
    np.isclose(col_1, col_2, rtol=rel_tol, atol=abs_tol, equal_nan=True)
)


Should we use Polars logic exclusively for a function like this? Maybe something along the lines of

(
        df
            .lazy()
            .with_columns(
                [
                    pl.col(col_1).is_null().alias("__DATACOMPY_NULL_COL_1"),
                    pl.col(col_2).is_null().alias("__DATACOMPY_NULL_COL_2"),
                    (pl.col(col_1) - pl.col(col_2)).abs().alias("__DATACOMPY_ABS_DIFF")
                ]
            )
            .select(
                [
                    (
                        pl.when(
                            (pl.col("__DATACOMPY_NULL_COL_1") == True) & (pl.col("__DATACOMPY_NULL_COL_2") == True)
                        )
                        .then(True)
                        .when(
                            pl.col("__DATACOMPY_NULL_COL_1") != pl.col("__DATACOMPY_NULL_COL_2")
                        )
                        .then(False)
                        .when(
                            (pl.col("__DATACOMPY_ABS_DIFF") <= rel_tol) & (pl.col("__DATACOMPY_ABS_DIFF") <= abs_tol)
                        )
                        .then(True)
                        .otherwise(False)
                    ).alias("__DATACOMPY_COMPARE")
                ]
            )
            .collect()
            .to_series()
    )

I did some very rough benchmarking of this function against what's in the PR already. At 100 000 rows I get

Polars Execution time (in s): 0.003081 +/- 0.000107
Numpy Execution time (in s): 0.005354 +/- 0.000219

at 1 000 000,

Polars Execution time (in s): 0.022880 +/- 0.000342
Numpy Execution time (in s): 0.048174 +/- 0.001574

Of course my logic would need to be further tested -- I made sure to assert the outputs at the end of the benchmarking but it's very much POC code.
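The mean +/- spread figures above could come from a simple repeat-and-measure loop; a hypothetical harness (not necessarily the one used here) might be:

import statistics
import time

def time_it(fn, rounds: int = 20):
    # Run fn repeatedly and report mean and standard deviation in seconds
    samples = []
    for _ in range(rounds):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)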


Re: the next-step optimization, let's tackle this point in a separate PR. Honestly, contributing an isclose-style expression to Polars might be a better idea anyway.
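For reference, an isclose-style helper built purely from Polars expressions could mirror numpy's rule |a - b| <= atol + rtol * |b|, with two nulls counting as a match. This is a sketch of what such a contribution might look like; it is not an existing Polars API, and the column names in the usage line are hypothetical.

import polars as pl

def isclose_expr(a: str, b: str, rel_tol: float = 1e-05, abs_tol: float = 1e-08) -> pl.Expr:
    lhs, rhs = pl.col(a), pl.col(b)
    # numpy-style tolerance check: |a - b| <= atol + rtol * |b|
    within_tol = (lhs - rhs).abs() <= (abs_tol + rel_tol * rhs.abs())
    return (
        pl.when(lhs.is_null() & rhs.is_null())
        .then(True)                      # both null -> equal
        .when(lhs.is_null() | rhs.is_null())
        .then(False)                     # only one null -> not equal
        .otherwise(within_tol)
        .alias("__DATACOMPY_COMPARE")
    )

# Usage (hypothetical columns):
# df.lazy().select(isclose_expr("col_1", "col_2", rel_tol=0, abs_tol=1e-6)).collect()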


See this issue in the Polars repository

@fdosani commented Jan 16, 2024

lazy query optimization throughout but it might be easier for me to PR into the branch because it would be a lot of little changes throughout.

Feel free to do that, if it is easier.

@fdosani commented Jan 17, 2024

Just noting an offline discussion with @ak-gupta: this PR is getting a bit big, so it might make sense to iterate on the Polars performance in quick follow-up PRs, since this is still experimental.

@ak-gupta left a comment


Marked an additional optional change. Based on my review I don't see functional problems so I think we can move ahead with the "merge first, optimize second" approach.


@fdosani commented Jan 18, 2024

Marked an additional optional change. Based on my review I don't see functional problems so I think we can move ahead with the "merge first, optimize second" approach.

Yup, I'm aligned with the merge-first, optimize-second approach. It would be good to just have a starting base to work off. Going to merge.

@fdosani fdosani merged commit 01d2d8f into develop Jan 18, 2024
24 checks passed
@fdosani fdosani deleted the core-polars branch February 2, 2024 15:06
@fdosani fdosani mentioned this pull request Feb 21, 2024
rhaffar pushed a commit to rhaffar/datacompy that referenced this pull request Sep 12, 2024
* adding in first commit of Polars port

* small fixes to reporting

* renaming from core_polars to polars

* updating usage docs for polars

* updating docs and some clean up of README

* adding in pytest.importorskip

* adding in pytest.importorskip

* fixing imports

* fixing mypy and minor bugs

* fixing polars test bugs

* adding in abc class

* using base and cleanup
Successfully merging this pull request may close these issues:
Look into porting Compare to a polars backend for performance testing.