Pandas on Spark refactor #275

Merged
merged 17 commits into from
Mar 25, 2024
13 changes: 7 additions & 6 deletions .github/workflows/test-package.yml
@@ -19,11 +19,10 @@ jobs:
strategy:
fail-fast: false
matrix:
python-version: [3.8, 3.9, '3.10', '3.11']
spark-version: [3.1.3, 3.2.4, 3.3.4, 3.4.2, 3.5.0]
python-version: [3.9, '3.10', '3.11']
Contributor

If we're not testing on py38, I think this would warrant a formal declaration of dropping support + changing supported versions in the pyproject.toml. Learned this the hard way haha.

Member Author

Making changes in pyproject.toml

spark-version: [3.2.4, 3.3.4, 3.4.2, 3.5.1]
pandas-version: [2.2.1, 1.5.3]
exclude:
- python-version: '3.11'
spark-version: 3.1.3
- python-version: '3.11'
spark-version: 3.2.4
- python-version: '3.11'
@@ -51,6 +50,7 @@ jobs:
python -m pip install --upgrade pip
python -m pip install pytest pytest-spark pypandoc
python -m pip install pyspark==${{ matrix.spark-version }}
python -m pip install pandas==${{ matrix.pandas-version }}
python -m pip install .[dev]
- name: Test with pytest
run: |
@@ -62,7 +62,8 @@ jobs:
strategy:
fail-fast: false
matrix:
python-version: [3.8, 3.9, '3.10', '3.11']
python-version: [3.9, '3.10', '3.11']

env:
PYTHON_VERSION: ${{ matrix.python-version }}

@@ -88,7 +89,7 @@ jobs:
strategy:
fail-fast: false
matrix:
python-version: [3.8, 3.9, '3.10', '3.11']
python-version: [3.9, '3.10', '3.11']
env:
PYTHON_VERSION: ${{ matrix.python-version }}
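
The test matrix in this workflow can be reasoned about offline. What follows is a minimal sketch, not part of the PR, of how a GitHub Actions style matrix expands into jobs: every combination of the axes is generated, and a combination is dropped when it matches all keys of an `exclude` entry. The `expand_matrix` helper is hypothetical, and only the two fully visible `exclude` entries are modelled, because the third is cut off in the diff.

```python
from itertools import product

# Axes as they appear after this PR (kept as strings to avoid 3.10 -> 3.1).
MATRIX = {
    "python-version": ["3.9", "3.10", "3.11"],
    "spark-version": ["3.2.4", "3.3.4", "3.4.2", "3.5.1"],
    "pandas-version": ["2.2.1", "1.5.3"],
}

# Only the exclude entries that are fully visible in the diff.
EXCLUDE = [
    {"python-version": "3.11", "spark-version": "3.1.3"},
    {"python-version": "3.11", "spark-version": "3.2.4"},
]


def expand_matrix(matrix, exclude):
    """Return every axis combination that no exclude entry fully matches."""
    keys = list(matrix)
    jobs = [dict(zip(keys, values)) for values in product(*(matrix[k] for k in keys))]
    return [
        job
        for job in jobs
        if not any(all(job.get(k) == v for k, v in rule.items()) for rule in exclude)
    ]


jobs = expand_matrix(MATRIX, EXCLUDE)
print(len(jobs), "jobs")
```

The Spark `3.1.3` entry no longer matches anything, since that version was removed from the `spark-version` axis.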

48 changes: 38 additions & 10 deletions README.md
@@ -38,16 +38,44 @@ pip install datacompy[ray]

```

### In-scope Spark versions
Different versions of Spark play nicely with only certain versions of Python. Below is a matrix of what we test with.
### Legacy Spark Deprecation

#### Starting with version 0.12.0

The original ``SparkCompare`` implementation differs from all the other native implementations. To better align the API and keep behaviour consistent, we are deprecating the original ``SparkCompare`` and moving it into a new legacy module as ``LegacySparkCompare``.

If you wish to keep using the old ``SparkCompare`` going forward, you can import it with:

```python
from datacompy.legacy import LegacySparkCompare
```
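
For code that has to run against datacompy releases both before and after 0.12.0, a guarded import is one option. This is a sketch under assumptions: `load_attr` is a hypothetical helper, and the `datacompy.legacy` path is taken from the snippet above rather than verified against a released package.

```python
import importlib


def load_attr(module_name, attr_name):
    """Return module_name.attr_name, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, attr_name, None)


# Prefer the legacy location (0.12.0+); otherwise fall back to the
# top-level name, which is the original implementation before 0.12.0.
# Both lookups yield None when datacompy is not installed.
LegacySparkCompare = load_attr("datacompy.legacy", "LegacySparkCompare") or load_attr(
    "datacompy", "SparkCompare"
)
```

Checking `LegacySparkCompare is None` before use keeps the failure explicit instead of raising at import time.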

#### Supported versions and dependencies

Different versions of Spark, Pandas, and Python interact differently. Below is a matrix of what we test with.
With the move to the Pandas on Spark API and compatibility issues with Pandas 2+, we will, for the time being, not support Pandas 2
with the Pandas on Spark implementation. Spark plans to support Pandas 2 in [Spark 4](https://issues.apache.org/jira/browse/SPARK-44101).

With version ``0.12.0``:
- Pandas ``2.0.0`` and above is not supported for the native Spark implementation
- Spark ``3.1`` support is dropped
- Python ``3.8`` support is dropped


| | Spark 3.2.4 | Spark 3.3.4 | Spark 3.4.2 | Spark 3.5.1 |
|-------------|-------------|-------------|-------------|-------------|
| Python 3.9 | ✅ | ✅ | ✅ | ✅ |
| Python 3.10 | ✅ | ✅ | ✅ | ✅ |
| Python 3.11 | ❌ | ❌ | ✅ | ✅ |
| Python 3.12 | ❌ | ❌ | ❌ | ❌ |


| | Pandas <= 1.5.3 | Pandas >= 2.0.0 |
|---------------|-----------------|-----------------|
| Native Pandas | ✅ | ✅ |
| Native Spark | ✅ | ❌ |
| Fugue | ✅ | ✅ |
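
The two tables can also be encoded as data for a quick programmatic check. A minimal sketch follows; the names `TESTED_SPARK`, `PANDAS_2_SUPPORTED`, `is_tested`, and `supports_pandas` are hypothetical and not part of datacompy. Treating every Pandas 1.x release as supported is an assumption, since the table only lists the two ends of the range.

```python
ALL_SPARK = {"3.2.4", "3.3.4", "3.4.2", "3.5.1"}

# Python version -> Spark versions marked as tested in the first table.
TESTED_SPARK = {
    "3.9": set(ALL_SPARK),
    "3.10": set(ALL_SPARK),
    "3.11": {"3.4.2", "3.5.1"},
    "3.12": set(),
}

# Backend -> whether Pandas >= 2.0.0 is marked as supported.
PANDAS_2_SUPPORTED = {
    "Native Pandas": True,
    "Native Spark": False,
    "Fugue": True,
}


def is_tested(python_version, spark_version):
    """True if the Python/Spark pair is marked as tested."""
    return spark_version in TESTED_SPARK.get(python_version, set())


def supports_pandas(backend, pandas_version):
    """True if the backend is marked as supporting this Pandas version."""
    major = int(pandas_version.split(".")[0])
    if major >= 2:
        return PANDAS_2_SUPPORTED[backend]
    return True  # assumption: all Pandas 1.x releases are supported


print(is_tested("3.11", "3.4.2"), supports_pandas("Native Spark", "2.2.1"))
```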

| | Spark 3.1.3 | Spark 3.2.3 | Spark 3.3.4 | Spark 3.4.2 | Spark 3.5.0 |
|-------------|--------------|-------------|-------------|-------------|-------------|
| Python 3.8 | ✅ | ✅ | ✅ | ✅ | ✅ |
| Python 3.9 | ✅ | ✅ | ✅ | ✅ | ✅ |
| Python 3.10 | ✅ | ✅ | ✅ | ✅ | ✅ |
| Python 3.11 | ❌ | ❌ | ❌ | ✅ | ✅ |
| Python 3.12 | ❌ | ❌ | ❌ | ❌ | ❌ |


> [!NOTE]
@@ -56,7 +84,7 @@ Different versions of Spark play nicely with only certain versions of Python bel
## Supported backends

- Pandas: ([See documentation](https://capitalone.github.io/datacompy/pandas_usage.html))
- Spark: ([See documentation](https://capitalone.github.io/datacompy/spark_usage.html))
- Spark (Pandas on Spark API): ([See documentation](https://capitalone.github.io/datacompy/spark_usage.html))
- Polars (Experimental): ([See documentation](https://capitalone.github.io/datacompy/polars_usage.html))
- Fugue is a Python library that provides a unified interface for data processing on Pandas, DuckDB, Polars, Arrow,
Spark, Dask, Ray, and many other backends. DataComPy integrates with Fugue to provide a simple way to compare data
4 changes: 2 additions & 2 deletions datacompy/__init__.py
@@ -13,7 +13,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

__version__ = "0.11.3"
__version__ = "0.12.0"

from datacompy.core import *
from datacompy.fugue import (
@@ -25,4 +25,4 @@
unq_columns,
)
from datacompy.polars import PolarsCompare
from datacompy.spark import NUMERIC_SPARK_TYPES, SparkCompare
from datacompy.spark import SparkCompare
Contributor

Add newline

Member Author

Done
