Releases: modin-project/modin
Modin 0.19.0
This release introduces Modin's new, experimental NumPy API. It also features
many bug fixes, improvements to documentation, and performance optimizations,
including faster initialization with NumPy arrays.
Key Features and Updates Since 0.18.0
- Stability and Bugfixes
- FIX-#0000: Fix a typo in expr.py (#5757)
- FIX-#1227: Avoid RecursionError for __int__ and __float__ (#5502)
- FIX-#1503: Proper implementation of Series.values (#5469)
- FIX-#2320: Raise exceptions in read_csv in some cases with skipfooter!=0 (#5522)
- FIX-#2493: Defaults to pandas for read_csv if lineterminator!=None (#5515)
- FIX-#2494: Defaults to pandas for read_csv if escapechar!=None (#5521)
- FIX-#2508: Defaults to pandas for read_csv if dialect!=None (#5512)
- FIX-#3080: read_csv with HDK backend doesn't handle duplicated columns (#5639)
- FIX-#3305: Fix read_excel when usecols and index_cols parameters are provided (#5508)
- FIX-#3620: Fix construction of dataframe from index (#5490)
- FIX-#3928: Fix column insertion into empty data frame (#5103)
- FIX-#4154: add value_counts method for SeriesGroupBy and DataFrameGroupBy (#5453)
- FIX-#4186: Fix __repr__ of Modin categorical Series (#5516)
- FIX-#4640: Fix __repr__ when display.max_rows=None (#5504)
- FIX-#5165: make 'groupby' handle non-str 'by' columns (#5411)
- FIX-#5273: Make ParquetFileToRead a named tuple (#5352)
- FIX-#5430: Make groupby work on empty frames (#5442)
- FIX-#5436: Fix '.index' extraction for an empty frame (#5431)
- FIX-#5473: Fixed a bug that ignored positional arguments in DataFrameGroupBy.take() (#5474)
- FIX-#5477: Fix TypeError: read_sas() takes 1 positional argument but 2 were given (#5465)
- FIX-#5488: Remove usage of deprecated numpy types (#5487)
- FIX-#5492: Fix Series.values when Series.dtype==ExtensionDtype (#5493)
- FIX-#5514: pin sphinx<6.0.0 (#5513)
- FIX-#5531: Fix failure when inserting a 2D python list into a frame (#5555)
- FIX-#5537: disable empty-groupby handling logic in experimental mode (#5538)
- FIX-#5539: Allow partitioning to adapt to the shape changes caused by '.merge' (#5556)
- FIX-#5545: Aligned with pandas default 'groupby.skew' results for invalid data (#5558)
- FIX-#5552: Fix sort_values when data is over-partitioned. (#5553)
- FIX-#5561: CalciteSerializer does not support unsigned integers (#5563)
- FIX-#5568: Pin 'fastparquet<2023.1.0' (#5569)
- FIX-#5581: Don't use deprecated inplace parameter for set_axis function (#5579)
- FIX-#5589: Do not trigger metadata materialization on 'filter' (#5588)
- FIX-#5597: pin sqlalchemy<1.4.46 as pandas does to fix CI (#5593)
- FIX-#5598: make PyArrowDataset.files work for 3.0.0 <= pyarrow < 8.0.0 (#5592)
- FIX-#5600: Copy '.dtypes' on 'df.copy()' (#5601)
- FIX-#5604: Fix dictionary groupby aggregation for a single col partition case (#5605)
- FIX-#5608: Pin openpyxl<3.1.0 (#5603)
- FIX-#5610: Add default to pandas implementation for qcut (#5611)
- FIX-#5621: Do not preserve suboptimal partitioning on keep_partitioning=False (#5622)
- FIX-#5625: Fix set_index with modin series. (#5630)
- FIX-#5628: BUG: HDK: Unable to concatenate tables with different number of non-numeric columns (#5673)
- FIX-#5629: Make read_sql alias compatible with snowflake. (#5631)
- FIX-#5650: Restore the right dtype for applying Series.cat (#5651)
- FIX-#5665: Fix operations that flatten an array, as well as handling of where argument in such operations (#5668)
- FIX-#5698: Read list of parquet files (#5725)
- FIX-#5702: Fix passing RangeIndex to loc. (#5719)
- FIX-#5714: BUG: Empty frames concatenation with inner join is not valid (#5715)
- FIX-#5720: Ensure that modin.numpy.array's propagate NaN values when computing mean (#5735)
- FIX-#5721: Fix loc[tuple] on multiindex. (#5726)
- FIX-#5730: Add repr, len, size, and make dtype changing lazy. (#5731)
- FIX-#5733: Allow all Modin objects in all Modin object constructors, and make sure copy=False works (#5736)
- FIX-#5742: BUG: HDK: Binary operations on strings are not supported (#5743)
- FIX-#5761: Add _exp, _sqrt to query compiler (#5762)
- Performance enhancements
- PERF-#5182: Precompute dtypes when performing binary operations in certain cases (#5494)
- PERF-#5183: Compute dtypes when performing from_labels operation (#5478)
- PERF-#5247: Make MultiIndex use memory more efficiently (#5632)
- PERF-#5369: GroupBy.skew implementation via MapReduce pattern (#5318)
- PERF-#5484: speed up read_csv; compute metadata after skipping rows (#5482)
- PERF-#5549: copy dtypes for invert op (#5541)
- PERF-#5550: Don't trigger axes computation in to_pandas function (#5544)
- PERF-#5551: Preserve index and columns on _repartition (#5543)
- PERF-#5554: Implement drop_duplicates via new duplicated (#5587)
- PERF-#5557: Don't trigger axes computation in pivot_table (#5546)
- PERF-#5573: Don't trigger axes computation in columnarize function (#5548)
- PERF-#5575: Don't trigger axes computation in reset_index function (#5547)
- PERF-#5586: Precompute resulting '.merge' partitioning based on the arguments (#5585)
- PERF-#5589: Do not trigger 'dtypes' materialization for '.filter()' (#5595)
- PERF-#5596: Do not trigger index materialization for '.merge' result (#5619)
- PERF-#5613: Optimize duplicated in case there is only one column partition (#5640)
- FIX-#5641: Add fastpath for numpy arrays to dataframe constructor (#5655)
- PERF-#5657: Don't trigger axes computation when accessing .str.* methods (#5658)
- PERF-#5660: Don't trigger axes computation when accessing cat.codes (#5661)
- PERF-#5680: Don't trigger axes computation when doing binary operations (#5681)
- PERF-#5682: Don't trigger axes computation when calling isin (#5683)
- PERF-#5690: move read_callback from dispatchers into parsers (#5689)
- PERF-#5691: Set item via .loc without converting a Series to np.array (#5693)
- PERF-#5700: Treat numpy arrays more efficiently at df.__setitem__ (#5708)
- PERF-#5705: Preserve metadata when applying Series.cat.codes (#5706)
- PERF-#5709: Avoid re-putting a distributed Series to the engine's object store at .map() (#5704)
- PERF-#5710: Avoid re-putting a distributed Series to the engine's object store at .isin() (#5707)
- Refactor Codebase
- REFACTOR-#0000: make deploy functions in virtual_partition.py files private (#5455)
- REFACTOR-#1531: move default_to_pandas into base query_compiler class (#5479)
- REFACTOR-#3883: Unify tests execution approach in the Github workflow files (#5520)
- REFACTOR-#3948: Use __constructor__ in DataFrame and Series classes (#5485)
- REFACTOR-#5275: Deduplicate code for Ray and Unidist engines (#5457)
- REFACTOR-#5370: Move merge_asof implementation to base query compiler. (#5371)
- REFACTOR-#5393: remove unused '_VIEW_IS_COPY_WARNING' global var (#5392)
- REFACTOR-#5416: fix FutureWarning: the mangle_dupe_cols keyword is deprecated for read_excel (#5415)
- REFACTOR-#5434: Define public interfaces in modin.core.execution.dask module (#5418)
- REFACTOR-#5459: Install code linters through conda and unpin flake8 (#5450)
- REFACTOR-#5462: Update execution.ray public api with virtual partitions (#5456)
- REFACTOR-#5467: remove FutureWarning for df.iloc[:, i] = newvals (#5468)
- REFACTOR-#5471: add FutureWarning for DataFrameGroupBy.backfill (#5472)
- REFACTOR-#5475: Update execution.unidist public api with virtual partitions (#5476)
- REFACTOR-#5535: remove duplication for 'columnarize' method (#5534)
- REFACTOR-#5607: Fix missing formatting with 'black' (#5606)
- REFACTOR-#5685: add RayWrapper.put implementation (#5686)
- REFACTOR-#5687: add UnidistWrapper.put implementation (#5688)
- REFACTOR-#5703: align 'DaskWrapper.deploy' behavior with others (#5701)
- REFACTOR-#5718: add columns parameter for get_dtypes function (#5717)
- Update testing suite
- TEST-#0000: correct behavior of CI for push action (#5748)
- TEST-#5420: port asv benchmarks for Repr, MaskBool, isNull, dropNa and equals functions (#5421)
- TEST-#5444: reduce Series' shape for TimeReindex asv bench (#5443)
- TEST-#5448: reduce DataFrame's shape for 'time_merge_default' asv bench (#5446)
- TEST-#5451: reduce shapes for TimeLevelAlign, TimeStack and TimeUnstack ASV benchmarks (#5452)
- TEST-#5540: add module level setup function for ASV benchmarks (#5530)
- TEST-#5664: speedup Post Run conda-incubator/setup-miniconda@v2 step on Windows (#5662)
- TEST-#5747: Synchronize jobs between push.yml and ci.yml that are used to measure test coverage (#5745)
- TEST-#5764: run test-asv-benchmarks CI job only for PRs (#5765)
- Documentation improvements
- New Features
- FEAT-#5147: implement xs (#5143)
- FEAT-#5423: Add a NumPy API to Modin (#5422)
- FEAT-#5481: Implement dictionary groupby aggregation via TreeReduce (#5503)
- FEAT-#5559: Upgrade pandas to 1.5.3 (#5560)
- FEAT-#5562: Upgrade pyhdk to 0.3.1 (#5564)
- FEAT-#5620: Synchronize parameters of apply_full_axis with broadcast_apply_full_axis (#5637)
- FEAT-#5666: Support logic operations on modin numpy arrays (#5667)
- FEAT-#5751: Bump pyhdk version to 0.4 (#5752)
- FEAT-#5753: Add math functions necessary for picoGPT (#5756)
- FEAT-#5754: Add np.linalg operations (#5755)
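PERF-#5369 in the list above moves GroupBy.skew to a MapReduce pattern. As a rough, pure-Python illustration of that idea (not Modin's actual implementation), the sketch below computes per-group counts and power sums on each partition independently, merges them, and only then derives the skewness. The function names, the list-of-tuples partition layout, and the choice to return 0 for a constant group are assumptions made for this example. The final formula is the adjusted Fisher-Pearson coefficient, assumed here to correspond to the unbiased skew that pandas reports.

```python
import math
from collections import defaultdict


def map_partition(rows):
    """Map step: per-group count and power sums for a single partition."""
    partial = defaultdict(lambda: [0, 0.0, 0.0, 0.0])  # n, sum x, sum x^2, sum x^3
    for key, x in rows:
        acc = partial[key]
        acc[0] += 1
        acc[1] += x
        acc[2] += x * x
        acc[3] += x * x * x
    return dict(partial)


def reduce_partials(partials):
    """Reduce step: add the per-partition accumulators group by group."""
    total = defaultdict(lambda: [0, 0.0, 0.0, 0.0])
    for partial in partials:
        for key, acc in partial.items():
            for i, value in enumerate(acc):
                total[key][i] += value
    return dict(total)


def finalize_skew(acc):
    """Turn merged power sums into the adjusted Fisher-Pearson skewness."""
    n, s1, s2, s3 = acc
    if n < 3:
        return math.nan  # the estimator is undefined for fewer than 3 values
    mean = s1 / n
    m2 = s2 / n - mean**2  # second central moment
    m3 = s3 / n - 3 * mean * s2 / n + 2 * mean**3  # third central moment
    if m2 <= 0:
        return 0.0  # constant group (assumption: report zero skew)
    return math.sqrt(n * (n - 1)) / (n - 2) * m3 / m2**1.5


def groupby_skew(partitions):
    """Run map on every partition, reduce once, then finalize per group."""
    merged = reduce_partials(map_partition(p) for p in partitions)
    return {key: finalize_skew(acc) for key, acc in merged.items()}


# Three row partitions of (group key, value) pairs.
partitions = [
    [("a", 1.0), ("a", 2.0), ("b", 1.0)],
    [("a", 3.0), ("b", 2.0), ("b", 3.0)],
    [("a", 10.0), ("a", 4.0), ("b", 4.0), ("b", 5.0)],
]
result = groupby_skew(partitions)
```

Because the map step only needs one partition at a time and the reduce step is a plain element-wise sum, the partitions can be processed in parallel without bringing a whole group into one place. Raw power sums are simple to merge but can lose precision on data with a large mean, so a production implementation may prefer a more numerically careful formulation.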
Contributors
Modin 0.18.1
This release includes pandas 1.5.3 support and a bunch of bug fixes.
Key Features and Updates Since 0.18.0
- Stability and Bugfixes
- FIX-#1227: Avoid RecursionError for __int__ and __float__ (#5502)
- FIX-#1503: Proper implementation of Series.values (#5469)
- FIX-#2320: Raise exceptions in read_csv in some cases with skipfooter!=0 (#5522)
- FIX-#2493: Defaults to pandas for read_csv if lineterminator!=None (#5515)
- FIX-#2494: Defaults to pandas for read_csv if escapechar!=None (#5521)
- FIX-#2508: Defaults to pandas for read_csv if dialect!=None (#5512)
- FIX-#3620: Fix construction of dataframe from index (#5490)
- FIX-#3928: Fix column insertion into empty data frame (#5103)
- FIX-#4186: Fix __repr__ of Modin categorical Series (#5516)
- FIX-#5165: make 'groupby' handle non-str 'by' columns (#5411)
- FIX-#5273: Make ParquetFileToRead a named tuple (#5352)
- FIX-#5436: Fix '.index' extraction for an empty frame (#5431)
- FIX-#5473: Fixed a bug that ignored positional arguments in DataFrameGroupBy.take() (#5474)
- FIX-#5477: Fix TypeError: read_sas() takes 1 positional argument but 2 were given (#5465)
- FIX-#5488: Remove usage of deprecated numpy types (#5487)
- FIX-#5492: Fix Series.values when Series.dtype==ExtensionDtype (#5493)
- FIX-#5514: pin sphinx<6.0.0 (#5513)
- FIX-#5531: Fix failure when inserting a 2D python list into a frame (#5555)
- FIX-#5568: Pin 'fastparquet<2023.1.0' (#5569)
- New Features
Contributors
@AndreyPavlenko
@YarShev
@anmyachev
@dchigarev
@vnlitvinov
@Retribution98
Modin 0.18.0
This release includes support for MPI backend using Unidist, improvements to the shuffling mechanism,
SQL query execution on the HDK backend (currently pyhdk==0.3), support for pandas 1.5.2 and external query compilers.
It also includes many bug fixes and some performance enhancements.
Key Features and Updates Since 0.17.0
- Stability and Bugfixes
- FIX-#3823: Fix TypeError when creating Series from SparseArray (#5377)
- FIX-#4100: Fall back to Pandas on row drop (#4937)
- FIX-#4636: Allows read_parquet to detect column partitioning in non-local filesystems (#5192)
- FIX-#4859: Add support for PyArrow Dictionary Arrays to type mapping (#4864)
- FIX-#4859: Add support for PyArrow Dictionary Arrays to type mapping (#5271)
- FIX-#5016: Suppress spammy ray task errors. (#5298)
- FIX-#5114: Change mask name to resolve namespace conflict with numpy mask (#5215)
- FIX-#5137: df.info failure with default columns (#5251)
- FIX-#5138: df_categories_equals typo (#5250)
- FIX-#5171: Allow xgboost >= 1.7.0. (#5195)
- FIX-#5186: set_index case with multiindex (#5190)
- FIX-#5187: Fixed RecursionError in OmnisciLaunchParameters.get() (#5199)
- FIX-#5204: Fix binary operations with a dictionary (#5205)
- FIX-#5208: Support ray==2.1.0 (#5283)
- FIX-#5232: Stop changing original series names during binary ops. (#5249)
- FIX-#5234: Use query compiler str_repeat. (#5235)
- FIX-#5236: Allow binary operations with custom classes. (#5237)
- FIX-#5238: Make rmul really rmul instead of mul. (#5246)
- FIX-#5240: Fix dask[complete] syntax in conda environment files (#5241)
- FIX-#5252: Disable notebook tests until access control issues are resolved for modin-test bucket (#5257)
- FIX-#5277: Fix internal execute function (#5278)
- FIX-#5284: Move ray, redis, tqdm, xgboost packages from pip to conda deps (#5270)
- FIX-#5285: Check for both pyarrow and fastparquet when read parquet format (#5297)
- FIX-#5306: Fix code scanning alert - Use of the return value of a procedure (#5307)
- FIX-#5308: Allow custom execution with no known engine. (#5379)
- FIX-#5319: Do not use deprecated '.iteritems()' (#5320)
- FIX-#5325: Fix read_csv_glob with non-empty parse_dates dict (#5339)
- FIX-#5327: Bump mypy cap to fix CI. (#5328)
- FIX-#5364: Fix get_indices internal function (#5355)
- FIX-#5380: Fix warning about setting _cache attribute. (#5381)
- FIX-#5398: Resolve length 1 nonNA partition issue, and off by one error in sort (#5400)
- FIX-#5405: Pin ray>=1.13.0 (#5390)
- Performance enhancements
- Refactor Codebase
- REFACTOR-#5202: Pass loc arguments to query compiler. (#5305)
- REFACTOR-#5262: Update the examples to the latest version of the omniscripts (#5263)
- REFACTOR-#5287: Remove code to test getting TypeError for Series.dropna (#5288)
- REFACTOR-#5294: Fix code scanning alert - Potentially uninitialized local variable (#5383)
- REFACTOR-#5299: Variable defined multiple times error found by CodeQL (#5300)
- REFACTOR-#5301: Fix code scanning alert - Duplicate key in dict literal (#5302)
- REFACTOR-#5303: Fix code scanning alert - Unused local variable (#5304)
- REFACTOR-#5310: Remove some hasattr('columns') checks. (#5311)
- REFACTOR-#5312: Let lazy query compilers check for astype and drop errors. (#5313)
- REFACTOR-#5322: Remove python3.7 related code from read_csv_glob (#5323)
- REFACTOR-#5330: Remove BaseIO._read (#5329)
- REFACTOR-#5332: Define PQ_INDEX_REGEX as class variable (#5333)
- REFACTOR-#5334: Make _validate as classmethod (#5331)
- REFACTOR-#5335: Remove unnecessary lambdas (#5336)
- REFACTOR-#5359: Fix code scanning alert - File is not always closed (#5362)
- REFACTOR-#5363: Introduce partition constructor; move add_to_apply_calls impl in base class (#5354)
- REFACTOR-#5382: Use pandas.util.cache_readonly for __constructors__ (#5368)
- REFACTOR-#5386: Move partition.split implementation in base class (#5384)
- REFACTOR-#5391: Improve setup function in TimeDropDuplicatesDataframe (#5389)
- REFACTOR-#5413: Check Index.dtype instead of isinstance(obj, Int64Index) (#5406)
- Update testing suite
- TEST-#2073: Check that read_csv can use a parse_dates dict. (#4572)
- TEST-#4562: In windows CI, try to start ray a few times (#5101)
- TEST-#4821: Monkeypatch cache_readonly to avoid errors in doc_checker.py (#5365)
- TEST-#5123: Add CodeQL workflow for GitHub code scanning (#5222)
- TEST-#5219: Relax matplotlib and coverage pins (#5216)...
Modin 0.17.1
This release includes pandas 1.5.2 support and a bunch of bug fixes.
Key Features and Updates Since 0.17.0
- Stability and Bugfixes
- FIX-#4100: Fall back to Pandas on row drop (#4937)
- FIX-#4636: allows read_parquet to detect column partitioning in non-local filesystems (#5192)
- FIX-#5138: df_categories_equals typo (#5250)
- FIX-#5186: set_index case with multiindex (#5190)
- FIX-#5187: Fixed RecursionError in OmnisciLaunchParameters.get() (#5199)
- FIX-#5204: fix binary operations with a dictionary (#5205)
- FIX-#5232: Stop changing original series names during binary ops. (#5249)
- FIX-#5234: Use query compiler str_repeat. (#5235)
- FIX-#5236: Allow binary operations with custom classes. (#5237)
- FIX-#5252: Disable notebook tests until access control issues are resolved for modin-test bucket (#5257)
- New Features
Contributors
@AndreyPavlenko
@Billy2551
@RehanSD
@YarShev
@anmyachev
@dchigarev
@mvashishtha
@noloerino
Modin 0.17.0
This release includes support for pyhdk 0.2. It also includes many bug fixes and some performance enhancements.
Key Features and Updates Since 0.16.0
- Stability and Bugfixes
- FIX-#3764: Ensure df.loc with a scalar out of bounds appends to df (#3765)
- FIX-#4016, FIX-#4086, FIX-#4039: Fall back to pandas in case of duplicate column names (#4896)
- FIX-#4023: Fall back to pandas in case of MultiIndex columns (#5149)
- FIX-#4660: Fix fillna when Modin series object is an argument (#4674)
- FIX-#5034: Handle lists in df.get() (#5035)
- FIX-#5097: Stop using deprecated mangle_dup_cols. (#5104)
- FIX-#5098: Stop using append internally. (#5100)
- FIX-#5099: Fix PandasQueryCompiler.groupby_mean with timestamp in by (#5140)
- FIX-#5112: allows empty partition to be passed into query_compiler.dt_prop_map (#5133)
- FIX-#5128: Fix reading parquet directory from s3. (#5129)
- FIX-#5150: Sync row labels after read_csv when index_col is False (#5151)
- FIX-#5158: Synchronize metadata before to_parquet (#5161)
- FIX-#5168: module 'collections' has no attribute 'Sequence' in dataframe protocol (#5169)
- FIX-#5174: Pin xgboost < 1.7. (#5175)
- FIX-#5180: Do not set OMP_NUM_THREADS=1 on modin.pandas init (#5181)
- FIX-#5184: Fix get_dummies to respect passed columns to be encoded (#5185)
- FIX-#5188: Fix getitem_bool when the key is Series with empty partition (#5189)
- FIX-#5206: pin mypy<0.990 (#5207)
- FIX-#5208: pin ray version under 2.1.0 (#5209)
- Performance enhancements
- Refactor Codebase
- Update testing suite
- Benchmarking enhancements
- Documentation improvements
- New Features
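FIX-#5168 in the list above refers to an error that shows up on newer Python versions: the abstract base class aliases in the top-level collections module, such as collections.Sequence, were removed in Python 3.10 and are available only from collections.abc. The snippet below is a generic illustration of the portable import, not the code changed in the Modin PR; the helper name is_sequence_like is hypothetical.

```python
import collections
from collections.abc import Sequence  # portable location of the ABC


def is_sequence_like(obj):
    """Return True for list-like containers, excluding str and bytes."""
    return isinstance(obj, Sequence) and not isinstance(obj, (str, bytes))


# On Python 3.10+ the old alias no longer exists, which is what produces
# "module 'collections' has no attribute 'Sequence'".
legacy_alias_available = hasattr(collections, "Sequence")

checks = {
    "list": is_sequence_like([1, 2, 3]),
    "tuple": is_sequence_like((1, 2)),
    "str": is_sequence_like("abc"),
    "dict": is_sequence_like({"a": 1}),
}
```

Importing from collections.abc works on every supported Python 3 version, so code written this way does not need a version check.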
Contributors
@AndreyPavlenko
@Billy2551
@RehanSD
@YarShev
@anmyachev
@dchigarev
@devin-petersohn
@ienkovich
@mvashishtha
@noloerino
@pyrito
@rosdyana
@shalearkane
@suhailrehman
@vnlitvinov
Modin 0.16.2
This release includes pandas 1.5.1 support and two bug fixes.
Key Features and Updates
- Stability and Bugfixes
- New Features
Contributors
Modin 0.16.1
This release features a bug fix, as well as fixes for deprecation warnings introduced by pandas 1.5.
Key Features and Updates
- Stability and Bugfixes
- Refactor Codebase
Contributors
Modin 0.16.0
This release includes support for pandas 1.5, support for the latest version of dask, and backwards compatibility with python 3.6 and pandas 1.1. Additionally, it includes many performance enhancements, bug fixes, and documentation improvements.
Key Features and Updates
- Stability and Bugfixes
- FIX-#4570: Replace np.bool -> np.bool_ (#4571)
- FIX-#4543: Fix read_csv in case skiprows=<0, []> (#4544)
- FIX-#4059: Add cell-wise execution for binary ops, fix bin ops for empty dataframes (#4391)
- FIX-#4589: Pin protobuf<4.0.0 to fix ray (#4590)
- FIX-#4577: Set attribute of Modin dataframe to updated value (#4588)
- FIX-#4411: Fix binary_op between datetime64 Series and pandas timedelta (#4592)
- FIX-#4604: Fix groupby + agg in case when multicolumn can arise (#4642)
- FIX-#4582: Inherit custom log layer (#4583)
- FIX-#4639: Fix storage_options usage for read_csv and read_csv_glob (#4644)
- FIX-#4593: Ensure Modin warns when setting columns via attributes (#4621)
- FIX-#4584: Enable pdb debug when running cloud tests (#4585)
- FIX-#4564: Workaround import issues in Ray: auto-import pandas on python start if env var is set (#4603)
- FIX-#4641: Reindex pandas partitions in df.describe() (#4651)
- FIX-#2064: Fix iloc/loc assignment when dataframe is empty (#4677)
- FIX-#4634: Check for FrozenList as by in df.groupby() (#4667)
- FIX-#4680: Fix read_csv that started defaulting to pandas again in case of reading from a buffer and when a buffer has a non-zero starting position (#4681)
- FIX-#4491: Wait for all partitions in parallel in benchmark mode (#4656)
- FIX-#4358: MultiIndex loc shouldn't drop levels for full-key lookups (#4608)
- FIX-#4658: Expand exception handling for read_* functions from s3 storages (#4659)
- FIX-#4672: Fix incorrect warning when setting frame.index or frame.columns (#4721)
- FIX-#4686: Propagate metadata and drain call queue in unwrap_partitions (#4697)
- FIX-#4652: Support categorical data in from_dataframe (#4737)
- FIX-#4756: Correctly propagate storage_options in read_parquet (#4764)
- FIX-#4657: Use fsspec for handling s3/http-like paths instead of s3fs (#4710)
- FIX-#4676: drain sub-virtual-partition call queues (#4695)
- FIX-#4782: Exclude certain non-parquet files in read_parquet (#4783)
- FIX-#4808: Set dtypes correctly after column rename (#4809)
- FIX-#4811: Apply dataframe -> not_dataframe functions to virtual partitions (#4812)
- FIX-#4099: Use mangled column names but keep the original when building frames from arrow (#4767)
- FIX-#4838: Bump up modin-spreadsheet to latest master (#4839)
- FIX-#4840: Change modin-spreadsheet version for notebook requirements (#4841)
- FIX-#4835: Handle Pathlike paths in read_parquet (#4837)
- FIX-#4872: Stop checking the private ray mac memory limit (#4873)
- FIX-#4914: base_lengths should be computed from base_frame instead of self in copartition (#4915)
- FIX-#4848: Fix rebalancing partitions when NPartitions == 1 (#4874)
- FIX-#4927: Fix dtypes computation in dataframe.filter (#4928)
- FIX-#4907: Implement radd for Series and DataFrame (#4908)
- FIX-#4945: Fix _take_2d_positional that loses indexes due to filtering empty dataframes (#4951)
- FIX-#4818, PERF-#4825: Fix where by using the new n-ary operator (#4820)
- FIX-#3983: FIX-#4107: Materialize 'rowid' columns when selecting rows by position (#4834)
- FIX-#4845: Fix KeyError from __getitem_bool for single row dataframes (#4845)
- FIX-#4734: Handle Series.apply when return type is a DataFrame (#4830)
- FIX-#4983: Set frac to None in _sample when n=0 (#4984)
- FIX-#4993: Return _default_to_pandas in df.attrs (#4995)
- FIX-#5043: Fix execute function in ASV utils failed if len(partitions) == 0 (#5044)
- FIX-#4597: Refactor Partition handling of func, args, kwargs (#4715)
- FIX-#4996: Evaluate BenchmarkMode at each function call (#4997)
- FIX-#4022: Fixed empty data frame with index (#4910)
- FIX-#4090: Fixed check if the index is trivial (#4936)
- FIX-#4966: Fix to_timedelta to return Series instead of TimedeltaIndex (#5028)
- FIX-#5042: Fix series getitem with invalid strings (#5048)
- FIX-#4691: Fix binary operations between virtual partitions (#5049)
- FIX-#5045: Fix ray virtual_partition.wait with duplicate object refs (#5058)
- Performance enhancements
- PERF-#4182: Add cell-wise execution for binary ops, fix bin ops for empty dataframes (#4391)
- PERF-#4288: Improve perf of groupby.mean for narrow data (#4591)
- PERF-#4772: Remove df.copy call from from_pandas since it is not needed for Ray and Dask (#4781)
- PERF-#4325: Improve perf of multi-column assignment in __setitem__ when no new column names are assigning (#4455)
- PERF-#3844: Improve perf of drop operation (#4694)
- PERF-#4727: Improve perf of concat operation (#4728)
- PERF-#4705: Improve perf of arithmetic operations between Series objects with shared .index (#4689)
- PERF-#4703: Improve performance in accessing ser.cat.categories, ser.cat.ordered, and ser.__array_priority__ (#4704)
- PERF-#4305: Parallelize read_parquet over row groups (#4700)
- PERF-#4773: Compute lengths and widths in put method of Dask partition like Ray do (#4780)
- PERF-#4732: Avoid overwriting already-evaluated PandasOnRayDataframePartition._length_cache and PandasOnRayDataframePartition._width_cache (#4754)
- PERF-#4862: Don't call compute_sliced_len.remote when row_labels/col_labels == slice(None) (#4863)
(#4863) - PERF-#4713: Stop overriding the ray MacOS object store size limit (#4792)
- PERF-#4944: Avoid default_to_pandas in Series.cat.codes, Series.dt.tz, and Series.dt.to_pytimedelta (#4833)
- PERF-#4851: Compute dtypes for binary operations that can only return bool type and the right operand is not a Modin object (#4852)
- PERF-#4842: copy should not trigger any previous computations (#4843)
- PERF-#4849: Compute dtypes in concat also for ROW_WISE case when possible (#4850)
- PERF-#4929: Compute dtype when using Series.dt accessor (#4930)
- PERF-#4892: Compute lengths in rebalance_partitions when possible (#4893)
- PERF-#4794: Compute caches in _propagate_index_objs (#4888)
- PERF-#4860: PandasDataframeAxisPartition.deploy_axis_func should be serialized only once (#4861)
- PERF-#4890: PandasDataframeAxisPartition.drain should be serialized only once (#4891)
- PERF-#4870: Avoid index materialization in __getattribute__ and __getitem__ (4911)
- PERF-#4886: Use lazy index and columns evaluation in query method (#4887)
- PERF-#4866: iloc function that used in partition.mask should be serialized only once (#4901)
- PERF-#4920: Avoid index and cache computations in take_2d_labels_or_positional unless they are needed (#4921)
- PERF-#4999: don't call apply in virtual partition' drain_call_queue if call_queue is empty (#4975)
- PERF-#4268: Implement partition-parallel getitem for bool Series masks (#4753)
- PERF-#5017: reset_index shouldn't trigger index materialization if possible (#5018)
- PERF-#4963: Use partition width/length methods instead of _compute_axis_labels_and_lengths if index is already known (#4964)
- PERF-#4940: Optimize categorical dtype check in concatenate (#4953)
- Benchmarking enhancements
- TEST-#5066: Add outer join case for TimeConcat benchmark (#5067)
- TEST-#5083: Add merge op with categorical data (#5084)
- FEAT-#4706: Add Modin ClassLogger to PandasDataframePartitionManager (#4707)
- TEST-#5014: Simplify adding new ASV benchmarks (#5015)
- TEST-#5064: Update TimeConcat benchmark with new parameter ignore_index (#5065)
- TEST-#5068: Add binary op benchmark for Series (#5069)
- Refactor Codebase
- REFACTOR-#4530: Standardize access to physical data in partitions (#4563)
- REFACTOR-#4534: Replace logging meta class with class decorator (#4535)
- REFACTOR-#4708: Delete combine dtypes (#4709)
- REFACTOR-#4629: Add type annotations to modin/config (#4685)
- REFACTOR-#4717: Improve PartitionMgr.get_indices() usage (#4718)
- REFACTOR-#4730: make Indexer immutable (#4731)
- REFACTOR-#4774: remove _build_treereduce_func call from _compute_dtypes (#4775)
- REFACTOR-#4750: Delete BaseDataframeAxisPartition.shuffle (#4751)
- REFACTOR-#4722: Stop suppressing undefined name lint (#4723)
- REFACTOR-#4832: unify split_result_of_axis_func_pandas (#4831)
- REFACTOR-#4796: Introduce constant for reduced column name (#4799)
- REFACTOR-#4000: Remove code duplication for PandasOnRayDataframePartitionManager (#4895)
- REFACTOR-#3780: Remove code duplication for PandasOnDaskDataframe (#3781)
- REFACTOR-#4530: Unify access to physical data for any partition type (#4829)
- REFACTOR-#4978: Align modin/core/execution/dask/common/__init__.py with modin/core/execution/ray/common/__init__.py (#4979)
- REFACTOR-#4949: Remove code duplication in default2pandas/dataframe.py and default2pandas/any.py (#4950)
- REFACTOR-#4976: Rename RayTask to RayWrapper in accordance with Dask (#4977)
- REFACTOR-#4885: De-duplicated take_2d_labels_or_positional methods (#4883)
- REFACTOR-#5005: Use finalize method instead of list comprehension + drain_call_queue (#5006)
- REFACTOR-#5001: Remove jenkins stuff (#5002)
- REFACTOR-#5026: Change exception names to simplify grepping (#5027)
- REFACTOR-#4970: Rewrite base implementations of a partition' width/length (#4971)
- REFACTOR-#4942: Remove call method in favor of register due to duplication (4943)
- REFACTOR-#4922: Helpers for take_2d_labels_or_positional (#4865)
- REFACTOR-#5024: Make _row_lengths and `_column...
Modin 0.15.3
This release adds support for pandas 1.4.4 and includes a bunch of
bugfixes.
Key Features and Updates
- Stability and Bugfixes
- FIX-#4593: Ensure Modin warns when setting columns via attributes (#4621)
- FIX-#4604: Fix groupby + agg in case when multicolumn can arise (#4642)
- FIX-#4641: Reindex pandas partitions in df.describe() (#4651)
- FIX-#4634: Check for FrozenList as by in df.groupby() (#4667)
- FIX-#2064: Fix iloc/loc assignment when dataframe is empty (#4677)
- FIX-#4658: Expand exception handling for read_* functions from s3 storages (#4659)
- FIX-#4672: Fix incorrect warning when setting frame.index or frame.columns (#4721)
- FIX-#4686: Propagate metadata and drain call queue in unwrap_partitions (#4697)
- FIX-#4680: Fix read_csv that started defaulting to pandas again in case of reading from a buffer and when a buffer has a non-zero starting position (#4681)
- FIX-#4808: Set dtypes correctly after column rename (#4809)
- FIX-#4811: Apply dataframe -> not_dataframe functions to virtual partitions (#4812)
- FIX-#4848: Fix rebalancing partitions when NPartitions == 1 (#4874)
- FIX-#4838: Bump up modin-spreadsheet to latest master (#4839)
- FIX-#4840: Change modin-spreadsheet version for notebook requirements (#4841)
- FIX-#4657: Use fsspec for handling s3/http-like paths instead of s3fs (#4710)
- FIX-#4639: Fix storage_options usage for read_csv and read_csv_glob (#4644)
- Update testing suite
- Dependencies
Contributors
@helmeleegy
@YarShev
@anmyachev
@pyrito
@prutskov
@jbrockmendel
@mvashishtha
@RehanSD
@vnlitvinov
Modin 0.15.2
This release adds support for pandas 1.4.3, pins protobuf < 4.0.0 to ensure compatibility with ray < 1.13, and includes a bugfix for modifying columns via attribute access.
Key Features and Updates
- Stability and Bugfixes
- Dependencies