[RELEASE] cudf v24.10 #16943

raydouglass · 2024-09-27T14:36:03Z

❄️ Code freeze for `branch-24.10` and v24.10 release

What does this mean?

Only critical/hotfix level issues should be merged into branch-24.10 until release (merging of this PR).

What is the purpose of this PR?

Update documentation
Allow testing for the new release
Enable a means to merge branch-24.10 into main for the release

…6454) `cudf.Series` is a public constructor that happens to accept a private `ColumnBase` object. Many ops return Columns, and it is natural to want to reconstruct a `Series` from them. This PR adds a `SingleColumnFrame._from_column` classmethod for instances where we need to wrap a new column in an `Index` or `Series`. This constructor also bypasses some unneeded validation in `ColumnAccessor` and `Series`. Authors: - Matthew Roeschke (https://github.com/mroeschke) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #16454

Forward-merge branch-24.08 into branch-24.10

Add `stream` param to a bunch of stream compaction APIs. Authors: - Jayjeet Chakraborty (https://github.com/JayjeetAtGithub) - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Nghia Truong (https://github.com/ttnghia) - Mark Harris (https://github.com/harrism) - Karthikeyan (https://github.com/karthikeyann) - Mike Wilson (https://github.com/hyperbolic2346) URL: #16295

…rsion (#16503) Contributes to rapidsai/build-planning#58. `scikit-build-core==0.10.0` was released today (https://github.com/scikit-build/scikit-build-core/releases/tag/v0.10.0), and wheel-building configurations across RAPIDS are incompatible with it. This proposes upgrading to that version and fixing configuration here in a way that: * is compatible with that new `scikit-build-core` version * takes advantage of the forward-compatibility mechanism (`minimum-version`) that `scikit-build-core` provides, to reduce the risk of needing to do this again in the future Authors: - James Lamb (https://github.com/jameslamb) Approvers: - https://github.com/jakirkham URL: #16503

Exposes the `stream` param in transform APIs Authors: - Jayjeet Chakraborty (https://github.com/JayjeetAtGithub) Approvers: - Bradley Dice (https://github.com/bdice) - Karthikeyan (https://github.com/karthikeyann) URL: #16452

…16498) Demonstrates the conversion from an `arrow:StringViewArray` to a `cudf::column` Authors: - Jayjeet Chakraborty (https://github.com/JayjeetAtGithub) Approvers: - Nghia Truong (https://github.com/ttnghia) URL: #16498

Changes the integer type for `cudf::strings::ipv4_to_integers` and `cudf::strings::integers_to_ipv4` to use UINT32 types instead of INT64. The INT64 type was originally chosen because libcudf did not support unsigned types at the time. This is a breaking change since the basic input/output type is changed. Closes #16324 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Matthew Roeschke (https://github.com/mroeschke) - https://github.com/brandon-b-miller - Karthikeyan (https://github.com/karthikeyann) URL: #16489
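To illustrate why an unsigned 32-bit type is the natural fit here, the following sketch uses Python's standard `ipaddress` module rather than libcudf. The helper names `ipv4_to_int` and `int_to_ipv4` are made up for this example and are not the libcudf functions.

```python
import ipaddress

UINT32_MAX = 2**32 - 1
INT32_MAX = 2**31 - 1


def ipv4_to_int(address):
    # Standard-library conversion of dotted-quad text to its integer value.
    return int(ipaddress.IPv4Address(address))


def int_to_ipv4(value):
    # Inverse conversion; IPv4Address raises for values outside 0..2**32 - 1.
    return str(ipaddress.IPv4Address(value))


lowest = ipv4_to_int("0.0.0.0")
highest = ipv4_to_int("255.255.255.255")
round_trip = int_to_ipv4(ipv4_to_int("192.168.0.1"))
```

Every IPv4 address fits in the unsigned 32-bit range, but the upper half of the address space overflows a signed 32-bit integer, which is consistent with the note above that a wider signed type was used before unsigned types were available.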

A few small tweaks to `update-version.sh` for alignment across RAPIDS. The `UCX_PY` curl call is unused. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - James Lamb (https://github.com/jameslamb) URL: #16506

This PR updates pre-commit hooks to the latest versions that are supported without causing style check errors. Authors: - Kyle Edwards (https://github.com/KyleFromNVIDIA) Approvers: - James Lamb (https://github.com/jameslamb) URL: #16510

@srinivasyadav18

This PR adopts some work from @srinivasyadav18 with additional modifications. This is meant to complement #16484. Authors: - Bradley Dice (https://github.com/bdice) - Srinivas Yadav (https://github.com/srinivasyadav18) Approvers: - Yunsong Wang (https://github.com/PointKernel) - Srinivas Yadav (https://github.com/srinivasyadav18) URL: #16497

Closes #15278. In the JSON reader, when the mixed-types-as-string option is enabled and a user-provided schema specifies a column as string, this PR allows list-type values to be forced to string as well. Authors: - Karthikeyan (https://github.com/karthikeyann) - Nghia Truong (https://github.com/ttnghia) Approvers: - Nghia Truong (https://github.com/ttnghia) - Shruti Shivakumar (https://github.com/shrshi) URL: #16472

Removes the overloaded `cudf::io::text::multibyte_split` API, which was deprecated in 24.08 and is no longer needed. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Nghia Truong (https://github.com/ttnghia) - Bradley Dice (https://github.com/bdice) URL: #16501

Authors: - Jayjeet Chakraborty (https://github.com/JayjeetAtGithub) Approvers: - Karthikeyan (https://github.com/karthikeyann) URL: #16423

This change updates the JSON normalization calls (quote and whitespace normalization) to take an owning `device_buffer` as input rather than a `device_uvector`. This makes it easy to hand over a string_column's char buffer to the normalization calls. Authors: - Karthikeyan (https://github.com/karthikeyann) Approvers: - David Wendt (https://github.com/davidwendt) - Shruti Shivakumar (https://github.com/shrshi) URL: #16520

closes #14794 Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - Matthew Murray (https://github.com/Matt711) URL: #16519

#16516) xref #16507 `date_range` generates its dates via `range`, and the end of this range was calculated via `math.ceil((end - start) / freq)`. If `(end - start) / freq` did not produce a remainder, `math.ceil` would not increment this value by `1`, so the last date was not captured. Instead, this PR uses `math.floor((end - start) / freq) + 1` so that the last date is always captured. Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - Bradley Dice (https://github.com/bdice) URL: #16516
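The off-by-one in the `date_range` fix can be reproduced with plain integers standing in for timestamps. This is a minimal sketch, not cudf's implementation: the helper names `periods_ceil`, `periods_floor_plus_one`, and `points` are invented for illustration, and both endpoints are assumed to be inclusive.

```python
import math


def periods_ceil(start, end, freq):
    # Old calculation: one short when freq divides (end - start) evenly.
    return math.ceil((end - start) / freq)


def periods_floor_plus_one(start, end, freq):
    # New calculation: the point at `start` plus one point per full step.
    return math.floor((end - start) / freq) + 1


def points(start, freq, periods):
    # Generate the values the way a range-based generator would.
    return [start + i * freq for i in range(periods)]


# freq divides the span evenly: the ceil version drops the endpoint.
even_old = points(0, 2, periods_ceil(0, 6, 2))
even_new = points(0, 2, periods_floor_plus_one(0, 6, 2))

# freq does not divide the span evenly: both versions agree.
uneven_old = points(0, 2, periods_ceil(0, 7, 2))
uneven_new = points(0, 2, periods_floor_plus_one(0, 7, 2))
```

With a span of 6 and a step of 2, the old formula yields three points and stops at 4, while the new one yields four points and includes 6. With a span of 7 the two formulas produce the same four points.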

xref #16507 Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - Matthew Murray (https://github.com/Matt711) URL: #16515

xref #16507 I would say this was a bug before because we would silently return a new DataFrame with just `len(set(column_labels))` columns when selecting by column. Now this operation raises, since duplicate column labels are generally not supported. Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - https://github.com/brandon-b-miller URL: #16514
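The duplicate-label check can be sketched with a plain dictionary standing in for a DataFrame. This is only an illustration of the idea: `select_columns` is a hypothetical helper, and the use of `ValueError` is an assumption, since the exception type raised by cudf is not stated here.

```python
def select_columns(available, labels):
    """Select columns by label, rejecting duplicates instead of dropping them."""
    if len(set(labels)) != len(labels):
        # Assumed exception type; chosen only for this sketch.
        raise ValueError("duplicate column labels are not supported")
    return {label: available[label] for label in labels}


data = {"a": [1, 2], "b": [3, 4]}
selected = select_columns(data, ["b", "a"])

try:
    select_columns(data, ["a", "a"])
    raised = False
except ValueError:
    raised = True
```

Raising makes the mismatch between the requested and returned number of columns visible to the caller instead of silently collapsing the selection.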

Removing some more deprecated public libcudf APIs. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Karthikeyan (https://github.com/karthikeyann) URL: #16524

The JSON reader set the batch size to `INT_MAX` bytes since the motivation for implementing a batched JSON reader was to parse source files whose total size is larger than `INT_MAX` (#16138, #16162). However, we can use a much smaller batch size to evaluate the correctness of the reader and speed up tests significantly. This PR focuses on reducing runtime of the batched reader test by setting the batch size to be used by the reader as an environment variable. The runtime of `JsonLargeReaderTest.MultiBatch` in `LARGE_STRINGS_TEST` gtest drops from ~52s to ~3s. Authors: - Shruti Shivakumar (https://github.com/shrshi) Approvers: - Nghia Truong (https://github.com/ttnghia) - David Wendt (https://github.com/davidwendt) - Bradley Dice (https://github.com/bdice) URL: #16502
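The idea of overriding the batch size through the environment can be sketched as follows. This is not the libcudf code: the variable name `EXAMPLE_JSON_BATCH_SIZE` is hypothetical, the clamping behaviour is an assumption, and `INT_MAX` is taken to be the usual 32-bit value.

```python
import os

INT_MAX = 2**31 - 1  # default batch size in bytes

# Hypothetical variable name, chosen only for this sketch.
BATCH_SIZE_ENV = "EXAMPLE_JSON_BATCH_SIZE"


def get_batch_size(environ=os.environ):
    raw = environ.get(BATCH_SIZE_ENV)
    if raw is None:
        return INT_MAX
    # Assumed policy: keep the override within 1..INT_MAX.
    return max(1, min(int(raw), INT_MAX))


def split_into_batches(total_bytes, batch_size):
    # Return (offset, size) pairs that together cover the whole input.
    return [
        (offset, min(batch_size, total_bytes - offset))
        for offset in range(0, total_bytes, batch_size)
    ]


default_size = get_batch_size({})
test_size = get_batch_size({BATCH_SIZE_ENV: "4"})
batches = split_into_batches(10, test_size)
```

A test can then exercise the multi-batch code path on a few bytes of input instead of needing a source larger than `INT_MAX`, which is where the runtime saving comes from.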

…rings (#16536) Recently some JSON parsing was updated so lists could be returned as strings. This updates the Java code so that, when cleaning up the results to match the desired schema, it can properly handle corner cases associated with lists and structs. Tests are covered in the Spark plugin, but I am happy to add some here if we really want to validate that part of this. Authors: - Robert (Bobby) Evans (https://github.com/revans2) Approvers: - Nghia Truong (https://github.com/ttnghia) URL: #16536

Adds `const` declarations to appropriate member functions in class `cudf::io::text::byte_range_info` and moves the ctor implementation to .cpp file. This helps with using the `byte_range_info` objects in `const` variables and inside of `const` functions. Found while working on #15983 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Muhammad Haseeb (https://github.com/mhaseeb123) - Bradley Dice (https://github.com/bdice) URL: #16518

Fixes specialized behavior for an all-empty input column on the strings split APIs.

Verifying behavior with Pandas `str.split( pat, expand, regex )`

`pat=None -- whitespace`
`expand=False -- record APIs`
`regex=True -- re APIs`

- [x] `split`
- [x] `split` - whitespace
- [x] `rsplit`
- [x] `rsplit` - whitespace
- [x] `split_record`
- [x] `split_record` - whitespace
- [x] `rsplit_record`
- [x] `rsplit_record` - whitespace
- [x] `split_re`
- [x] `rsplit_re`
- [x] `split_record_re`
- [x] `rsplit_record_re`

Closes #16453

Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Mark Harris (https://github.com/harrism) - Bradley Dice (https://github.com/bdice) - Mike Wilson (https://github.com/hyperbolic2346) URL: #16466

Removes the pair-iterator benchmark logic. The remaining benchmarks use the null-replacement-iterator which uses the libcudf pair-iterator internally. There is no need for benchmarking this unique iterator pattern that is not used by libcudf. The `cpp/benchmarks/iterator/iterator.cu` failed to compile with gcc 12 because the sum-reduce function cannot resolve adding `thrust::pair` objects together likely due to some recent changes in CCCL. Regardless, adding `thrust::pair` objects is not something we need to benchmark. The existing benchmark benchmarks libcudf's usage of the internal pair-iterator correctly. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Yunsong Wang (https://github.com/PointKernel) - Bradley Dice (https://github.com/bdice) URL: #16511

This PR removes hardcoded Python versions from CI workflows. It is a prerequisite for dropping Python 3.9. See rapidsai/build-planning#88. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - James Lamb (https://github.com/jameslamb) URL: #16540

After dask/dask-expr#1114, Dask cuDF must register specific `read_parquet` and `read_csv` functions to be used when query-planning is enabled (the default). **This PR is required for CI to pass with dask>2024.8.0** **NOTE**: It probably doesn't make sense to add specific tests for this change. Once the 2024.7.1 dask pin is removed, all `dask_cudf` tests using `read_parquet` and `read_csv` will fail without this change... Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) Approvers: - Mads R. B. Kristensen (https://github.com/madsbk) - Benjamin Zaitlen (https://github.com/quasiben) URL: #16535

) When Python integers are compared to a series of integers, the result can always be correctly defined no matter the value of the Python integer. This was always a very mild issue. But with NumPy 2 no longer upcasting the computation result type based on the value, even things like:

```
cudf.Series([1, 2, 3], dtype="int8") < 1000
```

would fail. (Similar paths could be taken for other integer scalars, but these would mostly be nice for performance.) N.B. NumPy/pandas also support exact comparisons when mixing e.g. uint64 and int64. This is another rare exception that cudf currently does not support. Closes gh-16282 Authors: - Sebastian Berg (https://github.com/seberg) Approvers: - Matthew Roeschke (https://github.com/mroeschke) URL: #16532
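One way to see why such a comparison is always well defined is to handle out-of-range scalars before any cast happens. The sketch below uses plain Python lists in place of an int8 column; `less_than_int8` is a hypothetical helper and is not necessarily how cudf implements the fix.

```python
INT8_MIN, INT8_MAX = -128, 127


def less_than_int8(values, scalar):
    """Compare int8 values against any Python int without casting the scalar."""
    if scalar > INT8_MAX:
        # Every representable int8 value is smaller than the scalar.
        return [True] * len(values)
    if scalar <= INT8_MIN:
        # No representable int8 value is smaller than the scalar.
        return [False] * len(values)
    # The scalar fits in int8, so an element-wise comparison is safe.
    return [v < scalar for v in values]


result_large = less_than_int8([1, 2, 3], 1000)
result_small = less_than_int8([1, 2, 3], -1000)
result_in_range = less_than_int8([1, 2, 3], 2)
```

The scalar `1000` cannot be represented as int8, yet the answer is unambiguous: every element is smaller. Only scalars inside the dtype's range need an actual element-wise comparison.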

…mns (#16529) Fixes `cudf::empty_like` to only create empty child columns for nested types. The empty child columns are needed to store the types for consistency with `cudf::make_empty_column`. Closes #16490 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Mike Wilson (https://github.com/hyperbolic2346) - Mark Harris (https://github.com/harrism) URL: #16529

…lity (#16531) Removes `output_size` parameter from `cudf::strings::detail::count_matches` utility since the output size should equal the input size from the first parameter. This also removes an unnecessary `assert()` call. The parameter became unnecessary as part of the large strings work. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Nghia Truong (https://github.com/ttnghia) - Shruti Shivakumar (https://github.com/shrshi) URL: #16531

…16559) Python 3.9 support was recently dropped in RAPIDS, so this changes the Python version to 3.10. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Bradley Dice (https://github.com/bdice) URL: #16559

Contributes to #15162 Authors: - Matthew Roeschke (https://github.com/mroeschke) - Vyas Ramasubramani (https://github.com/vyasr) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #16771

Contributes to #15162 Authors: - Matthew Roeschke (https://github.com/mroeschke) - Matthew Murray (https://github.com/Matt711) - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Matthew Murray (https://github.com/Matt711) - Vyas Ramasubramani (https://github.com/vyasr) URL: #16781

More follow-up fixes to the recent Dask-cuDF documentation additions. Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Vyas Ramasubramani (https://github.com/vyasr) URL: #16929

copy-pr-bot · 2024-09-27T14:36:08Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

…ith non-newline delimiter (#16950) Backporting PR #16923: Parse newline as whitespace character while tokenizing JSONL inputs. Addresses #16915

mythrocks

👍

Add the license file symlink to the `pylibcudf` wheels

mroeschke and others added 30 commits August 7, 2024 00:48

Merge pull request #16505 from rapidsai/branch-24.08

d11d2cf

Forward-merge branch-24.08 into branch-24.10

Expose stream param in transform APIs (#16452)

c146eed

Exposes the `stream` param in transform APIs Authors: - Jayjeet Chakraborty (https://github.com/JayjeetAtGithub) Approvers: - Bradley Dice (https://github.com/bdice) - Karthikeyan (https://github.com/karthikeyann) URL: #16452

Improve update-version.sh (#16506)

da51cad

A few small tweaks to `update-version.sh` for alignment across RAPIDS. The `UCX_PY` curl call is unused. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - James Lamb (https://github.com/jameslamb) URL: #16506

Update pre-commit hooks (#16510)

792dd06

This PR updates pre-commit hooks to the latest versions that are supported without causing style check errors. Authors: - Kyle Edwards (https://github.com/KyleFromNVIDIA) Approvers: - James Lamb (https://github.com/jameslamb) URL: #16510

Update docs of the TPC-H derived examples (#16423)

8009dc8

Authors: - Jayjeet Chakraborty (https://github.com/JayjeetAtGithub) Approvers: - Karthikeyan (https://github.com/karthikeyann) URL: #16423

Allow DataFrame.sort_values(by=) to select an index level (#16519)

16aa0ea

closes #14794 Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - Matthew Murray (https://github.com/Matt711) URL: #16519

Preserve array name in MultiIndex.from_arrays (#16515)

45b20d1

xref #16507 Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - Matthew Murray (https://github.com/Matt711) URL: #16515

Remove deprecated public APIs from libcudf (#16524)

091cb72

Removing some more deprecated public libcudf APIs. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Karthikeyan (https://github.com/karthikeyann) URL: #16524

mroeschke and others added 3 commits September 26, 2024 02:48

raydouglass requested review from a team as code owners September 27, 2024 14:36

raydouglass requested review from KyleFromNVIDIA, wence-, Matt711 and mythrocks September 27, 2024 14:36

raydouglass requested a review from davidwendt September 27, 2024 14:36

ttnghia approved these changes Sep 27, 2024

Parse newline as whitespace character while tokenizing JSONL inputs w…

f20491d

…ith non-newline delimiter (#16950) Backporting PR #16923: Parse newline as whitespace character while tokenizing JSONL inputs. Addresses #16915

mythrocks approved these changes Oct 1, 2024

raydouglass and others added 2 commits October 2, 2024 14:59

Add license to the pylibcudf wheel (#16976)

8a9df04

Add the license file symlink to the `pylibcudf` wheels

Update Changelog [skip ci]

319a533

raydouglass merged commit 39a5beb into main Oct 9, 2024
14 checks passed
