Specify `columns` when reading files with `DocumentDataset` #311

sarahyurick · 2024-10-18T19:44:30Z

Closes #180.

All of these will now work:

(1) Pandas and cuDF read_json do not support a columns parameter, so we read in the entire DataFrame and then remove unwanted columns behind the scenes.

dataset = DocumentDataset.read_json(dataset_path, columns=["col1", "col2"])
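The read-then-drop approach described in (1) can be sketched with plain pandas. This is only an illustration of the idea, not the code from this PR: the helper name `read_json_with_columns` and the sample records are hypothetical, and the input is assumed to be JSON Lines.

```python
import io

import pandas as pd


def read_json_with_columns(source, columns=None):
    """Hypothetical helper: read JSON Lines, then keep only `columns`."""
    # pandas.read_json has no `columns` parameter, so every field is parsed.
    df = pd.read_json(source, lines=True)
    if columns is not None:
        # Drop the unwanted columns after the full read.
        df = df[columns]
    return df


# Made-up sample data with one column we do not want to keep.
jsonl = io.StringIO(
    '{"col1": "a", "col2": "b", "col3": 1}\n'
    '{"col1": "c", "col2": "d", "col3": 2}\n'
)
subset = read_json_with_columns(jsonl, columns=["col1", "col2"])
```

Because the whole file is parsed before columns are dropped, this saves memory downstream but not I/O or parsing time.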

(2) Pandas and cuDF read_parquet both support a columns parameter, so we are able to take advantage of this functionality.

dataset = DocumentDataset.read_parquet(dataset_path, columns=["col1", "col2"])

(3) Pandas read_pickle (there is no cuDF read_pickle) does not support a columns parameter, so we read in the entire DataFrame and then remove unwanted columns behind the scenes.

dataset = DocumentDataset.read_pickle(dataset_path, columns=["col1", "col2"])

(4) Following cudf.read_json, you can specify dtype together with prune_columns=True to return only the columns named in the dtype argument. Note that Pandas does not support prune_columns, so this option applies only to the cuDF backend.

dataset = DocumentDataset.read_json(
    dataset_path,
    dtype={"col1": str, "col2": str},
    prune_columns=True,
    backend="cudf",
)
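Since pandas has no `prune_columns`, a roughly similar result on the pandas side could be approximated by selecting the keys of the `dtype` mapping after reading. The sketch below is an assumption for illustration only and is not something this PR implements; the helper name `read_json_pruned` and the sample records are hypothetical.

```python
import io

import pandas as pd


def read_json_pruned(source, dtype):
    """Hypothetical helper: keep only the columns named in `dtype`."""
    # pandas.read_json accepts a per-column dtype mapping but still
    # returns every field found in the input.
    df = pd.read_json(source, lines=True, dtype=dtype)
    # Emulate pruning by selecting the dtype keys afterwards.
    return df[list(dtype)]


# Made-up sample data with one column that should be pruned.
jsonl = io.StringIO(
    '{"col1": "a", "col2": "b", "col3": 1}\n'
    '{"col1": "c", "col2": "d", "col3": 2}\n'
)
pruned = read_json_pruned(jsonl, dtype={"col1": str, "col2": str})
```

Unlike cuDF's `prune_columns=True`, this still parses every field before discarding the extras.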

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

nemo_curator/datasets/doc_dataset.py

ryantwolf

One minor thing, then we should be good.

nemo_curator/utils/distributed_utils.py

ryantwolf

Good with me, thanks!

sarahyurick added 2 commits October 18, 2024 12:30

add column param

7036095

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

read_pickle and black

8b9fbc3

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

sarahyurick requested a review from ryantwolf October 18, 2024 20:05

praateekmahajan reviewed Oct 18, 2024

nemo_curator/datasets/doc_dataset.py (outdated, resolved)

optional param

f1ee675

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

sarahyurick requested a review from praateekmahajan October 18, 2024 21:16

sarahyurick mentioned this pull request Oct 21, 2024

[DRAFT] Passing meta to map_partitions for read_data #291

Draft

3 tasks

ryantwolf reviewed Oct 21, 2024

nemo_curator/utils/distributed_utils.py (resolved)

sarahyurick mentioned this pull request Oct 21, 2024

Better mimic DocumentDataset's read_* functions to Dask's read_* functions #50

Open

add parquet comment

71aedc0

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

sarahyurick requested a review from ryantwolf October 21, 2024 21:54

ryantwolf approved these changes Oct 22, 2024

sarahyurick merged commit dc87963 into NVIDIA:main Oct 22, 2024
3 checks passed

sarahyurick deleted the read_columns branch October 25, 2024 20:44
