From d7002301e27c5fded7b55cc93850ef32e3ef7da0 Mon Sep 17 00:00:00 2001
From: Quentin Lhoest
Date: Fri, 25 Oct 2024 12:56:50 +0200
Subject: [PATCH] docs

---
 .../source/package_reference/main_classes.mdx |  1 +
 docs/source/stream.mdx                        | 29 +++++++++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/docs/source/package_reference/main_classes.mdx b/docs/source/package_reference/main_classes.mdx
index 402e291be38..185bde10d72 100644
--- a/docs/source/package_reference/main_classes.mdx
+++ b/docs/source/package_reference/main_classes.mdx
@@ -171,6 +171,7 @@ The base class [`IterableDataset`] implements an iterable Dataset backed by pyth
     - batch
     - skip
     - take
+    - shard
     - load_state_dict
     - state_dict
     - info
diff --git a/docs/source/stream.mdx b/docs/source/stream.mdx
index df5109b25aa..0be393ce4a8 100644
--- a/docs/source/stream.mdx
+++ b/docs/source/stream.mdx
@@ -136,6 +136,35 @@ You can split your dataset one of two ways:
+
+### Shard
+
+🤗 Datasets supports sharding to divide a very large dataset into a predefined number of chunks. Specify the `num_shards` parameter in [`~IterableDataset.shard`] to determine the number of shards to split the dataset into. You'll also need to provide the shard you want to return with the `index` parameter.
+
+For example, the [amazon_polarity](https://huggingface.co/datasets/amazon_polarity) dataset has 4 shards (in this case they are 4 Parquet files):
+
+```py
+>>> from datasets import load_dataset
+>>> dataset = load_dataset("amazon_polarity", split="train", streaming=True)
+>>> print(dataset)
+IterableDataset({
+    features: ['label', 'title', 'content'],
+    num_shards: 4
+})
+```
+
+After sharding the dataset into two chunks, the first one will only have 2 shards:
+
+```py
+>>> dataset.shard(num_shards=2, index=0)
+IterableDataset({
+    features: ['label', 'title', 'content'],
+    num_shards: 2
+})
+```
+
+If your dataset has `dataset.num_shards==1`, you should chunk it using [`IterableDataset.skip`] and [`IterableDataset.take`] instead.
+
 ## Interleave
 
 [`interleave_datasets`] can combine an [`IterableDataset`] with other datasets. The combined dataset returns alternating examples from each of the original datasets.
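
Note on the Shard section added by this patch: the 4-files-into-2-chunks example can be pictured with a small pure-Python sketch. It does not call 🤗 Datasets and is not the library's implementation. It assumes that chunks are contiguous blocks of the underlying files and that any remainder goes to the lowest-index chunks. The helper name `contiguous_chunk` and the file names are hypothetical.

```py
def contiguous_chunk(files, num_shards, index):
    """Return the contiguous block of `files` assigned to chunk `index`.

    Assumption: chunks are contiguous, and when len(files) is not divisible
    by num_shards, the first (len(files) % num_shards) chunks get one extra file.
    """
    if not 0 <= index < num_shards:
        raise ValueError("index must satisfy 0 <= index < num_shards")
    div, mod = divmod(len(files), num_shards)
    start = div * index + min(index, mod)
    end = start + div + (1 if index < mod else 0)
    return files[start:end]


# Hypothetical file names standing in for the 4 Parquet files in the example.
files = [f"part-{i}.parquet" for i in range(4)]

first = contiguous_chunk(files, num_shards=2, index=0)
second = contiguous_chunk(files, num_shards=2, index=1)
print(first)   # ['part-0.parquet', 'part-1.parquet']
print(second)  # ['part-2.parquet', 'part-3.parquet']
```

Under this model a dataset backed by a single file cannot be divided by file, which is consistent with the patch's advice to fall back to `skip` and `take` when `dataset.num_shards==1`.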