docs
lhoestq committed Oct 25, 2024
1 parent b4a98f4 commit d700230
Showing 2 changed files with 30 additions and 0 deletions.
1 change: 1 addition & 0 deletions docs/source/package_reference/main_classes.mdx
@@ -171,6 +171,7 @@ The base class [`IterableDataset`] implements an iterable Dataset backed by python generators
- batch
- skip
- take
- shard
- load_state_dict
- state_dict
- info
29 changes: 29 additions & 0 deletions docs/source/stream.mdx
@@ -136,6 +136,35 @@ You can split your dataset one of two ways:

<a id='interleave_datasets'></a>


### Shard

🤗 Datasets supports sharding to divide a very large dataset into a predefined number of chunks. Specify the `num_shards` parameter in [`~IterableDataset.shard`] to determine the number of shards to split the dataset into. You'll also need to provide the shard you want to return with the `index` parameter.

For example, the [amazon_polarity](https://huggingface.co/datasets/amazon_polarity) dataset has 4 shards (in this case they are 4 Parquet files):

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset("amazon_polarity", split="train", streaming=True)
>>> print(dataset)
IterableDataset({
features: ['label', 'title', 'content'],
num_shards: 4
})
```

After sharding the dataset into two chunks, the first chunk contains only 2 of the original 4 shards:

```py
>>> dataset.shard(num_shards=2, index=0)
IterableDataset({
features: ['label', 'title', 'content'],
num_shards: 2
})
```
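
A common use of [`~IterableDataset.shard`] is to give each worker in a distributed job its own chunk of the stream. A minimal sketch, assuming two hypothetical workers:

```py
>>> # each of the 2 workers streams a disjoint subset of the underlying shards
>>> chunks = [dataset.shard(num_shards=2, index=i) for i in range(2)]
>>> [chunk.num_shards for chunk in chunks]
[2, 2]
```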

If your dataset only has one shard (`dataset.num_shards == 1`), you should chunk it using [`IterableDataset.skip`] and [`IterableDataset.take`] instead.
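
For example, a minimal sketch that splits a single-shard stream into two chunks with `take` and `skip`, assuming the dataset is known to hold 100 examples (the count is hypothetical):

```py
>>> # assumption: the stream holds 100 examples in total
>>> first_chunk = dataset.take(50)   # yields examples 0-49
>>> second_chunk = dataset.skip(50)  # yields examples 50-99
```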

## Interleave

[`interleave_datasets`] can combine an [`IterableDataset`] with other datasets. The combined dataset returns alternating examples from each of the original datasets.
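
A minimal sketch of the alternating behavior, assuming two streaming datasets with compatible features (here the train and test splits of amazon_polarity, chosen only for illustration):

```py
>>> from datasets import interleave_datasets, load_dataset
>>> d1 = load_dataset("amazon_polarity", split="train", streaming=True)
>>> d2 = load_dataset("amazon_polarity", split="test", streaming=True)
>>> combined = interleave_datasets([d1, d2])  # yields d1[0], d2[0], d1[1], d2[1], ...
```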
