From d7002301e27c5fded7b55cc93850ef32e3ef7da0 Mon Sep 17 00:00:00 2001
From: Quentin Lhoest <lhoest.q@gmail.com>
Date: Fri, 25 Oct 2024 12:56:50 +0200
Subject: [PATCH] docs: document IterableDataset.shard in the streaming guide

---
 .../source/package_reference/main_classes.mdx |  1 +
 docs/source/stream.mdx                        | 29 +++++++++++++++++++
 2 files changed, 30 insertions(+)
diff --git a/docs/source/package_reference/main_classes.mdx b/docs/source/package_reference/main_classes.mdx
index 402e291be38..185bde10d72 100644
--- a/docs/source/package_reference/main_classes.mdx
+++ b/docs/source/package_reference/main_classes.mdx
@@ -171,6 +171,7 @@ The base class [`IterableDataset`] implements an iterable Dataset backed by pyth
     - batch
     - skip
     - take
+    - shard
     - load_state_dict
     - state_dict
     - info
diff --git a/docs/source/stream.mdx b/docs/source/stream.mdx
index df5109b25aa..0be393ce4a8 100644
--- a/docs/source/stream.mdx
+++ b/docs/source/stream.mdx
@@ -136,6 +136,35 @@ You can split your dataset one of two ways:
 
 <a id='interleave_datasets'></a>
 
+
+### Shard
+
+🤗 Datasets supports sharding to divide a very large dataset into a predefined number of chunks. Use the `num_shards` parameter of [`~IterableDataset.shard`] to set how many chunks to split the dataset into, and the `index` parameter to select which chunk to return.
+
+For example, the [amazon_polarity](https://huggingface.co/datasets/amazon_polarity) dataset has 4 shards (in this case they are 4 Parquet files):
+
+```py
+>>> from datasets import load_dataset
+>>> dataset = load_dataset("amazon_polarity", split="train", streaming=True)
+>>> print(dataset)
+IterableDataset({
+    features: ['label', 'title', 'content'],
+    num_shards: 4
+})
+```
+
+After splitting the dataset into two chunks, the first chunk contains only 2 of the 4 shards:
+
+```py
+>>> dataset.shard(num_shards=2, index=0)
+IterableDataset({
+    features: ['label', 'title', 'content'],
+    num_shards: 2
+})
+```
+
+If your dataset has only one shard (`dataset.num_shards == 1`), chunk it using [`IterableDataset.skip`] and [`IterableDataset.take`] instead.
+
 ## Interleave
 
 [`interleave_datasets`] can combine an [`IterableDataset`] with other datasets. The combined dataset returns alternating examples from each of the original datasets.
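
Note on the single-shard fallback described in the Shard section: the skip/take approach can be sketched in plain Python, with `itertools.islice` standing in for the stream. This is only an illustration under stated assumptions: `chunk_bounds` and `take_chunk` are hypothetical helpers (not part of 🤗 Datasets), and the total number of examples is assumed to be known in advance, which is not always the case for a streaming dataset.

```python
from itertools import islice


def chunk_bounds(num_examples, num_chunks, index):
    """Hypothetical helper: return (skip, take) for contiguous chunk `index`.

    The first `num_examples % num_chunks` chunks receive one extra example,
    so chunk sizes differ by at most one.
    """
    base, extra = divmod(num_examples, num_chunks)
    skip = index * base + min(index, extra)
    take = base + (1 if index < extra else 0)
    return skip, take


def take_chunk(stream, num_examples, num_chunks, index):
    """Hypothetical helper: lazily yield one chunk of a single-shard stream."""
    skip, take = chunk_bounds(num_examples, num_chunks, index)
    # islice plays the role of skip(...) followed by take(...)
    return islice(stream, skip, skip + take)


# Toy stand-in for a streamed dataset with a single shard
examples = [{"id": i} for i in range(10)]
chunks = [list(take_chunk(iter(examples), len(examples), 3, i)) for i in range(3)]
```

With an actual `IterableDataset`, the same bounds would presumably be applied as `dataset.skip(skip).take(take)`. Each chunk re-reads the stream from the start and discards the skipped examples, so later chunks cost more to reach than earlier ones.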