docs
lhoestq committed Oct 25, 2024
1 parent b4a98f4 commit d700230
Showing 2 changed files with 30 additions and 0 deletions.
1 change: 1 addition & 0 deletions docs/source/package_reference/main_classes.mdx
@@ -171,6 +171,7 @@ The base class [`IterableDataset`] implements an iterable Dataset backed by python generators
- batch
- skip
- take
- shard
- load_state_dict
- state_dict
- info
29 changes: 29 additions & 0 deletions docs/source/stream.mdx
@@ -136,6 +136,35 @@ You can split your dataset one of two ways:

<a id='interleave_datasets'></a>


### Shard

🤗 Datasets supports sharding to divide a very large dataset into a predefined number of chunks. Specify the `num_shards` parameter in [`~IterableDataset.shard`] to determine the number of shards to split the dataset into. You'll also need to provide the shard you want to return with the `index` parameter.

For example, the [amazon_polarity](https://huggingface.co/datasets/amazon_polarity) dataset has 4 shards (in this case they are 4 Parquet files):

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset("amazon_polarity", split="train", streaming=True)
>>> print(dataset)
IterableDataset({
features: ['label', 'title', 'content'],
num_shards: 4
})
```

After sharding the dataset into two chunks, the first chunk contains only 2 of the original 4 shards:

```py
>>> dataset.shard(num_shards=2, index=0)
IterableDataset({
features: ['label', 'title', 'content'],
num_shards: 2
})
```
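
A common use of [`~IterableDataset.shard`] is to give each worker in a distributed job its own chunk of the stream. A minimal sketch, assuming two hypothetical workers:

```py
>>> # each of the 2 workers streams a disjoint subset of the underlying shards
>>> chunks = [dataset.shard(num_shards=2, index=i) for i in range(2)]
>>> [chunk.num_shards for chunk in chunks]
[2, 2]
```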

If your dataset only has one shard (`dataset.num_shards == 1`), you should chunk it using [`IterableDataset.skip`] and [`IterableDataset.take`] instead.
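
For example, a minimal sketch that splits a single-shard stream into two chunks with `take` and `skip`, assuming the dataset is known to hold 100 examples (the count is hypothetical):

```py
>>> # assumption: the stream holds 100 examples in total
>>> first_chunk = dataset.take(50)   # yields examples 0-49
>>> second_chunk = dataset.skip(50)  # yields examples 50-99
```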

## Interleave

[`interleave_datasets`] can combine an [`IterableDataset`] with other datasets. The combined dataset returns alternating examples from each of the original datasets.
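
A minimal sketch of the alternating behavior, assuming two streaming datasets with compatible features (here the train and test splits of amazon_polarity, chosen only for illustration):

```py
>>> from datasets import interleave_datasets, load_dataset
>>> d1 = load_dataset("amazon_polarity", split="train", streaming=True)
>>> d2 = load_dataset("amazon_polarity", split="test", streaming=True)
>>> combined = interleave_datasets([d1, d2])  # yields d1[0], d2[0], d1[1], d2[1], ...
```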
