diff --git a/docs/source/about_mapstyle_vs_iterable.mdx b/docs/source/about_mapstyle_vs_iterable.mdx
index 84bdb108ed4..08fbf80fa6b 100644
--- a/docs/source/about_mapstyle_vs_iterable.mdx
+++ b/docs/source/about_mapstyle_vs_iterable.mdx
@@ -166,7 +166,7 @@ It provides even faster data loading when iterating using a `for` loop by iterat
 However as soon as your [`Dataset`] has an indices mapping (via [`Dataset.shuffle`] for example), the speed can become 10x slower.
 This is because there is an extra step to get the row index to read using the indices mapping, and most importantly, you aren't reading contiguous chunks of data anymore.
 To restore the speed, you'd need to rewrite the entire dataset on your disk again using [`Dataset.flatten_indices`], which removes the indices mapping.
-This may take a lot of time depending of the size of your dataset though:
+This may take a lot of time depending on the size of your dataset though:
 
 ```python
 my_dataset[0]  # fast
@@ -215,9 +215,9 @@ To restart the iteration of a map-style dataset, you can simply skip the first e
 my_dataset = my_dataset.select(range(start_index, len(dataset)))
 ```
 
-But if you use a `DataLoader` with a `Sampler`, you should instead save the state of your sampler (you might have write a custom sampler that allows resuming).
+But if you use a `DataLoader` with a `Sampler`, you should instead save the state of your sampler (you might have written a custom sampler that allows resuming).
 
-On the other hand, iterable datasets don't provide random access to a specific example inde to resume from. But you can use [`IterableDataset.state_dict`] and [`IterableDataset.load_state_dict`] to resume from a checkpoint instead, similarly to what you can do for models and optimizers:
+On the other hand, iterable datasets don't provide random access to a specific example index to resume from. But you can use [`IterableDataset.state_dict`] and [`IterableDataset.load_state_dict`] to resume from a checkpoint instead, similarly to what you can do for models and optimizers:
 
 ```python
 >>> iterable_dataset = Dataset.from_dict({"a": range(6)}).to_iterable_dataset(num_shards=3)
diff --git a/docs/source/dataset_card.mdx b/docs/source/dataset_card.mdx
index 2925ddc97b5..629050cf054 100644
--- a/docs/source/dataset_card.mdx
+++ b/docs/source/dataset_card.mdx
@@ -4,7 +4,7 @@ Each dataset should have a dataset card to promote responsible usage and inform
 This idea was inspired by the Model Cards proposed by [Mitchell, 2018](https://arxiv.org/abs/1810.03993).
 Dataset cards help users understand a dataset's contents, the context for using the dataset, how it was created, and any other considerations a user should be aware of.
 
-Creating a dataset card is easy and can be done in a just a few steps:
+Creating a dataset card is easy and can be done in just a few steps:
 
 1. Go to your dataset repository on the [Hub](https://hf.co/new-dataset) and click on **Create Dataset Card** to create a new `README.md` file in your repository.
 
diff --git a/docs/source/faiss_es.mdx b/docs/source/faiss_es.mdx
index 120f146b458..ddd274759ea 100644
--- a/docs/source/faiss_es.mdx
+++ b/docs/source/faiss_es.mdx
@@ -1,6 +1,6 @@
 # Search index
 
-[FAISS](https://github.com/facebookresearch/faiss) and [Elasticsearch](https://www.elastic.co/elasticsearch/) enables searching for examples in a dataset. This can be useful when you want to retrieve specific examples from a dataset that are relevant to your NLP task. For example, if you are working on a Open Domain Question Answering task, you may want to only return examples that are relevant to answering your question.
+[FAISS](https://github.com/facebookresearch/faiss) and [Elasticsearch](https://www.elastic.co/elasticsearch/) enable searching for examples in a dataset. This can be useful when you want to retrieve specific examples from a dataset that are relevant to your NLP task. For example, if you are working on an Open Domain Question Answering task, you may want to only return examples that are relevant to answering your question.
 
 This guide will show you how to build an index for your dataset that will allow you to search it.
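The passage touched by the first hunk explains that an indices mapping (created by [`Dataset.shuffle`], for example) slows down row access until [`Dataset.flatten_indices`] rewrites the data. A minimal sketch of that pattern, assuming a toy map-style `Dataset`; the dataset contents and the exact slowdown are illustrative only:

```python
from datasets import Dataset

# A toy map-style dataset; any Dataset backed by an Arrow table behaves the same way.
my_dataset = Dataset.from_dict({"a": list(range(10_000))})

my_dataset[0]  # fast: rows are read from contiguous chunks of data

my_dataset = my_dataset.shuffle(seed=42)  # adds an indices mapping
my_dataset[0]  # slower: each access is resolved through the indices mapping

my_dataset = my_dataset.flatten_indices()  # rewrites the dataset on disk, removing the mapping
my_dataset[0]  # fast again
```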
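Likewise, the second hunk's passage on resuming iteration names [`IterableDataset.state_dict`] and [`IterableDataset.load_state_dict`], whose example is cut off at the hunk boundary. A minimal checkpoint-and-resume sketch; the checkpoint position and dataset contents are arbitrary choices for illustration:

```python
from datasets import Dataset

iterable_dataset = Dataset.from_dict({"a": range(6)}).to_iterable_dataset(num_shards=3)

# Iterate and save a checkpoint partway through.
state_dict = None
for idx, example in enumerate(iterable_dataset):
    print(example)
    if idx == 2:
        state_dict = iterable_dataset.state_dict()  # capture progress after three examples
        break

# Later (e.g. after a restart), restore the saved state and keep iterating.
iterable_dataset.load_state_dict(state_dict)
for example in iterable_dataset:
    print(example)  # continues from the checkpoint rather than from scratch
```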