Commit 971e33e

Fix Overview.ipynb & detach Jupyter Notebooks from datasets repository (#5902)

* Fix and re-run `Overview.ipynb`

* Update `quickstart.mdx`

  * Re-order subsections so that `Text` goes first
  * Add missing install instructions for the machine learning frameworks
  * Add [[open-in-colab]] button
  * Add missing license

* Fix references to new sub-sections

* Remove unnecessary exclamation marks

My guess was that the exclamation mark was used for highlighting, but it isn't, so reverted: 🤗 Datasets! -> 🤗 Datasets

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Add `datasets_doc` to host notebooks in `huggingface/notebooks`

* Add `notebooks/README.md`

As of this commit, the URLs return a 404 because they point to notebooks that have not been pushed yet; they will be pushed as part of `build_documentation`

* Apply suggestions from code review

Co-authored-by: Mario Šaško <mariosasko777@gmail.com>

* Apply suggestions from code review

Co-authored-by: Mario Šaško <mariosasko777@gmail.com>

* Revert `Image` and `Text` renames

Co-authored-by: Mario Šaško <mariosasko777@gmail.com>

* Remove reference to `to_tf_dataset`

Co-authored-by: Mario Šaško <mariosasko777@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Add deprecation message in `Overview.ipynb`

In favor of https://github.com/huggingface/notebooks/blob/main/datasets_doc/quickstart.ipynb

Co-authored-by: Mario Šaško <mariosasko777@gmail.com>

* Add `transformers`, `torch`, and `tensorflow` to the `docs` extra

So that the `TFPreTrainedModel.prepare_tf_dataset` and `DataLoader` examples build properly (see the sketch after this commit message)

Co-authored-by: Mario Šaško <mariosasko777@gmail.com>

* Add `albumentations` to extend the data preparation examples

Co-authored-by: Mario Šaško <mariosasko777@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Minor improvements

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Mario Šaško <mariosasko777@gmail.com>
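
Note on the `docs` extra bullet above: the corresponding `setup.py` hunk is not expanded in the diff below, so the following is only a minimal sketch of what such a change could look like; the extra's exact contents, ordering, and any version pins are assumptions, not the verbatim commit.

```py
# setup.py (sketch, not the verbatim hunk from this commit)
EXTRAS_REQUIRE = {
    # ... other extras ...
    "docs": [
        # Dependencies needed so the documentation notebooks build end to end:
        "transformers",    # for TFPreTrainedModel.prepare_tf_dataset
        "torch",           # for torch.utils.data.DataLoader
        "tensorflow",      # for tf.data.Dataset support
        "albumentations",  # image augmentation in the vision quickstart
    ],
}
```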
3 people authored Jul 25, 2023
1 parent f3da7a5 commit 971e33e
Showing 7 changed files with 1,155 additions and 2,435 deletions.
4 changes: 3 additions & 1 deletion .github/workflows/build_documentation.yml
@@ -6,13 +6,15 @@ on:
       - main
       - doc-builder*
       - v*-release
+      - v*-patch
 
 jobs:
   build:
     uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
     with:
       commit_sha: ${{ github.sha }}
       package: datasets
+      notebook_folder: datasets_doc
     secrets:
-      token: ${{ secrets.HUGGINGFACE_PUSH }}
+      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
3 changes: 0 additions & 3 deletions README.md
@@ -134,9 +134,6 @@ For more details on using the library, check the quick start page in the documentation…
 - Writing your own dataset loading script: https://huggingface.co/docs/datasets/dataset_script
 - etc.
 
-Another introduction to 🤗 Datasets is the tutorial on Google Colab here:
-[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/datasets/blob/main/notebooks/Overview.ipynb)
-
 # Add a new dataset to the Hub
 
 We have a very detailed step-by-step guide to add a new dataset to the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) datasets already provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets).
9 changes: 9 additions & 0 deletions docs/source/_config.py
@@ -1,2 +1,11 @@
+# docstyle-ignore
+INSTALL_CONTENT = """
+# Datasets installation
+! pip install datasets transformers
+# To install from source instead of the last release, comment the command above and uncomment the following one.
+# ! pip install git+https://github.com/huggingface/datasets.git
+"""
+
+notebook_first_cells = [{"type": "code", "content": INSTALL_CONTENT}]
 default_branch_name = "main"
 version_prefix = ""
112 changes: 91 additions & 21 deletions docs/source/quickstart.mdx
@@ -1,5 +1,19 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
 # Quickstart
 
+[[open-in-colab]]
+
 This quickstart is intended for developers who are ready to dive into the code and see an example of how to integrate 🤗 Datasets into their model training workflow. If you're a beginner, we recommend starting with our [tutorials](./tutorial), where you'll get a more thorough introduction.
 
 Each dataset is unique, and depending on the task, some datasets may require additional steps to prepare it for training. But you can always use 🤗 Datasets tools to load and process a dataset. The fastest and easiest way to get started is by loading an existing dataset from the [Hugging Face Hub](https://huggingface.co/datasets). There are thousands of datasets to choose from, spanning many tasks. Choose the type of dataset you want to work with, and let's get started!
@@ -33,17 +47,34 @@ Start by installing 🤗 Datasets:
 pip install datasets
 ```
 
-To work with audio datasets, install the [`Audio`] feature:
-
-```bash
-pip install datasets[audio]
-```
-
-To work with image datasets, install the [`Image`] feature:
-
-```bash
-pip install datasets[vision]
-```
+🤗 Datasets also supports audio and image data formats:
+
+* To work with audio datasets, install the [`Audio`] feature:
+
+  ```bash
+  pip install datasets[audio]
+  ```
+
+* To work with image datasets, install the [`Image`] feature:
+
+  ```bash
+  pip install datasets[vision]
+  ```
+
+Besides 🤗 Datasets, make sure your preferred machine learning framework is installed:
+
+<frameworkcontent>
+<pt>
+```bash
+pip install torch
+```
+</pt>
+<tf>
+```bash
+pip install tensorflow
+```
+</tf>
+</frameworkcontent>
 
 ## Audio
 
@@ -116,16 +147,19 @@ Use the [`~Dataset.set_format`] function to set the dataset format to `torch` and…
 ```
 </pt>
 <tf>
-Use the [`~Dataset.to_tf_dataset`] function to set the dataset format to be compatible with TensorFlow. You'll also need to import a [data collator](https://huggingface.co/docs/transformers/main_classes/data_collator#transformers.DataCollatorWithPadding) from 🤗 Transformers to combine the varying sequence lengths into a single batch of equal lengths:
+
+Use the [`~transformers.TFPreTrainedModel.prepare_tf_dataset`] method from 🤗 Transformers to prepare the dataset to be compatible with
+TensorFlow and ready to train/fine-tune a model: it wraps a Hugging Face [`~datasets.Dataset`] as a `tf.data.Dataset`
+with collation and batching, so you can pass it directly to Keras methods like `fit()` without further modification.
 
 ```py
 >>> import tensorflow as tf
 
->>> tf_dataset = dataset.to_tf_dataset(
-...     columns=["input_values"],
-...     label_cols=["labels"],
+>>> tf_dataset = model.prepare_tf_dataset(
+...     dataset,
 ...     batch_size=4,
-...     shuffle=True)
+...     shuffle=True,
+... )
 ```
 </tf>
 </frameworkcontent>
@@ -190,6 +224,42 @@ Wrap the dataset in [`torch.utils.data.DataLoader`](https://alband.github.io/doc…
 >>> dataloader = DataLoader(dataset, collate_fn=collate_fn, batch_size=4)
 ```
 </pt>
+<tf>
+
+Use the [`~transformers.TFPreTrainedModel.prepare_tf_dataset`] method from 🤗 Transformers to prepare the dataset to be compatible with
+TensorFlow and ready to train/fine-tune a model: it wraps a Hugging Face [`~datasets.Dataset`] as a `tf.data.Dataset`
+with collation and batching, so you can pass it directly to Keras methods like `fit()` without further modification.
+
+Before you start, make sure you have up-to-date versions of `albumentations` and `cv2` installed:
+
+```bash
+pip install -U albumentations opencv-python
+```
+
+```py
+>>> import albumentations
+>>> import numpy as np
+
+>>> transform = albumentations.Compose([
+...     albumentations.RandomCrop(width=256, height=256),
+...     albumentations.HorizontalFlip(p=0.5),
+...     albumentations.RandomBrightnessContrast(p=0.2),
+... ])
+
+>>> def transforms(examples):
+...     examples["pixel_values"] = [
+...         transform(image=np.array(image))["image"] for image in examples["image"]
+...     ]
+...     return examples
+
+>>> dataset.set_transform(transforms)
+>>> tf_dataset = model.prepare_tf_dataset(
+...     dataset,
+...     batch_size=4,
+...     shuffle=True,
+... )
+```
+</tf>
 </frameworkcontent>
 
 **6**. Start training with your machine learning framework! Check out the 🤗 Transformers [image classification guide](https://huggingface.co/docs/transformers/tasks/image_classification) for an end-to-end example of how to train a model on an image dataset.
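
The PyTorch context line in the hunk above passes a `collate_fn` that is defined earlier in the quickstart and not shown in this diff. Roughly, it looks like the following sketch (the exact doc version may differ):

```py
import torch

def collate_fn(examples):
    # Stack the per-example image tensors into a single batch tensor
    # and gather the integer class labels alongside them.
    pixel_values = torch.stack([example["pixel_values"] for example in examples])
    labels = torch.tensor([example["labels"] for example in examples])
    return {"pixel_values": pixel_values, "labels": labels}
```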
@@ -259,19 +329,19 @@ Use the [`~Dataset.set_format`] function to set the dataset format to `torch` and…
 ```
 </pt>
 <tf>
-Use the [`~Dataset.to_tf_dataset`] function to set the dataset format to be compatible with TensorFlow. You'll also need to import a [data collator](https://huggingface.co/docs/transformers/main_classes/data_collator#transformers.DataCollatorWithPadding) from 🤗 Transformers to combine the varying sequence lengths into a single batch of equal lengths:
+
+Use the [`~transformers.TFPreTrainedModel.prepare_tf_dataset`] method from 🤗 Transformers to prepare the dataset to be compatible with
+TensorFlow and ready to train/fine-tune a model: it wraps a Hugging Face [`~datasets.Dataset`] as a `tf.data.Dataset`
+with collation and batching, so you can pass it directly to Keras methods like `fit()` without further modification.
 
 ```py
 >>> import tensorflow as tf
->>> from transformers import DataCollatorWithPadding
 
->>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
->>> tf_dataset = dataset.to_tf_dataset(
-...     columns=["input_ids", "token_type_ids", "attention_mask"],
-...     label_cols=["labels"],
-...     batch_size=2,
-...     collate_fn=data_collator,
-...     shuffle=True)
+>>> tf_dataset = model.prepare_tf_dataset(
+...     dataset,
+...     batch_size=4,
+...     shuffle=True,
+... )
```
 </tf>
 </frameworkcontent>
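
To see how the new snippet fits together end to end, here is a self-contained sketch; the dataset, checkpoint, and training settings are illustrative assumptions, not part of this commit:

```py
# Sketch: full path from load_dataset to Keras fit via prepare_tf_dataset.
# Assumptions: GLUE/MRPC as the dataset and bert-base-cased as the checkpoint.
from datasets import load_dataset
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

dataset = load_dataset("glue", "mrpc", split="train")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize(batch):
    # Tokenize sentence pairs; padding is left to the collator inside prepare_tf_dataset.
    return tokenizer(batch["sentence1"], batch["sentence2"], truncation=True)

dataset = dataset.map(tokenize, batched=True)
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

# Passing the tokenizer lets prepare_tf_dataset pad each batch dynamically.
tf_dataset = model.prepare_tf_dataset(dataset, batch_size=4, shuffle=True, tokenizer=tokenizer)

model.compile(optimizer="adam")  # the model's built-in loss is used when none is given
model.fit(tf_dataset, epochs=1)
```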
