Commit 971e33e

Fix Overview.ipynb & detach Jupyter Notebooks from datasets repository (#5902)

* Fix and re-run `Overview.ipynb`

* Update `quickstart.mdx`

  * Re-order subsections so that `Text` goes first
  * Add missing install instructions for the machine learning frameworks
  * Add [[open-in-colab]] button
  * Add missing license

* Fix references to new sub-sections

* Remove unnecessary exclamation marks

My guess was that the exclamation mark was used for highlighting, but it isn't, so reverted: 🤗 Datasets! -> 🤗 Datasets

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Add `datasets_doc` to host notebooks in `huggingface/notebooks`

* Add `notebooks/README.md`

As of this commit, the URLs return a 404 because they point to notebooks that have not been pushed yet; they will be pushed as part of `build_documentation`

* Apply suggestions from code review

Co-authored-by: Mario Šaško <mariosasko777@gmail.com>

* Apply suggestions from code review

Co-authored-by: Mario Šaško <mariosasko777@gmail.com>

* Revert `Image` and `Text` renames

Co-authored-by: Mario Šaško <mariosasko777@gmail.com>

* Remove reference to `to_tf_dataset`

Co-authored-by: Mario Šaško <mariosasko777@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Add deprecation message in `Overview.ipynb`

In favor of https://github.com/huggingface/notebooks/blob/main/datasets_doc/quickstart.ipynb

Co-authored-by: Mario Šaško <mariosasko777@gmail.com>

* Add `transformers`, `torch`, and `tensorflow` to the `docs` extra

So that the `TFPreTrainedModel.prepare_tf_dataset` and `DataLoader` examples build properly (see the sketch after this commit message)

Co-authored-by: Mario Šaško <mariosasko777@gmail.com>

* Add `albumentations` to extend the data preparation examples

Co-authored-by: Mario Šaško <mariosasko777@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Minor improvements

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Mario Šaško <mariosasko777@gmail.com>
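
Note on the `docs` extra bullet above: the corresponding `setup.py` hunk is not expanded in the diff below, so the following is only a minimal sketch of what such a change could look like; the extra's exact contents, ordering, and any version pins are assumptions, not the verbatim commit.

```py
# setup.py (sketch, not the verbatim hunk from this commit)
EXTRAS_REQUIRE = {
    # ... other extras ...
    "docs": [
        # Dependencies needed so the documentation notebooks build end to end:
        "transformers",    # for TFPreTrainedModel.prepare_tf_dataset
        "torch",           # for torch.utils.data.DataLoader
        "tensorflow",      # for tf.data.Dataset support
        "albumentations",  # image augmentation in the vision quickstart
    ],
}
```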
3 people authored Jul 25, 2023
1 parent f3da7a5 commit 971e33e
Showing 7 changed files with 1,155 additions and 2,435 deletions.
4 changes: 3 additions & 1 deletion .github/workflows/build_documentation.yml
@@ -6,13 +6,15 @@ on:
       - main
       - doc-builder*
       - v*-release
+      - v*-patch
 
 jobs:
   build:
     uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
     with:
       commit_sha: ${{ github.sha }}
       package: datasets
+      notebook_folder: datasets_doc
     secrets:
-      token: ${{ secrets.HUGGINGFACE_PUSH }}
+      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
3 changes: 0 additions & 3 deletions README.md
@@ -134,9 +134,6 @@ For more details on using the library, check the quick start page in the documentation…
 - Writing your own dataset loading script: https://huggingface.co/docs/datasets/dataset_script
 - etc.
 
-Another introduction to 🤗 Datasets is the tutorial on Google Colab here:
-[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/datasets/blob/main/notebooks/Overview.ipynb)
-
 # Add a new dataset to the Hub
 
 We have a very detailed step-by-step guide to add a new dataset to the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) datasets already provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets).
9 changes: 9 additions & 0 deletions docs/source/_config.py
@@ -1,2 +1,11 @@
+# docstyle-ignore
+INSTALL_CONTENT = """
+# Datasets installation
+! pip install datasets transformers
+# To install from source instead of the last release, comment the command above and uncomment the following one.
+# ! pip install git+https://github.com/huggingface/datasets.git
+"""
+
+notebook_first_cells = [{"type": "code", "content": INSTALL_CONTENT}]
 default_branch_name = "main"
 version_prefix = ""
112 changes: 91 additions & 21 deletions docs/source/quickstart.mdx
@@ -1,5 +1,19 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
 # Quickstart
 
+[[open-in-colab]]
+
 This quickstart is intended for developers who are ready to dive into the code and see an example of how to integrate 🤗 Datasets into their model training workflow. If you're a beginner, we recommend starting with our [tutorials](./tutorial), where you'll get a more thorough introduction.
 
 Each dataset is unique, and depending on the task, some datasets may require additional steps to prepare it for training. But you can always use 🤗 Datasets tools to load and process a dataset. The fastest and easiest way to get started is by loading an existing dataset from the [Hugging Face Hub](https://huggingface.co/datasets). There are thousands of datasets to choose from, spanning many tasks. Choose the type of dataset you want to work with, and let's get started!
@@ -33,17 +47,34 @@ Start by installing 🤗 Datasets:
 pip install datasets
 ```
 
-To work with audio datasets, install the [`Audio`] feature:
-
-```bash
-pip install datasets[audio]
-```
-
-To work with image datasets, install the [`Image`] feature:
-
-```bash
-pip install datasets[vision]
-```
+🤗 Datasets also supports audio and image data formats:
+
+* To work with audio datasets, install the [`Audio`] feature:
+
+  ```bash
+  pip install datasets[audio]
+  ```
+
+* To work with image datasets, install the [`Image`] feature:
+
+  ```bash
+  pip install datasets[vision]
+  ```
+
+Besides 🤗 Datasets, make sure your preferred machine learning framework is installed:
+
+<frameworkcontent>
+<pt>
+```bash
+pip install torch
+```
+</pt>
+<tf>
+```bash
+pip install tensorflow
+```
+</tf>
+</frameworkcontent>
 
 ## Audio
 
@@ -116,16 +147,19 @@ Use the [`~Dataset.set_format`] function to set the dataset format to `torch` and…
 ```
 </pt>
 <tf>
-Use the [`~Dataset.to_tf_dataset`] function to set the dataset format to be compatible with TensorFlow. You'll also need to import a [data collator](https://huggingface.co/docs/transformers/main_classes/data_collator#transformers.DataCollatorWithPadding) from 🤗 Transformers to combine the varying sequence lengths into a single batch of equal lengths:
+
+Use the [`~transformers.TFPreTrainedModel.prepare_tf_dataset`] method from 🤗 Transformers to prepare the dataset to be compatible with
+TensorFlow and ready to train/fine-tune a model: it wraps a Hugging Face [`~datasets.Dataset`] as a `tf.data.Dataset`
+with collation and batching, so you can pass it directly to Keras methods like `fit()` without further modification.
 
 ```py
 >>> import tensorflow as tf
 
->>> tf_dataset = dataset.to_tf_dataset(
-...     columns=["input_values"],
-...     label_cols=["labels"],
+>>> tf_dataset = model.prepare_tf_dataset(
+...     dataset,
 ...     batch_size=4,
-...     shuffle=True)
+...     shuffle=True,
+... )
 ```
 </tf>
 </frameworkcontent>
@@ -190,6 +224,42 @@ Wrap the dataset in [`torch.utils.data.DataLoader`](https://alband.github.io/doc…
 >>> dataloader = DataLoader(dataset, collate_fn=collate_fn, batch_size=4)
 ```
 </pt>
+<tf>
+
+Use the [`~transformers.TFPreTrainedModel.prepare_tf_dataset`] method from 🤗 Transformers to prepare the dataset to be compatible with
+TensorFlow and ready to train/fine-tune a model: it wraps a Hugging Face [`~datasets.Dataset`] as a `tf.data.Dataset`
+with collation and batching, so you can pass it directly to Keras methods like `fit()` without further modification.
+
+Before you start, make sure you have up-to-date versions of `albumentations` and `cv2` installed:
+
+```bash
+pip install -U albumentations opencv-python
+```
+
+```py
+>>> import albumentations
+>>> import numpy as np
+
+>>> transform = albumentations.Compose([
+...     albumentations.RandomCrop(width=256, height=256),
+...     albumentations.HorizontalFlip(p=0.5),
+...     albumentations.RandomBrightnessContrast(p=0.2),
+... ])
+
+>>> def transforms(examples):
+...     examples["pixel_values"] = [
+...         transform(image=np.array(image))["image"] for image in examples["image"]
+...     ]
+...     return examples
+
+>>> dataset.set_transform(transforms)
+>>> tf_dataset = model.prepare_tf_dataset(
+...     dataset,
+...     batch_size=4,
+...     shuffle=True,
+... )
+```
+</tf>
 </frameworkcontent>
 
 **6**. Start training with your machine learning framework! Check out the 🤗 Transformers [image classification guide](https://huggingface.co/docs/transformers/tasks/image_classification) for an end-to-end example of how to train a model on an image dataset.
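
The PyTorch context line in the hunk above passes a `collate_fn` that is defined earlier in the quickstart and not shown in this diff. Roughly, it looks like the following sketch (the exact doc version may differ):

```py
import torch

def collate_fn(examples):
    # Stack the per-example image tensors into a single batch tensor
    # and gather the integer class labels alongside them.
    pixel_values = torch.stack([example["pixel_values"] for example in examples])
    labels = torch.tensor([example["labels"] for example in examples])
    return {"pixel_values": pixel_values, "labels": labels}
```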
@@ -259,19 +329,19 @@ Use the [`~Dataset.set_format`] function to set the dataset format to `torch` and…
 ```
 </pt>
 <tf>
-Use the [`~Dataset.to_tf_dataset`] function to set the dataset format to be compatible with TensorFlow. You'll also need to import a [data collator](https://huggingface.co/docs/transformers/main_classes/data_collator#transformers.DataCollatorWithPadding) from 🤗 Transformers to combine the varying sequence lengths into a single batch of equal lengths:
+
+Use the [`~transformers.TFPreTrainedModel.prepare_tf_dataset`] method from 🤗 Transformers to prepare the dataset to be compatible with
+TensorFlow and ready to train/fine-tune a model: it wraps a Hugging Face [`~datasets.Dataset`] as a `tf.data.Dataset`
+with collation and batching, so you can pass it directly to Keras methods like `fit()` without further modification.
 
 ```py
 >>> import tensorflow as tf
->>> from transformers import DataCollatorWithPadding
 
->>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
->>> tf_dataset = dataset.to_tf_dataset(
-...     columns=["input_ids", "token_type_ids", "attention_mask"],
-...     label_cols=["labels"],
-...     batch_size=2,
-...     collate_fn=data_collator,
-...     shuffle=True)
+>>> tf_dataset = model.prepare_tf_dataset(
+...     dataset,
+...     batch_size=4,
+...     shuffle=True,
+... )
```
 </tf>
 </frameworkcontent>
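
To see how the new snippet fits together end to end, here is a self-contained sketch; the dataset, checkpoint, and training settings are illustrative assumptions, not part of this commit:

```py
# Sketch: full path from load_dataset to Keras fit via prepare_tf_dataset.
# Assumptions: GLUE/MRPC as the dataset and bert-base-cased as the checkpoint.
from datasets import load_dataset
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

dataset = load_dataset("glue", "mrpc", split="train")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize(batch):
    # Tokenize sentence pairs; padding is left to the collator inside prepare_tf_dataset.
    return tokenizer(batch["sentence1"], batch["sentence2"], truncation=True)

dataset = dataset.map(tokenize, batched=True)
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

# Passing the tokenizer lets prepare_tf_dataset pad each batch dynamically.
tf_dataset = model.prepare_tf_dataset(dataset, batch_size=4, shuffle=True, tokenizer=tokenizer)

model.compile(optimizer="adam")  # the model's built-in loss is used when none is given
model.fit(tf_dataset, epochs=1)
```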
