Update docs on trust_remote_code defaults to False #6981

Merged 5 commits on Jun 19, 2024
Changes from 3 commits
2 changes: 1 addition & 1 deletion docs/source/dataset_script.mdx
@@ -12,7 +12,7 @@ as long as your dataset repository has a [required structure](./repository_struc

<Tip warning=true>

-In the next major release, the new safety features of 🤗 Datasets will disable running dataset loading scripts by default, and you will have to pass `trust_remote_code=True` to load datasets that require running a dataset script.
+For security reasons, 🤗 Datasets have disabled running dataset loading scripts by default, and you have to pass `trust_remote_code=True` to load datasets that require running a dataset script.
albertvillanova marked this conversation as resolved.

</Tip>
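
As a rough illustration of the opt-in described in this tip, the sketch below passes `trust_remote_code=True` to `load_dataset` only for repositories on a small allowlist, and otherwise keeps the `False` default. The names `REVIEWED_REPOS` and `load_reviewed_dataset` and the repository ID are hypothetical and not part of 🤗 Datasets; only `load_dataset` and its `trust_remote_code` argument come from the docs changed here.

```python
from typing import Any, Callable, Optional

# Hypothetical allowlist: repositories whose loading script you have read yourself.
REVIEWED_REPOS = {"my_org/my_script_dataset"}


def load_reviewed_dataset(
    repo_id: str,
    loader: Optional[Callable[..., Any]] = None,
    **kwargs: Any,
) -> Any:
    """Load `repo_id`, opting in to remote code only for reviewed repositories."""
    if loader is None:
        # Imported lazily so the helper can be exercised without the library installed.
        from datasets import load_dataset as loader

    # Keep the library default (False) unless the repository is on the allowlist.
    trust = repo_id in REVIEWED_REPOS
    return loader(repo_id, trust_remote_code=trust, **kwargs)
```

Called as `load_reviewed_dataset("my_org/my_script_dataset", split="train")`, the helper forwards `trust_remote_code=True`; for any other repository it forwards `False`, so a script-based dataset that was not reviewed is not executed. Keeping the allowlist explicit puts the decision to run remote code next to the record of what was actually read.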

4 changes: 2 additions & 2 deletions docs/source/load_hub.mdx
@@ -106,7 +106,7 @@ Certain datasets repositories contain a loading script with the Python code used
Those datasets are generally exported to Parquet by Hugging Face, so that 🤗 Datasets can load the dataset fast and without running a loading script.

Even if a Parquet export is not available, you can still use any dataset with Python code in its repository with `load_dataset`.
-All files and code uploaded to the Hub are scanned for malware (refer to the Hub security documentation for more information), but you should still review the dataset loading scripts and authors to avoid executing malicious code on your machine. You should set `trust_remote_code=True` to use a dataset with a loading script, or you will get a warning:
+All files and code uploaded to the Hub are scanned for malware (refer to the Hub security documentation for more information), but you should still review the dataset loading scripts and authors to avoid executing malicious code on your machine. You should set `trust_remote_code=True` to use a dataset with a loading script, or you will get an error:

```py
>>> from datasets import get_dataset_config_names, get_dataset_split_names, load_dataset
@@ -120,6 +120,6 @@ All files and code uploaded to the Hub are scanned for malware (refer to the Hub

<Tip warning=true>

-In the next major release, the new safety features of 🤗 Datasets will disable running dataset loading scripts by default, and you will have to pass `trust_remote_code=True` to load datasets that require running a dataset script.
+For security reasons, 🤗 Datasets have disabled running dataset loading scripts by default, and you have to pass `trust_remote_code=True` to load datasets that require running a dataset script.
albertvillanova marked this conversation as resolved.

</Tip>
8 changes: 4 additions & 4 deletions src/datasets/hub.py
@@ -42,15 +42,15 @@ def convert_to_parquet(
`<org>/<dataset_name>`.
revision (`str`, *optional*): Branch of the source Hub dataset repository. Defaults to the `"main"` branch.
token (`bool` or `str`, *optional*): Authentication token for the Hugging Face Hub.
-trust_remote_code (`bool`, defaults to `True`): Whether you trust the remote code of the Hub script-based
+trust_remote_code (`bool`, defaults to `False`): Whether you trust the remote code of the Hub script-based
dataset to be executed locally on your machine. This option should only be set to `True` for repositories
where you have read the code and which you trust.

-<Tip warning={true}>
+<Changed version="2.20.0">

-`trust_remote_code` will default to False in the next major release.
+`trust_remote_code` defaults to `False` if not specified.

-</Tip>
+</Changed>

Returns:
`huggingface_hub.CommitInfo`
54 changes: 29 additions & 25 deletions src/datasets/load.py
@@ -1749,18 +1749,19 @@ def dataset_module_factory(
Directory to read/write data. Defaults to `"~/.cache/huggingface/datasets"`.

<Added version="2.16.0"/>
-trust_remote_code (`bool`, defaults to `True`):
+trust_remote_code (`bool`, defaults to `False`):
Whether or not to allow for datasets defined on the Hub using a dataset script. This option
should only be set to `True` for repositories you trust and in which you have read the code, as it will
execute code present on the Hub on your local machine.

-<Tip warning={true}>
+<Added version="2.16.0"/>

-`trust_remote_code` will default to False in the next major release.
+<Changed version="2.20.0">

-</Tip>
+`trust_remote_code` defaults to `False` if not specified.

+</Changed>

-<Added version="2.16.0"/>
**download_kwargs (additional keyword arguments): optional attributes for DownloadConfig() which will override
the attributes in download_config if supplied.

@@ -1961,18 +1962,19 @@ def metric_module_factory(
dynamic_modules_path (Optional str, defaults to HF_MODULES_CACHE / "datasets_modules", i.e. ~/.cache/huggingface/modules/datasets_modules):
Optional path to the directory in which the dynamic modules are saved. It must have been initialized with :obj:`init_dynamic_modules`.
By default, the datasets and metrics are stored inside the `datasets_modules` module.
-trust_remote_code (`bool`, defaults to `True`):
+trust_remote_code (`bool`, defaults to `False`):
Whether or not to allow for datasets defined on the Hub using a dataset script. This option
should only be set to `True` for repositories you trust and in which you have read the code, as it will
execute code present on the Hub on your local machine.

-<Tip warning={true}>
+<Added version="2.16.0"/>

-`trust_remote_code` will default to False in the next major release.
+<Changed version="2.20.0">

-</Tip>
+`trust_remote_code` defaults to `False` if not specified.

+</Changed>

-<Added version="2.16.0"/>
**download_kwargs (additional keyword arguments): optional attributes for DownloadConfig() which will override
the attributes in download_config if supplied.

@@ -2078,18 +2080,18 @@ def load_metric(
revision (Optional ``Union[str, datasets.Version]``): if specified, the module will be loaded from the datasets repository
at this version. By default, it is set to the local version of the lib. Specifying a version that is different from
your local version of the lib might cause compatibility issues.
-trust_remote_code (`bool`, defaults to `True`):
+trust_remote_code (`bool`, defaults to `False`):
Whether or not to allow for datasets defined on the Hub using a dataset script. This option
should only be set to `True` for repositories you trust and in which you have read the code, as it will
execute code present on the Hub on your local machine.

-<Tip warning={true}>
+<Added version="2.16.0"/>

-`trust_remote_code` will default to False in the next major release.
+<Changed version="2.20.0">

-</Tip>
+`trust_remote_code` defaults to `False` if not specified.

-<Added version="2.16.0"/>
+</Changed>

Returns:
`datasets.Metric`
@@ -2220,18 +2222,19 @@ def load_dataset_builder(
**Experimental**. Key/value pairs to be passed on to the dataset file-system backend, if any.

<Added version="2.11.0"/>
-trust_remote_code (`bool`, defaults to `True`):
+trust_remote_code (`bool`, defaults to `False`):
Whether or not to allow for datasets defined on the Hub using a dataset script. This option
should only be set to `True` for repositories you trust and in which you have read the code, as it will
execute code present on the Hub on your local machine.

-<Tip warning={true}>
+<Added version="2.16.0"/>

+<Changed version="2.20.0">

-`trust_remote_code` will default to False in the next major release.
+`trust_remote_code` defaults to `False` if not specified.

-</Tip>
+</Changed>

-<Added version="2.16.0"/>
**config_kwargs (additional keyword arguments):
Keyword arguments to be passed to the [`BuilderConfig`]
and used in the [`DatasetBuilder`].
@@ -2481,18 +2484,19 @@ def load_dataset(
**Experimental**. Key/value pairs to be passed on to the dataset file-system backend, if any.

<Added version="2.11.0"/>
-trust_remote_code (`bool`, defaults to `True`):
+trust_remote_code (`bool`, defaults to `False`):
Whether or not to allow for datasets defined on the Hub using a dataset script. This option
should only be set to `True` for repositories you trust and in which you have read the code, as it will
execute code present on the Hub on your local machine.

-<Tip warning={true}>
+<Added version="2.16.0"/>

-`trust_remote_code` will default to False in the next major release.
+<Changed version="2.20.0">

-</Tip>
+`trust_remote_code` defaults to `False` if not specified.

+</Changed>

-<Added version="2.16.0"/>
**config_kwargs (additional keyword arguments):
Keyword arguments to be passed to the `BuilderConfig`
and used in the [`DatasetBuilder`].