Use LibYAML with PyYAML if available #6266

bryant1410 · 2023-09-27T21:13:36Z

PyYAML, the YAML framework used in this library, allows the use of LibYAML to accelerate the methods load and dump. To use it, a user would need to first install a PyYAML version that uses LibYAML (not available in PyPI; needs to be manually installed). Then, to actually use them, PyYAML suggests importing the LibYAML version of the Loader and Dumper and falling back to the default ones. This PR implements this change. See PyYAML docs for more info.
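For readers unfamiliar with the pattern, the fallback import can be sketched roughly as follows. This is a minimal illustration, not the exact diff in this PR: it assumes the safe loader and dumper variants are the ones wanted, and the helper names load_metadata and dump_metadata are hypothetical.

```python
import yaml

try:
    # The C-accelerated classes exist only when PyYAML was built against LibYAML.
    from yaml import CSafeLoader as SafeLoader, CSafeDumper as SafeDumper
except ImportError:
    # Pure-Python fallback: same behavior, just slower.
    from yaml import SafeLoader, SafeDumper


def load_metadata(text: str) -> dict:
    """Parse a YAML string (hypothetical helper); empty input yields {}."""
    return yaml.load(text, Loader=SafeLoader) or {}


def dump_metadata(data: dict) -> str:
    """Serialize a dict to YAML (hypothetical helper)."""
    return yaml.dump(data, Dumper=SafeDumper)
```

Callers do not need to know which implementation was picked, since both variants are passed to yaml.load and yaml.dump the same way.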

This change was motivated after trying to use any of the SugarCREPE datasets in the Hub provided by the org HuggingFaceM4. Such datasets save a lot of information (~1MB) in the YAML metadata from the README.md file and I noticed this slowed down the data loading process. BTW, I also noticed that creating the cache files for it is slow as well, because it tries to hash an instance of DatasetInfo, which in turn holds all this metadata.

Also, I changed two list comprehensions into generator expressions to avoid allocating extra memory unnecessarily.
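As a generic illustration of that kind of change (the function names and data below are hypothetical, not the code touched by this PR), a consumer such as sum can take a generator expression directly, so the intermediate list is never built:

```python
def total_chars_list(lines):
    # The list comprehension materializes every length before summing.
    return sum([len(line) for line in lines])


def total_chars_gen(lines):
    # The generator expression feeds lengths to sum() one at a time.
    return sum(len(line) for line in lines)
```

Both functions return the same value; only the peak memory use differs.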

And BTW, there's an issue in PyYAML suggesting to make this automatic.

HuggingFaceDocBuilderDev · 2023-09-27T21:21:38Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

bryant1410 · 2023-09-27T21:56:29Z

On Ubuntu, if libyaml-dev is installed, you can install PyYAML 6.0.1 with LibYAML with the following command (as it's automatically detected):

pip install git+https://github.com/yaml/pyyaml.git@6.0.1
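To confirm that the installed build actually picked up LibYAML, a quick check along these lines should work. It relies on the module-level flag __with_libyaml__ that PyYAML sets at import time; the helper name libyaml_available is hypothetical.

```python
import yaml


def libyaml_available() -> bool:
    """Return True when the installed PyYAML exposes the LibYAML bindings."""
    # PyYAML sets this flag when its C extension imports successfully.
    return bool(getattr(yaml, "__with_libyaml__", False))


if __name__ == "__main__":
    print("LibYAML bindings available:", libyaml_available())
```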

bryant1410 · 2023-09-28T00:33:11Z

Are the failing tests flaky?

mariosasko · 2023-09-28T13:48:55Z

We use huggingface_hub's RepoCard API instead of these modules to parse the YAML block (notice the deprecations), so the huggingface_hub repo is the right place to suggest these changes.

Personally, I'm not a fan of these changes, as a single non-standard usage of the ClassLabel type is not a sufficient reason to merge them. Also, the dataset in question stores data in a single Parquet file, with the features info embedded in its (schema) metadata, which means the YAML parsing can be skipped while preserving the features by directly loading the Parquet file:

from datasets import load_dataset
ds = load_dataset("parquet", data_files="https://huggingface.co/datasets/HuggingFaceM4/SugarCrepe_swap_obj/resolve/main/data/test-00000-of-00001-ca2ae6017a2336d7.parquet")

PS: Yes, these tests are flaky. We are working on fixing them.

bryant1410 · 2023-09-28T14:29:23Z

Oh, I didn't realize they were deprecated. Thanks for the tip on how to work around this issue!

For future reference, the places to change the code in huggingface_hub would be:

https://github.com/huggingface/huggingface_hub/blob/89cc69105074f1d071e0471144605f3cdfe1dab3/src/huggingface_hub/repocard.py#L506

https://github.com/huggingface/huggingface_hub/blob/89cc69105074f1d071e0471144605f3cdfe1dab3/src/huggingface_hub/utils/_fixes.py#L34

bryant1410 added 2 commits September 27, 2023 17:05

Use LibYAML with PyYAML if available

738d2c2

Update readme.py

326e789

Fix lint issues

3c64ddd
