Fix the environment variable for huggingface cache #7200

Merged

torotoki
Contributor

@torotoki torotoki commented Oct 5, 2024

Resolves #6256. In my testing, HF_DATASETS_CACHE was ignored: with this environment variable I could not set the cache directory to anything other than the default. HF_HOME did work. A recent change to file downloading in huggingface_hub may be related to this bug.

In my testing, I also could not set the cache directory with load_dataset("dataset_name", cache_dir="..."). That might be a separate issue. Any advice on solving it is welcome.

@lhoestq
Member

lhoestq commented Oct 8, 2024

Hi! Yes, datasets now uses huggingface_hub to download and cache files from the HF Hub, so you need to use HF_HOME (or set HF_HUB_CACHE and HF_DATASETS_CACHE manually if you want to keep the files cached from the HF Hub separate from the cached datasets Arrow files).

So in your change, I guess it needs to be HF_HOME instead of HF_CACHE?
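
For illustration, the two options described above (a single HF_HOME root, or separate HF_HUB_CACHE and HF_DATASETS_CACHE locations) could be set from Python roughly as follows. This is only a sketch: the helper name `configure_hf_cache`, the `separate` flag, and the `/mnt/data/hf_cache` path are hypothetical and not part of datasets or huggingface_hub. Only the three environment variable names come from this thread.

```python
import os
from pathlib import Path


def configure_hf_cache(root, separate=False):
    """Point the Hugging Face caches at `root` (hypothetical helper).

    Call this before importing `datasets` or `huggingface_hub`.
    """
    root = Path(root)
    if separate:
        # Keep Hub downloads and processed Arrow files in separate folders.
        os.environ["HF_HUB_CACHE"] = str(root / "hub")
        os.environ["HF_DATASETS_CACHE"] = str(root / "datasets")
    else:
        # One root directory for everything.
        os.environ["HF_HOME"] = str(root)


# Hypothetical path; replace with a directory that exists on your machine.
configure_hf_cache("/mnt/data/hf_cache")

# Import the library only after the variables are set, for example:
# from datasets import load_dataset
```

Setting HF_HOME alone is the simpler choice; the separate variables are only needed when the two kinds of cached files should live in different places.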

@torotoki
Contributor Author

torotoki commented Oct 8, 2024

Thank you for your comment. You are right. Sorry for my mistake; I have fixed it.

Member

@lhoestq lhoestq left a comment

Cool, thanks!

@lhoestq lhoestq merged commit 74b0dd3 into huggingface:main Oct 8, 2024
2 checks passed
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@yukiman76

yukiman76 commented Oct 30, 2024

I just ran into this issue and had to move the code that sets the environment variables to the top of the Python file, before the library import, i.e.:

import os

LOCAL_DISK_MOUNT = '/mnt/data'

# Set the cache locations before importing datasets;
# setting them after the import did not take effect.
os.environ['HF_HOME'] = f'{LOCAL_DISK_MOUNT}/hf_cache/'
os.environ['HF_DATASETS_CACHE'] = f'{LOCAL_DISK_MOUNT}/datasets/'

from datasets import load_dataset, load_dataset_builder
from psutil._common import bytes2human
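
A likely reason the ordering matters is that a library can read its environment variables once, at import time, and store the result in a module-level constant. The sketch below imitates that behavior with a stand-in module built from a string; `fake_lib`, `CACHE_DIR`, and `DEMO_CACHE_DIR` are made-up names for illustration, not the actual datasets code.

```python
import os
import types

# Source of a hypothetical library module that reads its cache path
# once, when the module body runs (that is, at import time).
FAKE_LIB_SOURCE = '''
import os
CACHE_DIR = os.environ.get("DEMO_CACHE_DIR", "default-cache")
'''


def import_fake_lib():
    """Run the stand-in module body, imitating a fresh import."""
    module = types.ModuleType("fake_lib")
    exec(FAKE_LIB_SOURCE, module.__dict__)
    return module


os.environ.pop("DEMO_CACHE_DIR", None)

# Case 1: "import" first, set the variable afterwards (too late).
late_lib = import_fake_lib()
os.environ["DEMO_CACHE_DIR"] = "/mnt/data/datasets"
late_value = late_lib.CACHE_DIR

# Case 2: the variable is already set when the "import" happens.
early_lib = import_fake_lib()
early_value = early_lib.CACHE_DIR

print(late_value)   # default-cache
print(early_value)  # /mnt/data/datasets
```

In the first case the constant keeps the default because the variable was not yet set when the module body ran, which matches the behavior reported above.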

Successfully merging this pull request may close these issues.

load_dataset() function's cache_dir does not seems to work