Unexpected behavior when using load_dataset with streaming=True in a for loop #6731

Closed
uApiv opened this issue Mar 12, 2024 · 2 comments

uApiv commented Mar 12, 2024

Describe the bug

My Code

from datasets import load_dataset

res = []
for i in [0, 1]:
    di = load_dataset(
        "json",
        data_files='path_to.json',
        split='train',
        streaming=True,
    ).map(lambda x: {"source": i})

    res.append(di)

for e in res[0]:
    print(e)

Unexpected Behavior

Every example in res[0] has source=1, but the expected value is 0.

FYI

I also tried switching streaming to False, and the output value is then as expected (0). So there may be a bug when using streaming=True in a for loop.

Steps to reproduce the bug

  1. Create a JSON file with any content.
  2. Run the provided code.
  3. Switch streaming to False and run again to see the expected behavior.

Expected behavior

Each dataset should be mapped with the value that i had in its own iteration of the for loop.

Environment info

Python 3.8.0

datasets==2.18.0
transformers==4.28.1

Ubuntu 20.04

lhoestq (Member) commented Mar 14, 2024

This is normal Python behavior when using a lambda: the i referenced inside your lambda is the global loop variable i, which is looked up when the lambda is called, not when it is defined. By the time you run your for e in res[0] loop, i equals 1.
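The late lookup described above can be reproduced without datasets at all. The following is a minimal sketch in plain Python; the names callbacks and late_bound are illustrative and not part of the original report.

```python
# Each lambda captures the *variable* i, not the value i had
# when the lambda was created.
callbacks = []
for i in [0, 1]:
    callbacks.append(lambda x: {"source": i})

# The loop has finished, so i is 1 for every lambda, including the first.
late_bound = [f(None) for f in callbacks]
print(late_bound)
```

Both callables report source 1, which mirrors what happens when res[0] is iterated after the loop has ended.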

Instead of relying on the global variable, pass the value through fn_kwargs, which map forwards to your function as keyword arguments:

from datasets import load_dataset

res = []
for i in [0, 1]:
    di = load_dataset(
        "json",
        data_files='path_to.json',
        split='train',
        streaming=True,
    # fn_kwargs stores the current value of i at the time map() is called
    ).map(lambda x, source: {"source": source}, fn_kwargs={"source": i})

    res.append(di)

for e in res[0]:
    print(e)
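For comparison, the same early binding can be sketched in plain Python with a default argument or functools.partial. This is only an illustration of the general idiom, not something suggested in the thread; the helper tag and the sample row are hypothetical.

```python
from functools import partial


def tag(example, source):
    """Return a copy of example with a source field added."""
    return {**example, "source": source}


default_arg, partial_bound = [], []
for i in [0, 1]:
    # A default argument is evaluated when the lambda is defined,
    # so each lambda keeps its own value of i.
    default_arg.append(lambda x, source=i: {**x, "source": source})
    # partial stores the keyword argument at creation time as well.
    partial_bound.append(partial(tag, source=i))

row = {"text": "a"}
default_results = [f(row) for f in default_arg]
partial_results = [f(row) for f in partial_bound]
print(default_results)
print(partial_results)
```

In both variants the first callable yields source 0 and the second yields source 1, regardless of when they are called.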

This doesn't happen in non-streaming mode because map is executed immediately, while the variable i still has the right value. In streaming mode, map is applied lazily, on the fly, when you iterate over the dataset.
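The eager versus lazy difference can be imitated with Python's built-in map, which is lazy until it is consumed. This is a rough analogy under that assumption, not a description of how datasets implements streaming; rows, eager and lazy are illustrative names.

```python
rows = [{"text": "a"}, {"text": "b"}]

eager, lazy = [], []
for i in [0, 1]:
    # Eager: list() consumes the map right away, while i has the loop value.
    eager.append(list(map(lambda row: {**row, "source": i}, rows)))
    # Lazy: nothing runs yet; the lambda is only called on iteration.
    lazy.append(map(lambda row: {**row, "source": i}, rows))

# Iterating after the loop: the lazy lambdas now see i == 1.
eager_sources = [row["source"] for row in eager[0]]
lazy_sources = [row["source"] for row in lazy[0]]
print(eager_sources)
print(lazy_sources)
```

The eager list keeps source 0 for the first iteration, while the lazily mapped one reports 1, matching the non-streaming and streaming results in the report.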

uApiv (Author) commented Apr 16, 2024

Thank you very much for your answer. I think this issue can be closed now.

uApiv closed this as completed Apr 16, 2024