from datasets import load_dataset

res = []
for i in [0, 1]:
    di = load_dataset(
        "json",
        data_files='path_to.json',
        split='train',
        streaming=True,
    ).map(lambda x: {"source": i})
    res.append(di)

for e in res[0]:
    print(e)
Unexpected Behavior
Data in res[0] has source=1. However the expected value is 0.
FYI
I also switched streaming to False, and then the output value is as expected (0). So there may be a bug when setting streaming=True in a for loop.
This is normal behavior in Python when using lambda: the i in your lambda refers to the global variable i from your loop, and i equals 1 by the time you run your for e in res[0] line.
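Instead of relying on the global variable, you should pass the value through fn_kwargs, which map forwards to your function as keyword arguments, for example .map(lambda x, i: {"source": i}, fn_kwargs={"i": i}).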
This doesn't happen in non-streaming mode, since in that case map is executed immediately, while the variable i still has the right value. In streaming mode, map is executed on the fly when you iterate over the dataset.
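The same effect can be reproduced without datasets at all. The following is a minimal sketch in plain Python, not the library's implementation: lazy_map is a hypothetical helper that stands in for a streaming map, and add_source is an illustrative function name. It shows that a lambda looks up i only when the generator is consumed, while binding the value up front with functools.partial (the same idea as fn_kwargs) keeps each value separate.

```python
from functools import partial


def lazy_map(fn, rows):
    """Hypothetical stand-in for a streaming map: fn runs only on iteration."""
    for row in rows:
        yield {**row, **fn(row)}


def add_source(x, source):
    """Illustrative mapping function that receives its value as an argument."""
    return {"source": source}


rows = [{"text": "a"}, {"text": "b"}]

# Late binding: the lambda reads the loop variable i when it is called,
# which happens only after the loop has finished.
late = []
for i in [0, 1]:
    late.append(lazy_map(lambda x: {"source": i}, rows))

# Early binding: partial stores the current value of i with each function.
bound = []
for i in [0, 1]:
    bound.append(lazy_map(partial(add_source, source=i), rows))

late_sources = [row["source"] for row in late[0]]
bound_sources = [row["source"] for row in bound[0]]
print(late_sources)   # [1, 1]
print(bound_sources)  # [0, 0]
```

Replacing the generator with an eager list comprehension inside the first loop would also give 0 for late[0], which mirrors why the non-streaming run looked correct.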
Environment
Python 3.8.0
datasets==2.18.0
transformers==4.28.1
Steps to reproduce the bug
Run the code above, then switch streaming to False and run again to see the expected behavior.
Expected behavior
The expected behavior is that each dataset is mapped with its corresponding value from the for loop.
Environment info
Python 3.8.0
datasets==2.18.0
transformers==4.28.1
Ubuntu 20.04