Mapping gets stuck at 99% #6077
Also, these arrays are big, so it makes sense to reduce `writer_batch_size`.
Hi @mariosasko! I agree, it's an ugly hack, but it was convenient since the resulting […]
Have you tried to reduce `writer_batch_size`?
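For reference, `writer_batch_size` is a parameter of `Dataset.map`; here is a minimal sketch of what reducing it looks like, with hypothetical data and shapes (nothing below is taken from the original scripts):

```python
import numpy as np
from datasets import Dataset

# Hypothetical dataset of large per-example arrays.
ds = Dataset.from_dict({"data": [np.random.rand(100, 3) for _ in range(1000)]})

ds2 = ds.map(
    lambda ex: {"data": np.asarray(ex["data"], dtype=np.float32) * 2.0},
    writer_batch_size=100,  # default is 1000; smaller batches keep less in memory before each flush
)
```

Smaller write batches trade a bit of throughput for a lower peak memory footprint in the Arrow writer.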
I think […]

Here is also a bunch of stack traces from when I interrupted the process:

stack trace 1:

(pyg)[d623204@rosetta-bigviz01 stage-laurent-f]$ python src/random_scripts/uses_random_data.py
Found cached dataset random_data (/local_scratch/lfainsin/.cache/huggingface/datasets/random_data/default/0.0.0/444e214e1d0e6298cfd3f2368323ec37073dc1439f618e19395b1f421c69b066)
Applying mean/std: 97%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 967/1000 [00:01<00:00, 534.87 examples/s]Traceback (most recent call last):
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_writer.py", line 179, in __arrow_array__
storage = to_pyarrow_listarray(data, pa_type)
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 1466, in to_pyarrow_listarray
return pa.array(data, pa_type.storage_dtype)
File "pyarrow/array.pxi", line 320, in pyarrow.lib.array
File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 123, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Could not convert tensor([[-1.0273, -0.8037, -0.6860],
[-0.5034, -1.2685, -0.0558],
[-1.0908, -1.1820, -0.3178],
...,
[-0.8171, 0.1781, -0.5903],
[ 0.4370, 1.9305, 0.5899],
[-0.1426, 0.9053, -1.7559]]) with type Tensor: was not a sequence or recognized null for conversion to list type
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3449, in _map_single
writer.write(example)
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_writer.py", line 490, in write
self.write_examples_on_file()
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_writer.py", line 448, in write_examples_on_file
self.write_batch(batch_examples=batch_examples)
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_writer.py", line 553, in write_batch
arrays.append(pa.array(typed_sequence))
File "pyarrow/array.pxi", line 236, in pyarrow.lib.array
File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_writer.py", line 223, in __arrow_array__
return pa.array(cast_to_python_objects(data, only_1d_for_numpy=True))
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 446, in cast_to_python_objects
return _cast_to_python_objects(
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 407, in _cast_to_python_objects
[
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 408, in <listcomp>
_cast_to_python_objects(
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 319, in _cast_to_python_objects
[
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 320, in <listcomp>
_cast_to_python_objects(
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 263, in _cast_to_python_objects
def _cast_to_python_objects(obj: Any, only_1d_for_numpy: bool, optimize_list_casting: bool) -> Tuple[Any, bool]:
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_writer.py", line 179, in __arrow_array__
storage = to_pyarrow_listarray(data, pa_type)
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 1466, in to_pyarrow_listarray
return pa.array(data, pa_type.storage_dtype)
File "pyarrow/array.pxi", line 320, in pyarrow.lib.array
File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 123, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Could not convert tensor([[-1.0273, -0.8037, -0.6860],
[-0.5034, -1.2685, -0.0558],
[-1.0908, -1.1820, -0.3178],
...,
[-0.8171, 0.1781, -0.5903],
[ 0.4370, 1.9305, 0.5899],
[-0.1426, 0.9053, -1.7559]]) with type Tensor: was not a sequence or recognized null for conversion to list type
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/gpfs_new/data/users/lfainsin/stage-laurent-f/src/random_scripts/uses_random_data.py", line 62, in <module>
ds_normalized = ds.map(
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 580, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 545, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3087, in map
for rank, done, content in Dataset._map_single(**dataset_kwargs):
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3492, in _map_single
writer.finalize()
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_writer.py", line 584, in finalize
self.write_examples_on_file()
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_writer.py", line 448, in write_examples_on_file
self.write_batch(batch_examples=batch_examples)
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_writer.py", line 553, in write_batch
arrays.append(pa.array(typed_sequence))
File "pyarrow/array.pxi", line 236, in pyarrow.lib.array
File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_writer.py", line 223, in __arrow_array__
return pa.array(cast_to_python_objects(data, only_1d_for_numpy=True))
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 446, in cast_to_python_objects
return _cast_to_python_objects(
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 407, in _cast_to_python_objects
[
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 408, in <listcomp>
_cast_to_python_objects(
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 319, in _cast_to_python_objects
[
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 319, in <listcomp>
[
KeyboardInterrupt

stack trace 2:

(pyg)[d623204@rosetta-bigviz01 stage-laurent-f]$ python src/random_scripts/uses_random_data.py
Found cached dataset random_data (/local_scratch/lfainsin/.cache/huggingface/datasets/random_data/default/0.0.0/444e214e1d0e6298cfd3f2368323ec37073dc1439f618e19395b1f421c69b066)
Applying mean/std: 99%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 988/1000 [00:20<00:00, 526.19 examples/s]Applying mean/std: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊| 999/1000 [00:21<00:00, 9.66 examples/s]Traceback (most recent call last):
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_writer.py", line 179, in __arrow_array__
storage = to_pyarrow_listarray(data, pa_type)
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 1466, in to_pyarrow_listarray
return pa.array(data, pa_type.storage_dtype)
File "pyarrow/array.pxi", line 320, in pyarrow.lib.array
File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 123, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Could not convert tensor([[-1.0273, -0.8037, -0.6860],
[-0.5034, -1.2685, -0.0558],
[-1.0908, -1.1820, -0.3178],
...,
[-0.8171, 0.1781, -0.5903],
[ 0.4370, 1.9305, 0.5899],
[-0.1426, 0.9053, -1.7559]]) with type Tensor: was not a sequence or recognized null for conversion to list type
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3449, in _map_single
writer.write(example)
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_writer.py", line 490, in write
self.write_examples_on_file()
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_writer.py", line 448, in write_examples_on_file
self.write_batch(batch_examples=batch_examples)
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_writer.py", line 553, in write_batch
arrays.append(pa.array(typed_sequence))
File "pyarrow/array.pxi", line 236, in pyarrow.lib.array
File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_writer.py", line 223, in __arrow_array__
return pa.array(cast_to_python_objects(data, only_1d_for_numpy=True))
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 446, in cast_to_python_objects
return _cast_to_python_objects(
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 407, in _cast_to_python_objects
[
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 408, in <listcomp>
_cast_to_python_objects(
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 319, in _cast_to_python_objects
[
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 320, in <listcomp>
_cast_to_python_objects(
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 263, in _cast_to_python_objects
def _cast_to_python_objects(obj: Any, only_1d_for_numpy: bool, optimize_list_casting: bool) -> Tuple[Any, bool]:
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_writer.py", line 179, in __arrow_array__
storage = to_pyarrow_listarray(data, pa_type)
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 1466, in to_pyarrow_listarray
return pa.array(data, pa_type.storage_dtype)
File "pyarrow/array.pxi", line 320, in pyarrow.lib.array
File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 123, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Could not convert tensor([[-1.0273, -0.8037, -0.6860],
[-0.5034, -1.2685, -0.0558],
[-1.0908, -1.1820, -0.3178],
...,
[-0.8171, 0.1781, -0.5903],
[ 0.4370, 1.9305, 0.5899],
[-0.1426, 0.9053, -1.7559]]) with type Tensor: was not a sequence or recognized null for conversion to list type
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/gpfs_new/data/users/lfainsin/stage-laurent-f/src/random_scripts/uses_random_data.py", line 62, in <module>
ds_normalized = ds.map(
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 580, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 545, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3087, in map
for rank, done, content in Dataset._map_single(**dataset_kwargs):
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3492, in _map_single
writer.finalize()
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_writer.py", line 584, in finalize
self.write_examples_on_file()
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_writer.py", line 448, in write_examples_on_file
self.write_batch(batch_examples=batch_examples)
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_writer.py", line 553, in write_batch
arrays.append(pa.array(typed_sequence))
File "pyarrow/array.pxi", line 236, in pyarrow.lib.array
File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_writer.py", line 223, in __arrow_array__
return pa.array(cast_to_python_objects(data, only_1d_for_numpy=True))
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 446, in cast_to_python_objects
return _cast_to_python_objects(
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 407, in _cast_to_python_objects
[
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 408, in <listcomp>
_cast_to_python_objects(
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 319, in _cast_to_python_objects
[
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 320, in <listcomp>
_cast_to_python_objects(
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 291, in _cast_to_python_objects
if config.JAX_AVAILABLE and "jax" in sys.modules:
KeyboardInterrupt

stack trace 3:

(pyg)[d623204@rosetta-bigviz01 stage-laurent-f]$ python src/random_scripts/uses_random_data.py
Found cached dataset random_data (/local_scratch/lfainsin/.cache/huggingface/datasets/random_data/default/0.0.0/444e214e1d0e6298cfd3f2368323ec37073dc1439f618e19395b1f421c69b066)
Applying mean/std: 99%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 989/1000 [00:01<00:00, 504.80 examples/s]Traceback (most recent call last):
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_writer.py", line 179, in __arrow_array__
storage = to_pyarrow_listarray(data, pa_type)
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 1466, in to_pyarrow_listarray
return pa.array(data, pa_type.storage_dtype)
File "pyarrow/array.pxi", line 320, in pyarrow.lib.array
File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 123, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Could not convert tensor([[-1.0273, -0.8037, -0.6860],
[-0.5034, -1.2685, -0.0558],
[-1.0908, -1.1820, -0.3178],
...,
[-0.8171, 0.1781, -0.5903],
[ 0.4370, 1.9305, 0.5899],
[-0.1426, 0.9053, -1.7559]]) with type Tensor: was not a sequence or recognized null for conversion to list type
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3449, in _map_single
writer.write(example)
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_writer.py", line 490, in write
self.write_examples_on_file()
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_writer.py", line 448, in write_examples_on_file
self.write_batch(batch_examples=batch_examples)
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_writer.py", line 553, in write_batch
arrays.append(pa.array(typed_sequence))
File "pyarrow/array.pxi", line 236, in pyarrow.lib.array
File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_writer.py", line 223, in __arrow_array__
return pa.array(cast_to_python_objects(data, only_1d_for_numpy=True))
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 446, in cast_to_python_objects
return _cast_to_python_objects(
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 407, in _cast_to_python_objects
[
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 408, in <listcomp>
_cast_to_python_objects(
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 319, in _cast_to_python_objects
[
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 320, in <listcomp>
_cast_to_python_objects(
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_writer.py", line 179, in __arrow_array__
storage = to_pyarrow_listarray(data, pa_type)
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 1466, in to_pyarrow_listarray
return pa.array(data, pa_type.storage_dtype)
File "pyarrow/array.pxi", line 320, in pyarrow.lib.array
File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 123, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Could not convert tensor([[-1.0273, -0.8037, -0.6860],
[-0.5034, -1.2685, -0.0558],
[-1.0908, -1.1820, -0.3178],
...,
[-0.8171, 0.1781, -0.5903],
[ 0.4370, 1.9305, 0.5899],
[-0.1426, 0.9053, -1.7559]]) with type Tensor: was not a sequence or recognized null for conversion to list type
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/gpfs_new/data/users/lfainsin/stage-laurent-f/src/random_scripts/uses_random_data.py", line 62, in <module>
ds_normalized = ds.map(
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 580, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 545, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3087, in map
for rank, done, content in Dataset._map_single(**dataset_kwargs):
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3492, in _map_single
writer.finalize()
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_writer.py", line 584, in finalize
self.write_examples_on_file()
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_writer.py", line 448, in write_examples_on_file
self.write_batch(batch_examples=batch_examples)
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_writer.py", line 553, in write_batch
arrays.append(pa.array(typed_sequence))
File "pyarrow/array.pxi", line 236, in pyarrow.lib.array
File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/arrow_writer.py", line 223, in __arrow_array__
return pa.array(cast_to_python_objects(data, only_1d_for_numpy=True))
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 446, in cast_to_python_objects
return _cast_to_python_objects(
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 407, in _cast_to_python_objects
[
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 408, in <listcomp>
_cast_to_python_objects(
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 319, in _cast_to_python_objects
[
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 320, in <listcomp>
_cast_to_python_objects(
File "/local_scratch/lfainsin/.conda/envs/pyg/lib/python3.10/site-packages/datasets/features/features.py", line 298, in _cast_to_python_objects
if obj.ndim == 0:
KeyboardInterrupt
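Worth noting: the `ArrowTypeError` in these traces is raised because the mapped function returns `torch.Tensor` objects, which `datasets` then casts element by element through `_cast_to_python_objects`, the slow path visible in the interrupted frames above. A hedged workaround (a sketch assuming the column is called `data`, which isn't shown in the original script) is to convert tensors to NumPy before returning:

```python
import torch

def apply_mean_std(example):
    t = torch.as_tensor(example["data"])
    normalized = (t - t.mean(dim=0)) / t.std(dim=0)
    # NumPy arrays are serialized directly by the Arrow writer, avoiding the
    # element-wise Python cast that torch tensors fall back to.
    return {"data": normalized.numpy()}
```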
Same issue with the following code:

```python
from datasets import load_dataset
from torchvision.transforms import transforms

path = "~/dataset/diffusiondb50k"  # path maybe not necessary
dataset = load_dataset("poloclub/diffusiondb", "2m_first_1k", data_dir=path)
transform = transforms.Compose([transforms.ToTensor()])
dataset = dataset.map(
    lambda x: {
        'image': transform(x['image']),
        'prompt': x['prompt'],
        'width': x['width'],
        'height': x['height'],
    },
    # num_proc=4,
)
dataset
```

And the mapping gets stuck in the same way. Also, there is 1 process left in […]

Environment Info: […]
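If the tensors don't actually need to be materialized on disk, one possible alternative (a sketch, assuming lazy on-access transforms are acceptable for this use case) is `set_transform`, which applies the function at access time instead of pushing tensors through the Arrow writer:

```python
from datasets import load_dataset
from torchvision.transforms import transforms

dataset = load_dataset("poloclub/diffusiondb", "2m_first_1k")
to_tensor = transforms.Compose([transforms.ToTensor()])

def on_access(batch):
    # Runs lazily on __getitem__; nothing is written back to Arrow.
    batch["image"] = [to_tensor(img) for img in batch["image"]]
    return batch

dataset.set_transform(on_access)
```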
Hi @zmoki688, I've since noticed that it's pretty common for disk writes to lag behind the operations performed by the […]
Describe the bug
Hi!

I'm currently working with a large (~150 GB) unnormalized dataset at work.

The dataset is available on a read-only filesystem internally, and I use a loading script to retrieve it.

I want to normalize the features of the dataset, meaning I need to compute the mean and standard deviation for each feature over the entire dataset. I cannot load the entire dataset into RAM as it is too big, so following this discussion on the Hugging Face Discourse I am using one `map` operation to first compute the metrics and a second `map` operation to apply them to the dataset.
The problem lies in the second mapping, as it gets stuck at ~99%. Checking what the process does (using `htop` and `strace`), it seems to be doing a lot of I/O operations, and I'm not sure why.

Obviously, I could always normalize the dataset externally and then load it with a loading script. However, since the internal dataset is updated fairly frequently, using the library to perform the normalization automatically would make things much easier for me.
Steps to reproduce the bug
I'm able to reproduce the problem using the following scripts:
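The scripts themselves are not reproduced here; a minimal sketch of the two-`map` approach described above (the column name and shapes are hypothetical, `apply_mean_std` is the name used later in this report) might look like:

```python
import numpy as np
from datasets import Dataset

# Stand-in for the real dataset: 1000 examples of (100, 3) float arrays.
ds = Dataset.from_dict(
    {"data": [np.random.rand(100, 3).astype(np.float32) for _ in range(1000)]}
)

# First map: accumulate per-feature statistics without loading everything into RAM.
stats = {"sum": np.zeros(3), "sq_sum": np.zeros(3), "n": 0}

def accumulate(example):
    arr = np.asarray(example["data"])
    stats["sum"] += arr.sum(axis=0)
    stats["sq_sum"] += (arr ** 2).sum(axis=0)
    stats["n"] += arr.shape[0]
    return example

ds.map(accumulate)

mean = stats["sum"] / stats["n"]
std = np.sqrt(stats["sq_sum"] / stats["n"] - mean**2)

# Second map: apply the statistics (this is the step that gets stuck at ~99%).
def apply_mean_std(example):
    arr = np.asarray(example["data"])
    return {"data": (arr - mean) / std}

ds_normalized = ds.map(apply_mean_std)
```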
Expected behavior
Using the previous scripts, the `ds_normalized` mapping completes in ~5 minutes, but any subsequent use of `ds_normalized` is really slow; for example, reapplying `apply_mean_std` to `ds_normalized` takes forever. This is very strange; I'm sure I must be missing something, but I would still expect this to be faster.

Environment info

- `datasets` version: 2.13.1