I received this stack trace today. It shows that sending tasks to the workers failed, but it gives no hint on how to fix the issue or how to diagnose the underlying cause:
```
Traceback (most recent call last):
  File "/home/monopi/monopi/model/scripts/train.py", line 836, in <module>
    register_cfg.cli_with_selectable_config(main)
  File "/home/monopi/monopi/model/configs/registered_configs.py", line 1327, in cli_with_selectable_config
    return tyro.cli(f, args=sys.argv[3:])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/tyro/_cli.py", line 217, in cli
    return run_with_args_from_cli()
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/monopi/monopi/model/scripts/train.py", line 595, in main
    batch = next(train_iter)
            ^^^^^^^^^^^^^^^^
  File "/home/monopi/monopi/model/data/drop_invalid.py", line 95, in __next__
    batch = next(self._data_iter)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/home/monopi/monopi/model/data/dali_iterator.py", line 322, in __next__
    data = next(self.iterator)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/plugin/jax/iterator.py", line 189, in __next__
    return self._next_impl()
           ^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/plugin/jax/iterator.py", line 183, in _next_impl
    self._schedule_runs()
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/plugin/base_iterator.py", line 420, in _schedule_runs
    p.schedule_run()
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/pipeline.py", line 1183, in schedule_run
    self._run_once()
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/pipeline.py", line 1400, in _run_once
    self._iter_setup()
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/pipeline.py", line 1671, in _iter_setup
    iters, success = self._run_input_callbacks()
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/pipeline.py", line 1692, in _run_input_callbacks
    group.schedule_and_receive(
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/external_source.py", line 258, in schedule_and_receive
    self.prefetch(pool, context_i, batch_size, epoch_idx)
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/external_source.py", line 227, in prefetch
    while context.scheduled_ahead < self.prefetch_queue_depth and self.schedule_batch(
                                                                  ^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/external_source.py", line 244, in schedule_batch
    return pool.schedule_batch(context_i, work_batch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/_multiproc/pool.py", line 786, in schedule_batch
    self._distribute(context_i, scheduled_i, dst_chunk_i, minibatches)
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/_multiproc/pool.py", line 819, in _distribute
    self.pool.send(scheduled_tasks, dedicated_worker_id)
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/_multiproc/pool.py", line 466, in send
    raise RuntimeError("Sending tasks to workers failed")
RuntimeError: Sending tasks to workers failed
```
Minimum reproducible example
No response
Relevant log output
No response
Other/Misc.
No response
Check for duplicates
I have searched the open bugs/issues and have found no duplicates for this bug report
Definitely, the message is a bit cryptic. Improved error reporting for worker-process failures is in our backlog.
In the meantime: the direct cause of the error is that the IPC channel from the main process to the worker processes was shut down, which in all likelihood means a worker process exited unexpectedly.
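This failure mode can be reproduced in miniature with plain `multiprocessing` (a simplified stand-in, not DALI's actual pool code): once the child process has died, the parent's next attempt to send it work fails at the IPC layer rather than with the child's own error.

```python
import multiprocessing as mp
import os


def worker(conn):
    conn.recv()   # handle one task normally
    os._exit(1)   # simulate the worker dying abruptly (segfault, OOM kill, ...)


if __name__ == "__main__":
    parent_conn, child_conn = mp.Pipe()
    p = mp.Process(target=worker, args=(child_conn,))
    p.start()
    child_conn.close()          # parent keeps only its own end of the pipe
    parent_conn.send("task 0")  # the first task is delivered normally
    p.join()                    # the worker has exited by now
    try:
        parent_conn.send("task 1")
    except OSError as e:        # e.g. BrokenPipeError
        # analogous to DALI's "Sending tasks to workers failed"
        print("sending tasks to the worker failed:", type(e).__name__)
```

Note that the parent only sees a dead pipe; the reason the child died (here an `os._exit`) is not visible in the exception, which is why the DALI message cannot say more.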
While regular exceptions raised by the source while producing the next sample/batch should be forwarded and reported properly, any lower-level failure - in particular the worker process being killed by the OS due to an invalid memory access, running out of memory, or hitting the OS shared-memory limit - will result in this kind of message about the communication with the worker processes.
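The shared-memory limit in particular is easy to check: the parallel external source passes batches to the main process through shared memory, and in Docker containers `/dev/shm` defaults to only 64 MiB, which multi-worker loaders can exhaust quickly. A small diagnostic sketch (not DALI API; the 1 GiB threshold is an arbitrary choice for illustration):

```python
import os
import shutil
import tempfile


def shm_report(path=None):
    """Print and return the free space (bytes) of the shared-memory mount."""
    if path is None:
        # /dev/shm is the usual shared-memory mount on Linux;
        # fall back to the temp dir on systems that lack it.
        path = "/dev/shm" if os.path.isdir("/dev/shm") else tempfile.gettempdir()
    total, used, free = shutil.disk_usage(path)
    gib = 1024 ** 3
    print(f"{path}: total={total / gib:.2f} GiB, used={used / gib:.2f} GiB, "
          f"free={free / gib:.2f} GiB")
    return free


if __name__ == "__main__":
    if shm_report() < 1024 ** 3:  # warn below 1 GiB
        print("warning: shared memory may be too small for parallel workers")
```

If this runs low during training, raising the container's shared-memory size (e.g. Docker's `--shm-size`) is a common fix; an OOM kill would instead show up in the kernel log (`dmesg`).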
Version
1.40