Can we please have a more helpful error message? #5623

Open

quanvuong opened this issue Sep 6, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@quanvuong

Version

1.40

Describe the bug.

I received this stack trace today. It shows that sending tasks to workers failed, but it gives no hint about how to fix the issue or diagnose the underlying cause.

  File "/home/monopi/monopi/model/scripts/train.py", line 595, in main
    batch = next(train_iter)
            ^^^^^^^^^^^^^^^^
  File "/home/monopi/monopi/model/data/drop_invalid.py", line 95, in __next__
    batch = next(self._data_iter)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/home/monopi/monopi/model/data/dali_iterator.py", line 322, in __next__
    data = next(self.iterator)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/plugin/jax/iterator.py", line 189, in __next__
    return self._next_impl()
           ^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/plugin/jax/iterator.py", line 183, in _next_impl
    self._schedule_runs()
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/plugin/base_iterator.py", line 420, in _schedule_runs
    p.schedule_run()
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/pipeline.py", line 1183, in schedule_run
    self._run_once()
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/pipeline.py", line 1400, in _run_once
    self._iter_setup()
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/pipeline.py", line 1671, in _iter_setup
    iters, success = self._run_input_callbacks()
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/pipeline.py", line 1692, in _run_input_callbacks
    group.schedule_and_receive(
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/external_source.py", line 258, in schedule_and_receive
    self.prefetch(pool, context_i, batch_size, epoch_idx)
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/external_source.py", line 227, in prefetch
    while context.scheduled_ahead < self.prefetch_queue_depth and self.schedule_batch(
                                                                  ^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/external_source.py", line 244, in schedule_batch
    return pool.schedule_batch(context_i, work_batch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/_multiproc/pool.py", line 786, in schedule_batch
    self._distribute(context_i, scheduled_i, dst_chunk_i, minibatches)
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/_multiproc/pool.py", line 819, in _distribute
    self.pool.send(scheduled_tasks, dedicated_worker_id)
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/_multiproc/pool.py", line 466, in send
    raise RuntimeError("Sending tasks to workers failed")
RuntimeError: Sending tasks to workers failed
Traceback (most recent call last):
  File "/home/monopi/monopi/model/scripts/train.py", line 836, in <module>
    register_cfg.cli_with_selectable_config(main)
  File "/home/monopi/monopi/model/configs/registered_configs.py", line 1327, in cli_with_selectable_config
    return tyro.cli(f, args=sys.argv[3:])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/tyro/_cli.py", line 217, in cli
    return run_with_args_from_cli()
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/monopi/monopi/model/scripts/train.py", line 595, in main
    batch = next(train_iter)
            ^^^^^^^^^^^^^^^^
  File "/home/monopi/monopi/model/data/drop_invalid.py", line 95, in __next__
    batch = next(self._data_iter)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/home/monopi/monopi/model/data/dali_iterator.py", line 322, in __next__
    data = next(self.iterator)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/plugin/jax/iterator.py", line 189, in __next__
    return self._next_impl()
           ^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/plugin/jax/iterator.py", line 183, in _next_impl
    self._schedule_runs()
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/plugin/base_iterator.py", line 420, in _schedule_runs
    p.schedule_run()
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/pipeline.py", line 1183, in schedule_run
    self._run_once()
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/pipeline.py", line 1400, in _run_once
    self._iter_setup()
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/pipeline.py", line 1671, in _iter_setup
    iters, success = self._run_input_callbacks()
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/pipeline.py", line 1692, in _run_input_callbacks
    group.schedule_and_receive(
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/external_source.py", line 258, in schedule_and_receive
    self.prefetch(pool, context_i, batch_size, epoch_idx)
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/external_source.py", line 227, in prefetch
    while context.scheduled_ahead < self.prefetch_queue_depth and self.schedule_batch(
                                                                  ^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/external_source.py", line 244, in schedule_batch
    return pool.schedule_batch(context_i, work_batch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/_multiproc/pool.py", line 786, in schedule_batch
    self._distribute(context_i, scheduled_i, dst_chunk_i, minibatches)
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/_multiproc/pool.py", line 819, in _distribute
    self.pool.send(scheduled_tasks, dedicated_worker_id)
  File "/opt/conda/lib/python3.11/site-packages/nvidia/dali/_multiproc/pool.py", line 466, in send
    raise RuntimeError("Sending tasks to workers failed")
RuntimeError: Sending tasks to workers failed

Minimum reproducible example

No response

Relevant log output

No response

Other/Misc.

No response

Check for duplicates

  • I have searched the open bugs/issues and have found no duplicates for this bug report
@quanvuong quanvuong added the bug Something isn't working label Sep 6, 2024
@stiepan
Member

stiepan commented Sep 11, 2024

Hi @quanvuong,

Definitely, the message is a bit cryptic. Improved error reporting for worker-process failures is in our backlog.

In the meantime, the direct cause of the error is that the IPC channel from the main process to the worker processes was shut down, which in all likelihood means that a worker process exited unexpectedly.

While regular exceptions raised by the source while producing the next sample/batch should be forwarded and reported properly, any lower-level failure - in particular, a worker process being killed by the OS due to an invalid memory access, running out of memory, or hitting the OS shared-memory limit - will result in this kind of message pointing to the communication with the worker processes.
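
Until the improved reporting lands, one way to localize such a failure is to temporarily run the source callback inline in the main process, so the original exception (or crash) surfaces directly instead of being masked by the broken worker IPC. Below is a minimal sketch; my_source, my_pipe, the batch size, and the worker counts are placeholders for illustration, not taken from this issue:

import numpy as np
from nvidia.dali import fn, pipeline_def

def my_source(sample_info):
    # Hypothetical stand-in for the training data callback. With
    # parallel=True, a fatal failure in here (OOM kill, invalid native
    # memory access, shared-memory exhaustion) terminates the worker
    # process and shows up in the main process only as
    # "Sending tasks to workers failed".
    return np.full((2, 2), sample_info.idx_in_epoch, dtype=np.int32)

@pipeline_def
def my_pipe(parallel):
    return fn.external_source(source=my_source, batch=False, parallel=parallel)

# Debugging configuration: parallel=False runs the callback inline in the
# main process, so the underlying error is reported directly.
debug_pipe = my_pipe(parallel=False, batch_size=16, num_threads=2, device_id=0)
debug_pipe.build()
out, = debug_pipe.run()

# Regular (fast) configuration, once the callback is known to be healthy:
pipe = my_pipe(parallel=True, batch_size=16, num_threads=2, device_id=0,
               py_num_workers=4, py_start_method="spawn")

If the callback itself turns out to be healthy and workers are being killed from the outside, the kernel log is usually the quickest confirmation: an OOM kill is visible in dmesg, and in containers a too-small /dev/shm (Docker defaults to 64 MB) is a frequent cause of shared-memory failures.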
