[BUG] NCCL_ERROR in all Dask cuML examples #353
keuperj added the "? - Needs Triage" (Need team to review and classify) and "bug" (Something isn't working) labels on Aug 5, 2021
Describe the bug
I get the following error:
RuntimeError: NCCL_ERROR: b'unhandled system error'
when running the cuML examples in the container on Dask.
Steps/Code to reproduce bug
Set up a Dask scheduler and several Dask worker nodes using the container. Dask itself works in this setup: the Dashboard shows the workers, and the Scikit-Learn reference examples also run. The setup also works in a single-node/multi-GPU setting with cuML on Dask. Only the distributed (multi-node) GPU case fails.
Expected behavior
That distributed cuML on Dask would work
Environment details (please complete the following information):
Additional context
My guess would be that NCCL can't communicate through the docker container.
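To help confirm or rule out that guess, here is a minimal sketch of how the worker environment could be prepared so that NCCL logs more detail. `NCCL_DEBUG` and `NCCL_SOCKET_IFNAME` are standard NCCL environment variables; the helper name `nccl_debug_env` and the interface name `eth0` are only assumptions for illustration and would need to match the actual container network.

```python
import os


def nccl_debug_env(base_env=None, socket_ifname=None):
    """Return a copy of an environment with NCCL diagnostics enabled.

    base_env: mapping to copy (defaults to the current process environment).
    socket_ifname: optional network interface NCCL should use (assumption:
    the right value depends on how the container network is configured).
    """
    env = dict(os.environ if base_env is None else base_env)
    # Ask NCCL to print initialization and transport details.
    env["NCCL_DEBUG"] = "INFO"
    if socket_ifname is not None:
        # Pin NCCL to one interface instead of letting it auto-detect.
        env["NCCL_SOCKET_IFNAME"] = socket_ifname
    return env


if __name__ == "__main__":
    env = nccl_debug_env(socket_ifname="eth0")  # "eth0" is a placeholder
    print({k: v for k, v in env.items() if k.startswith("NCCL_")})
```

The resulting mapping could be passed as `env=` to `subprocess.Popen` for whatever command starts each worker, so that the NCCL log lines show which interface and transport are selected before the "unhandled system error" appears.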