NCCL:Broadcast collectives are missing from the converted trace but present in the trace_link #161
Comments
I tested the same steps with the latest version of Chakra (installed from the repository, 16 Oct); the behavior is the same.
While looking into this issue, I observed that the nccl:broadcast operation is a CPU operation and therefore does not pass this check from pytorch_converter.py:
What is the reason this collective is not included in the converted trace?
Describe the Bug
After running a ResNet50 or TinyLlama2 workload on 4 ranks, I see at least one nccl:broadcast collective in the Kineto trace. The same collective appears in the trace_link file, but it is no longer present in the converted trace. Is this expected behavior, or is it an issue on the Chakra converter side?
I looked through the converter implementation but found no indication that broadcast collectives should be dropped. Is there something I missed?
Steps to Reproduce
Using the Chakra version from 6 Sept, after the merge of commit #140.
Expected Behavior
See the nccl:broadcast collective in the converted trace.
Screenshots
This is the trace_link file; the broadcast collective is present.
This is the converted trace in JSON format; no broadcast collective can be found (the search result is at the bottom of the picture).
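The manual search shown in the screenshots could also be scripted. Below is a minimal sketch, not part of Chakra, that walks a JSON document and counts entries whose "name" field contains nccl:broadcast. It assumes that each file is a single JSON document and that nodes carry a "name" key; the helper names `iter_named_nodes` and `count_in_file` are hypothetical, and the real trace layouts may differ.

```python
import json
from typing import Any, Iterator


def iter_named_nodes(obj: Any, needle: str) -> Iterator[dict]:
    """Yield every dict in a nested JSON structure whose "name" contains needle.

    Assumption: nodes are JSON objects with a string "name" field.
    The match is case-insensitive.
    """
    if isinstance(obj, dict):
        name = obj.get("name")
        if isinstance(name, str) and needle.lower() in name.lower():
            yield obj
        # Recurse into all values so nested nodes are found as well.
        for value in obj.values():
            yield from iter_named_nodes(value, needle)
    elif isinstance(obj, list):
        for item in obj:
            yield from iter_named_nodes(item, needle)


def count_in_file(path: str, needle: str = "nccl:broadcast") -> int:
    """Count matching nodes in one JSON file (assumed to be a single document)."""
    with open(path) as f:
        return sum(1 for _ in iter_named_nodes(json.load(f), needle))
```

Running `count_in_file` on both the trace_link file and the converted JSON trace would give two counts to compare; a non-zero count for the first and zero for the second would match the behavior described above.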