-
Notifications
You must be signed in to change notification settings - Fork 110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix TSAN violations for Distribution ops #1768
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll need to spend more time thinking about this, brain not working this late on a Friday. Shared some initial thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, could you confirm that TSAN is clean at this final revision before merge?
42492b1
to
a480caa
Compare
The TSAN violation is due to an unprotected insert shown above and replacing `llvm::RefCountedBase` with its thread-safe alternative `llvm::ThreadSafeRefCountedBase`. In the original implementation of `rendezvous`, there were some nuances that were undesirable. For example, it's preferred to clear out the entry once all processes have read the data and clear out the map instead of letting the next distribution op to clear it out. The current implementation is also incorrect since there is no guarantee that processGroup size is the same as previous op calling `rendezvous`. ``` if (channels_[channelKey].size() == processGroup.size()) channels_[channelKey].clear(); channels_[channelKey].insert(processId, operand); ``` The proposed update enhances the `rendezvous` logic such that each process has a shared pointer to the result object once out of scope, it is automatically deleted. The PR also reenables tests for distribution ops. closes openxla#1755
The TSAN violation is due to an unprotected insert shown above and replacing `llvm::RefCountedBase` with its thread-safe alternative `llvm::ThreadSafeRefCountedBase`. In the original implementation of `rendezvous`, there were some nuances that were undesirable. For example, it's preferred to clear out the entry once all processes have read the data and clear out the map instead of letting the next distribution op to clear it out. The current implementation is also incorrect since there is no guarantee that processGroup size is the same as previous op calling `rendezvous`. ``` if (channels_[channelKey].size() == processGroup.size()) channels_[channelKey].clear(); channels_[channelKey].insert(processId, operand); ``` The proposed update enhances the `rendezvous` logic such that each process has a shared pointer to the result object once out of scope, it is automatically deleted. The PR also reenables tests for distribution ops. closes openxla#1755
The TSAN violation is due to an unprotected insert shown above and replacing
llvm::RefCountedBase
with its thread-safe alternativellvm::ThreadSafeRefCountedBase
.In the original implementation of
rendezvous
, there were some nuances that were undesirable.For example, it's preferred to clear out the entry once all processes have read the data and clear out the map instead of letting the next distribution op to clear it out. The current implementation is also incorrect since there is no guarantee that processGroup size is the same as previous op calling
rendezvous
.The proposed update enhances the
rendezvous
logic such that each process has a shared pointer to the result object once out of scope, it is automatically deleted.The PR also reenables tests for distribution ops.
closes #1755