Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix TSAN violations for Distribution ops #1768

Merged
merged 3 commits into from
Sep 12, 2023
Merged

Conversation

ghpvnist
Copy link
Member

@ghpvnist ghpvnist commented Sep 2, 2023

The TSAN violation is due to an unprotected insert shown above and replacing llvm::RefCountedBase with its thread-safe alternative llvm::ThreadSafeRefCountedBase.

In the original implementation of rendezvous, there were some nuances that were undesirable.
For example, it's preferred to clear out the entry once all processes have read the data and clear out the map instead of letting the next distribution op to clear it out. The current implementation is also incorrect since there is no guarantee that processGroup size is the same as previous op calling rendezvous.

if (channels_[channelKey].size() == processGroup.size())
  channels_[channelKey].clear();

channels_[channelKey].insert(processId, operand);

The proposed update enhances the rendezvous logic such that each process has a shared pointer to the result object once out of scope, it is automatically deleted.

The PR also reenables tests for distribution ops.

closes #1755

Copy link
Member

@GleasonK GleasonK left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll need to spend more time thinking about this, brain not working this late on a Friday. Shared some initial thoughts.

stablehlo/reference/Tensor.h Show resolved Hide resolved
stablehlo/reference/ProcessGrid.cpp Outdated Show resolved Hide resolved
stablehlo/reference/ProcessGrid.cpp Outdated Show resolved Hide resolved
stablehlo/reference/Process.cpp Outdated Show resolved Hide resolved
stablehlo/reference/ProcessGrid.cpp Outdated Show resolved Hide resolved
stablehlo/reference/ProcessGrid.cpp Outdated Show resolved Hide resolved
stablehlo/reference/ProcessGrid.cpp Show resolved Hide resolved
stablehlo/reference/ProcessGrid.cpp Show resolved Hide resolved
stablehlo/reference/ProcessGrid.cpp Show resolved Hide resolved
@GleasonK GleasonK assigned ghpvnist and unassigned GleasonK Sep 7, 2023
@ghpvnist ghpvnist assigned GleasonK and unassigned ghpvnist Sep 7, 2023
Copy link
Member

@GleasonK GleasonK left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, could you confirm that TSAN is clean at this final revision before merge?

@GleasonK GleasonK assigned ghpvnist and unassigned GleasonK Sep 12, 2023
@ghpvnist ghpvnist merged commit d17a137 into openxla:main Sep 12, 2023
7 checks passed
@ghpvnist ghpvnist deleted the thread_fix branch September 13, 2023 18:31
penagos pushed a commit to penagos/stablehlo that referenced this pull request Sep 21, 2023
The TSAN violation is due to an unprotected insert shown above and
replacing `llvm::RefCountedBase` with its thread-safe alternative
`llvm::ThreadSafeRefCountedBase`.

In the original implementation of `rendezvous`, there were some nuances
that were undesirable.
For example, it's preferred to clear out the entry once all processes
have read the data and clear out the map instead of letting the next
distribution op to clear it out. The current implementation is also
incorrect since there is no guarantee that processGroup size is the same
as previous op calling `rendezvous`.
```
if (channels_[channelKey].size() == processGroup.size())
  channels_[channelKey].clear();

channels_[channelKey].insert(processId, operand);
```

The proposed update enhances the `rendezvous` logic such that each
process has a shared pointer to the result object once out of scope, it
is automatically deleted.

The PR also reenables tests for distribution ops.

closes openxla#1755
penagos pushed a commit to penagos/stablehlo that referenced this pull request Sep 21, 2023
The TSAN violation is due to an unprotected insert shown above and
replacing `llvm::RefCountedBase` with its thread-safe alternative
`llvm::ThreadSafeRefCountedBase`.

In the original implementation of `rendezvous`, there were some nuances
that were undesirable.
For example, it's preferred to clear out the entry once all processes
have read the data and clear out the map instead of letting the next
distribution op to clear it out. The current implementation is also
incorrect since there is no guarantee that processGroup size is the same
as previous op calling `rendezvous`.
```
if (channels_[channelKey].size() == processGroup.size())
  channels_[channelKey].clear();

channels_[channelKey].insert(processId, operand);
```

The proposed update enhances the `rendezvous` logic such that each
process has a shared pointer to the result object once out of scope, it
is automatically deleted.

The PR also reenables tests for distribution ops.

closes openxla#1755
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Fix TSAN violations for Distribution ops
2 participants