Add a new experimental restart policy for large-scale model training #922
Conversation
This pull request was exported from Phabricator. Differential Revision: D58684341
lgtm
torchx/specs/api.py (Outdated)

```python
    REPLICA = "REPLICA"
    APPLICATION = "APPLICATION"
    QUORUM = "QUORUM"
```
Will the quorum restart policy be checked in? I'm only seeing the specs changes here.
It's a Meta-only change to some schedulers; there is nothing that has to be handled for existing schedulers at the moment. I didn't want to create an internal specialization of the retry policy, as that would add more changes and confusion, and the things in this area aren't expected to be supported by all schedulers anyway.
For existing open-source schedulers (e.g. Ray/Volcano) that support some form of elasticity, we use the num_replicas and min_replicas fields below to detect elastic behavior. Could we use that as a trigger for this behavior instead?
I'm not totally opposed to adding a new restart policy, but it's unclear what its defined semantics are. Currently RetryPolicy defines service vs. gang scheduling; it's not clear that quorum is orthogonal behavior to that.
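A hedged sketch of what that min_replicas/num_replicas trigger could look like; the Role dataclass here is a simplified stand-in, and only the two field names come from the comment above:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Role:
    """Simplified stand-in for a torchx Role, reduced to the fields discussed."""
    name: str
    num_replicas: int = 1
    min_replicas: Optional[int] = None  # when set, the role may run degraded

def is_elastic(role: Role) -> bool:
    # Detect elastic behavior as described above: the role can make
    # progress with fewer than num_replicas replicas.
    return role.min_replicas is not None and role.min_replicas < role.num_replicas

print(is_elastic(Role("trainer", num_replicas=8, min_replicas=4)))  # True
```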
Updated it to HOT_SPARES, just to differentiate this restart policy from elasticity in general for now, since we would implicitly be forking the role of min_replicas and num_replicas, which might make things more complicated in the future with MAST. We can see how things evolve on the MAST side and which version of elasticity/replacement we end up with.
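To illustrate the distinction being drawn, a hypothetical sketch based on this discussion (not torchx code): under HOT_SPARES the surplus between num_replicas and min_replicas would act as a pool of pre-provisioned spares that replace failed replicas, rather than an elastic scaling range:

```python
def hot_spare_budget(num_replicas: int, min_replicas: int) -> int:
    # Hypothetical arithmetic: replicas that may fail and be swapped in
    # from the spare pool before the job has to be restarted.
    return max(num_replicas - min_replicas, 0)

# e.g. request 16 replicas, require 14 healthy -> 2 hot spares
assert hot_spare_budget(16, 14) == 2
```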
@kiukchung I think this is generic enough, and now well-defined enough, to be included in the core APIs. Nothing else supports it currently, but if anyone uses it, it'll just throw an error.
https://github.com/pytorch/torchx/blob/main/torchx/schedulers/kubernetes_scheduler.py#L396
We could throw a better error rather than just a KeyError.
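A hedged sketch of what a friendlier failure mode could look like; the helper name and message are illustrative, not the actual kubernetes_scheduler.py code:

```python
from typing import Dict

def resolve_retry_policy(policy: str, supported: Dict[str, str]) -> str:
    # Illustrative helper: raise an actionable error instead of a bare
    # KeyError when the scheduler does not implement the requested policy.
    try:
        return supported[policy]
    except KeyError:
        raise ValueError(
            f"scheduler does not support retry policy {policy!r}; "
            f"supported policies: {sorted(supported)}"
        ) from None
```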
LGTM
Chatted offline with Tristan as well and updated the description. Need to get this in for testing; happy to move to an internal-specific retry policy as a follow-up if there are strong opinions.
Summary: TSIA
Differential Revision: D58684341