Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a new experimental restart policy for large scale model training #922

Merged
merged 1 commit into from
Jun 18, 2024

Conversation

manav-a
Copy link
Contributor

@manav-a manav-a commented Jun 17, 2024

Summary: TSIA

Differential Revision: D58684341

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 17, 2024
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D58684341

Copy link
Contributor

@andywag andywag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

"""

REPLICA = "REPLICA"
APPLICATION = "APPLICATION"
QUORUM = "QUORUM"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will the quorum restart policy be checked in? I'm only seeing the specs changes here.

Copy link
Contributor Author

@manav-a manav-a Jun 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its a meta only change to some schedulers there is nothing that has to be handled for existing schedulers at the moment. I didnt want to create an internal specialization of retry policy as that will add more changes and confusion and things in this are anyway not expected to be supported by all schedulers

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For existing open source schedulers (i.e. Ray/Volcano) that support some form of elasticity we use num_replicas and min_replicas fields below to detect elastic behavior

Could we use that as a trigger for this behavior instead?

I'm not totally opposed to adding in a new restart policy but it's unclear what the defined semantics for this is. Currently RetryPolicy defines service vs gang scheduling -- it's not clear that quorum is orthogonal behavior to that.

Copy link
Contributor Author

@manav-a manav-a Jun 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated it to HOT_SPARES just to differentiate between this restart policy and elasticity in general for now as we will implicitly be forking the role of the use of min replicas and num replicas which might make things more complicated in the future with MAST. We can see how things evolve on the MAST side and what version of elaticity of replacement we end up with

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kiukchung I think this is generic enough and now well defined enough to be included in the core APIs. Nothing else supports it currently but if anyone uses it it'll just throw an error.

https://github.com/pytorch/torchx/blob/main/torchx/schedulers/kubernetes_scheduler.py#L396

We could throw a better error rather than just KeyError

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D58684341

manav-a added a commit to manav-a/torchx that referenced this pull request Jun 18, 2024
…ytorch#922)

Summary:
Pull Request resolved: pytorch#922

TSIA

Reviewed By: andywag

Differential Revision: D58684341
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D58684341

manav-a added a commit to manav-a/torchx that referenced this pull request Jun 18, 2024
…ytorch#922)

Summary:
Pull Request resolved: pytorch#922

TSIA

Reviewed By: andywag

Differential Revision: D58684341
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D58684341

manav-a added a commit to manav-a/torchx that referenced this pull request Jun 18, 2024
…ytorch#922)

Summary:
Pull Request resolved: pytorch#922

TSIA

Reviewed By: andywag

Differential Revision: D58684341
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D58684341

manav-a added a commit to manav-a/torchx that referenced this pull request Jun 18, 2024
…ytorch#922)

Summary:
Pull Request resolved: pytorch#922

TSIA

Reviewed By: andywag

Differential Revision: D58684341
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D58684341

manav-a added a commit to manav-a/torchx that referenced this pull request Jun 18, 2024
…ytorch#922)

Summary:
Pull Request resolved: pytorch#922

TSIA

Reviewed By: andywag

Differential Revision: D58684341
@manav-a manav-a requested a review from kiukchung June 18, 2024 15:55
…ytorch#922)

Summary:
Pull Request resolved: pytorch#922

TSIA

Reviewed By: andywag

Differential Revision: D58684341
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D58684341

Copy link
Contributor

@d4l3k d4l3k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@manav-a manav-a dismissed kiukchung’s stale review June 18, 2024 20:00

Chatted offline with Tristan as well updated the description. Need to get this in for testing happy to move to internal specific retry policy if strong opinions as a follow up

@facebook-github-bot facebook-github-bot merged commit cb1fec1 into pytorch:main Jun 18, 2024
23 of 24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants