Add a new experimental restart policy for large scale model training #922

manav-a · 2024-06-17T18:45:14Z

Summary: TSIA

Differential Revision: D58684341

facebook-github-bot · 2024-06-17T18:45:41Z

This pull request was exported from Phabricator. Differential Revision: D58684341

andywag

lgtm

kiukchung · 2024-06-17T22:59:05Z

torchx/specs/api.py

    """

    REPLICA = "REPLICA"
    APPLICATION = "APPLICATION"
+    QUORUM = "QUORUM"


will the quorum restart policy be checked in? I'm only seeing the specs changes here.

It's a Meta-only change for some schedulers; there is nothing that has to be handled for existing schedulers at the moment. I didn't want to create an internal specialization of the retry policy, as that would add more changes and confusion, and things here are not expected to be supported by all schedulers anyway.

For existing open source schedulers (e.g. Ray/Volcano) that support some form of elasticity, we use the num_replicas and min_replicas fields below to detect elastic behavior.

Could we use that as a trigger for this behavior instead?

I'm not totally opposed to adding a new restart policy, but it's unclear what the defined semantics for this are. Currently RetryPolicy defines service vs gang scheduling -- it's not clear that quorum is orthogonal behavior to that.
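For readers unfamiliar with the elasticity check mentioned in this comment, here is a minimal sketch of how the two fields could signal elastic behavior. RoleSketch and is_elastic are hypothetical names for illustration, not the torchx API, and the rule itself (min_replicas set and lower than num_replicas) is an assumption rather than what any scheduler is confirmed to do.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RoleSketch:
    """Hypothetical stand-in for a role spec; carries only the two fields discussed."""

    name: str
    num_replicas: int = 1
    min_replicas: Optional[int] = None  # None means no lower bound was requested


def is_elastic(role: RoleSketch) -> bool:
    """Assumed rule: elastic when min_replicas is set and below num_replicas."""
    return role.min_replicas is not None and role.min_replicas < role.num_replicas


# A fixed-size role and one that may shrink down to 4 replicas.
fixed = RoleSketch(name="trainer", num_replicas=8)
elastic = RoleSketch(name="trainer", num_replicas=8, min_replicas=4)
```

Under this sketch, elasticity is inferred from replica counts alone, which is why a separate restart policy value would be a distinct signal rather than a reuse of those fields.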

Updated it to HOT_SPARES for now, just to differentiate this restart policy from elasticity in general. Using min replicas and num replicas as the trigger would implicitly fork the role of those fields, which might make things more complicated in the future with MAST. We can see how things evolve on the MAST side and what version of elasticity of replacement we end up with.

@kiukchung I think this is generic enough and now well defined enough to be included in the core APIs. Nothing else supports it currently, but if anyone uses it, it'll just throw an error.

https://github.com/pytorch/torchx/blob/main/torchx/schedulers/kubernetes_scheduler.py#L396

We could throw a better error rather than just KeyError
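A minimal sketch of the "better error" idea, assuming a scheduler keeps a lookup table keyed by retry policy. RetryPolicySketch, SUPPORTED_ACTIONS, the action strings, and resolve_retry_action are all hypothetical names for illustration and are not taken from the Kubernetes scheduler code linked above.

```python
from enum import Enum
from typing import Dict


class RetryPolicySketch(str, Enum):
    """Hypothetical mirror of the retry policy values discussed in this thread."""

    REPLICA = "REPLICA"
    APPLICATION = "APPLICATION"
    HOT_SPARES = "HOT_SPARES"


# Hypothetical per-scheduler table: this scheduler handles only two policies.
SUPPORTED_ACTIONS: Dict[RetryPolicySketch, str] = {
    RetryPolicySketch.REPLICA: "restart-replica",
    RetryPolicySketch.APPLICATION: "restart-job",
}


def resolve_retry_action(
    policy: RetryPolicySketch, scheduler: str = "example_scheduler"
) -> str:
    """Look up the action for a policy, raising a descriptive error if unsupported."""
    try:
        return SUPPORTED_ACTIONS[policy]
    except KeyError:
        supported = ", ".join(p.value for p in SUPPORTED_ACTIONS)
        # Replace the bare KeyError with a message that names the policy and scheduler.
        raise ValueError(
            f"retry policy {policy.value} is not supported by scheduler "
            f"{scheduler}; supported policies: {supported}"
        ) from None
```

With this shape, a job that requests HOT_SPARES on a scheduler that lacks it fails with a message naming both the policy and the scheduler, instead of a bare KeyError.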


d4l3k

LGTM

Chatted offline with Tristan as well and updated the description. I need to get this in for testing; happy to move to an internal-specific retry policy as a follow-up if there are strong opinions.

facebook-github-bot added the CLA Signed label Jun 17, 2024

facebook-github-bot added the fb-exported label Jun 17, 2024

andywag approved these changes Jun 17, 2024

kiukchung previously requested changes Jun 17, 2024

andywag approved these changes Jun 18, 2024

manav-a added a commit to manav-a/torchx that referenced this pull request Jun 18, 2024

Add a new experimental restart policy for large scale model training (p…

b95f24c

…ytorch#922) Summary: Pull Request resolved: pytorch#922 TSIA Reviewed By: andywag Differential Revision: D58684341

manav-a force-pushed the export-D58684341 branch from 0622a97 to b95f24c Compare June 18, 2024 00:33

manav-a added a commit to manav-a/torchx that referenced this pull request Jun 18, 2024

Add a new experimental restart policy for large scale model training (p…

30e8e92

…ytorch#922) Summary: Pull Request resolved: pytorch#922 TSIA Reviewed By: andywag Differential Revision: D58684341

manav-a force-pushed the export-D58684341 branch from b95f24c to 30e8e92 Compare June 18, 2024 00:39

manav-a added a commit to manav-a/torchx that referenced this pull request Jun 18, 2024

Add a new experimental restart policy for large scale model training (p…

496d0a4

…ytorch#922) Summary: Pull Request resolved: pytorch#922 TSIA Reviewed By: andywag Differential Revision: D58684341

manav-a force-pushed the export-D58684341 branch from 30e8e92 to 496d0a4 Compare June 18, 2024 00:45

manav-a added a commit to manav-a/torchx that referenced this pull request Jun 18, 2024

Add a new experimental restart policy for large scale model training (p…

5207938

…ytorch#922) Summary: Pull Request resolved: pytorch#922 TSIA Reviewed By: andywag Differential Revision: D58684341

manav-a force-pushed the export-D58684341 branch from 496d0a4 to 5207938 Compare June 18, 2024 00:49

manav-a added a commit to manav-a/torchx that referenced this pull request Jun 18, 2024

Add a new experimental restart policy for large scale model training (p…

8ec3aaa

…ytorch#922) Summary: Pull Request resolved: pytorch#922 TSIA Reviewed By: andywag Differential Revision: D58684341

manav-a force-pushed the export-D58684341 branch from 5207938 to 8ec3aaa Compare June 18, 2024 01:03

manav-a requested a review from kiukchung June 18, 2024 15:55

Add a new experimental restart policy for large scale model training (p…

ec8ff02

…ytorch#922) Summary: Pull Request resolved: pytorch#922 TSIA Reviewed By: andywag Differential Revision: D58684341

manav-a force-pushed the export-D58684341 branch from 8ec3aaa to ec8ff02 Compare June 18, 2024 17:27

d4l3k approved these changes Jun 18, 2024

facebook-github-bot merged commit cb1fec1 into pytorch:main Jun 18, 2024
23 of 24 checks passed

d4l3k mentioned this pull request Jul 19, 2024

specs: add ROLE restart policy #936

Merged
