Add user-settable server-side Task restart policy, per-AlchemicalNetwork #277

Open
dotsdl opened this issue Jun 5, 2024 · 4 comments · May be fixed by #280

Comments

@dotsdl
Member

dotsdl commented Jun 5, 2024

As many Tasks are executed on compute services running on disparate resources, it's likely that random errors will impact some fraction of the Tasks, with a small, enumerable set of failure modes. Currently, users must examine error tracebacks themselves, then set the Tasks they wish to run again from error to waiting status. This gets tedious and requires many users to babysit their Tasks, even though many of these would complete successfully on rerun.

Instead, we would like to give users the ability to set a TaskRestartPolicy on an AlchemicalNetwork, which would encode a list of entries, each giving:

  • regex pattern of the traceback output to match
  • max number of retries to perform for matching errors
  • other options, such as how strongly to avoid a compute service with the same identifying information as one that previously failed on the Task.
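To make the idea concrete, here is a rough sketch of what such a policy could look like. This is not an existing or planned API: the `RestartPattern` class, the `should_restart` method, and all field names are hypothetical, chosen only to illustrate the first two items in the list above (a regex to match against the traceback, and a retry cap per pattern). The compute-service avoidance option is not modeled here.

```python
import re
from dataclasses import dataclass, field


@dataclass
class RestartPattern:
    """One policy entry (hypothetical): a traceback regex and a retry cap."""

    pattern: str      # regex searched for anywhere in the traceback text
    max_retries: int  # max number of restarts allowed for matching errors

    def matches(self, traceback: str) -> bool:
        return re.search(self.pattern, traceback) is not None


@dataclass
class TaskRestartPolicy:
    """Hypothetical per-AlchemicalNetwork policy: an ordered list of entries."""

    patterns: list[RestartPattern] = field(default_factory=list)

    def should_restart(self, traceback: str, retries_so_far: int) -> bool:
        # The first entry whose regex matches decides; if nothing matches,
        # the Task stays in error status for the user to inspect.
        for entry in self.patterns:
            if entry.matches(traceback):
                return retries_so_far < entry.max_retries
        return False


policy = TaskRestartPolicy(
    patterns=[
        RestartPattern(pattern=r"ConnectionResetError", max_retries=3),
        RestartPattern(pattern=r"out of memory", max_retries=1),
    ]
)

tb = "Traceback (most recent call last):\n  ...\nConnectionResetError: peer reset"
print(policy.should_restart(tb, retries_so_far=0))
print(policy.should_restart(tb, retries_so_far=3))
```

In this sketch the first matching entry wins, so more specific patterns would need to be listed before broader ones. A server-side process could call something like `should_restart` on each errored Task and move it back to waiting status when it returns `True`.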

Related to #258.
Likely requires #109 to be implemented in some form to periodically apply server-side restarts given the policies set.

@dotsdl
Member Author

dotsdl commented Jun 5, 2024

Thanks to @JenkeScheen for raising this issue in today's user group meeting!

@dotsdl
Member Author

dotsdl commented Jun 13, 2024

@ianmkenney would you be willing to begin work on this as a head start on the next major milestone? This is of high interest to users, so prioritizing it makes sense for us.

@dotsdl
Member Author

dotsdl commented Jul 12, 2024

@ianmkenney can you link your design doc here?

@ianmkenney
Collaborator

Here is the link to the design doc.
