[BUG] Spare Resources Created by groupBy Tag #109

Open
GaoxiangLuo opened this issue Apr 21, 2022 · 0 comments
@GaoxiangLuo
Collaborator

Describe the bug
The system creates one aggregator (either a top aggregator or a middle aggregator) for each value listed in the groupBy tag, i.e., one pod per groupBy value. However, not all of these pods are necessarily used.

For instance, if schema.json specifies two tags, default/us and default/eu, but there is only one dataset, with realm default/us/west, the system creates three agents (two aggregators and one trainer). The trainer and one of the aggregators do what they are supposed to do, while the other aggregator hangs idle. This also causes the status of all three tasks to show "running", although two of them should be "completed". Currently, this type of job finishes only when time runs out.
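
To make the example above concrete, here is a minimal Python sketch of how the spare group could be identified. It is not the project's code: the helper names `realm_matches` and `split_groups` are hypothetical, and it assumes that a dataset belongs to a groupBy value when its realm equals that value or sits below it in the realm path.

```python
from typing import List, Tuple


def realm_matches(group: str, realm: str) -> bool:
    """Assumption: a realm belongs to a group if it equals the group
    or is nested under it (path-prefix match on '/')."""
    return realm == group or realm.startswith(group + "/")


def split_groups(
    group_by_values: List[str], dataset_realms: List[str]
) -> Tuple[List[str], List[str]]:
    """Split groupBy values into those with at least one dataset and those with none."""
    used = [
        g for g in group_by_values
        if any(realm_matches(g, r) for r in dataset_realms)
    ]
    spare = [g for g in group_by_values if g not in used]
    return used, spare


# Values taken from the example in this report.
used, spare = split_groups(["default/us", "default/eu"], ["default/us/west"])
print("used:", used)
print("spare:", spare)
```

Under that assumption, default/eu ends up in the spare list, which corresponds to the aggregator that stays idle.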

To Reproduce
Steps to reproduce the behavior:

  1. Go to /examples/mnist
  2. Add default/eu next to default/us in schema.json.
  3. Run the example as instructed in the tutorial.
  4. Observe that the job never ends and keeps showing "running" in the dashboard. Log into each pod and you will find that one is idle.

Expected behavior
Either the spare resource should not be allocated in the first place, or the job should have a different ending mechanism when a spare resource exists.
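
As one possible shape for the second option, the sketch below shows an aggregator that gives up after a grace period if no peer ever joins, so its task can be marked "completed" instead of hanging. Everything here is hypothetical: `wait_for_peers`, `count_peers`, and the grace period are illustrative names and parameters, not part of the existing system.

```python
import time
from typing import Callable


def wait_for_peers(
    count_peers: Callable[[], int],
    grace_seconds: float,
    poll_seconds: float = 0.1,
) -> str:
    """Poll for peers until the grace period ends.

    Returns "running" as soon as at least one peer is seen, or
    "completed" if the grace period passes with no peers (a spare aggregator).
    """
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        if count_peers() > 0:
            return "running"
        time.sleep(poll_seconds)
    return "completed"


# A spare aggregator never sees a peer; one with a trainer attached does.
spare_status = wait_for_peers(lambda: 0, grace_seconds=0.3)
active_status = wait_for_peers(lambda: 1, grace_seconds=0.3)
print(spare_status, active_status)
```

The grace period trades off how long a spare pod lingers against the risk of ending an aggregator whose trainers are simply slow to start, so it would need to be chosen with the real startup times in mind.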
