Describe the bug
The system spawns one aggregator (top or middle) per value in the groupBy tags (i.e., one pod for each groupBy tag), but not all of those pods will necessarily be utilized.
For instance, if I specify two tags default/us and default/eu in schema.json with only one dataset with realm default/us/west, the system will create three agents (i.e., two aggregators and one trainer). While one of the two aggregators and the trainer are doing what they're supposed to do, the other aggregator is hanging/idle, which also causes the status of all three tasks to show "running" although two of them should be "completed". Currently, this type of job will only finish when time runs out.
To Reproduce
Steps to reproduce the behavior:
1. Go to /examples/mnist.
2. Add default/eu next to default/us in schema.json.
3. Run the example as instructed in the tutorial.
4. Observe that the job never ends and shows "running" in the dashboard. Log into the pods and you will find one of them idle.
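For step 2, the groupBy change would look roughly like the sketch below. This is illustrative only: the channel name, the pair entries, and the overall schema.json layout are assumptions and may not match the mnist example exactly; the relevant part is the second tag value added next to default/us.

```json
{
  "channels": [
    {
      "name": "param-channel",
      "pair": ["aggregator", "trainer"],
      "groupBy": {
        "type": "tag",
        "value": ["default/us", "default/eu"]
      }
    }
  ]
}
```

With only one dataset in realm default/us/west, only the default/us group receives any work; the default/eu tag still gets its own aggregator pod, which then sits idle.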
Expected behavior
Either do not allocate the spare resource in the first place, or provide a termination mechanism that lets the job finish when a spare resource is left idle.