Describe the bug
The system spawns one aggregator (top or middle) per value in the groupBy tags (i.e., one pod for each groupBy tag), but not all of those pods will necessarily be utilized.
For instance, if I specify two tags default/us and default/eu in schema.json with only one dataset with realm default/us/west, the system will create three agents (i.e., two aggregators and one trainer). While one of the two aggregators and the trainer are doing what they're supposed to do, the other aggregator is hanging/idle, which also causes the status of all three tasks to show "running" although two of them should be "completed". Currently, this type of job will only finish when time runs out.
To Reproduce
Steps to reproduce the behavior:
1. Go to /examples/mnist.
2. Add default/eu next to default/us in schema.json.
3. Run the example as instructed in the tutorial.
4. Observe that the job never ends and shows "running" in the dashboard. Log into the pods and you will find one of them idle.
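For step 2, the groupBy change would look roughly like the sketch below. This is illustrative only: the channel name, the pair entries, and the overall schema.json layout are assumptions and may not match the mnist example exactly; the relevant part is the second tag value added next to default/us.

```json
{
  "channels": [
    {
      "name": "param-channel",
      "pair": ["aggregator", "trainer"],
      "groupBy": {
        "type": "tag",
        "value": ["default/us", "default/eu"]
      }
    }
  ]
}
```

With only one dataset in realm default/us/west, only the default/us group receives any work; the default/eu tag still gets its own aggregator pod, which then sits idle.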
Expected behavior
Either do not allocate the spare resource in the first place, or provide a termination mechanism that lets the job finish when a spare resource is left idle.