I hijacked a stuck VM from the old cloud, played around with it for a while, and tested various ways to resolve our issue with Galaxy not recording Cgroup stats.
I raised the issue with the HTCondor community; here is the email thread for reference.
I tried several ways to update Condor's Cgroup conf and make it inherit the root Cgroup `controllers` and `subtree_control`, and nothing helped.

The simple solution is to change the `BASE_CGROUP` path to `htcondor`, which will be at `/sys/fs/cgroup/htcondor`, unlike the previous one, which is in `system.slice/condor.service`. The `systemd`-controlled cgroups, which are located under `/sys/fs/cgroup/system.slice`, are not so easy to change or tweak. By changing the `BASE_CGROUP` path to the root of the Cgroup (`/sys/fs/cgroup`), `htcondor`, which is the child, inherits the `controllers` and `subtree_control` config from its parent.

I also ran a test job as Galaxy and submitted the below job to this test machine.

    Universe = vanilla
    Executable = test_job_cgroup.sh
    Log = test_job_cgroup.log
    Output = test_job_cgroup.out
    Error = test_job_cgroup.err
    Request_cpus = 1
    requirements = (Machine == "vgcnbwc-worker-c120m225-test-0000.novalocal")
    Queue

`test_job_cgroup.sh` script (the snippet is actually from Galaxy; this is what Galaxy adds to every job script)

Upon checking the job-specific Cgroup on the test host, we can see that the child Cgroup is being created, and it successfully inherits the `controllers` from the parent.

Here is the Cgroups output from the test job:
`__instrument_cgroup__metrics`
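
For anyone who wants to repeat the inheritance check on a redeployed worker, here is a minimal sketch of how it could be scripted. It is not the Galaxy snippet and not anything HTCondor ships. The function names `cgroup_v2_path` and `inherits_controllers` and the `CGROUP_ROOT` override are hypothetical and exist only for illustration. It relies on the cgroup v2 rule that a child's `cgroup.controllers` lists what the parent enabled in its `cgroup.subtree_control`.

```shell
#!/usr/bin/env bash
# Sketch (assumption: cgroup v2, unified hierarchy mounted at /sys/fs/cgroup).
# CGROUP_ROOT is a hypothetical override so the check can run against a fake tree.
CGROUP_ROOT="${CGROUP_ROOT:-/sys/fs/cgroup}"

# Print the cgroup v2 path (the "0::" entry) from a /proc/<pid>/cgroup-style file.
cgroup_v2_path() {
    local proc_file="${1:-/proc/self/cgroup}"
    sed -n 's/^0:://p' "$proc_file"
}

# Return 0 if every controller the parent delegates via cgroup.subtree_control
# is listed in the child's cgroup.controllers, 1 if not, 2 if a file is unreadable.
inherits_controllers() {
    local child="$1"                      # path relative to CGROUP_ROOT, e.g. /htcondor
    local child_dir="${CGROUP_ROOT}${child}"
    local parent_dir delegated available ctrl
    parent_dir="$(dirname "$child_dir")"

    delegated="$(cat "$parent_dir/cgroup.subtree_control")" || return 2
    available="$(cat "$child_dir/cgroup.controllers")" || return 2

    # An empty subtree_control means the parent delegates nothing.
    [ -n "$delegated" ] || return 1

    for ctrl in $delegated; do
        case " $available " in
            *" $ctrl "*) ;;               # controller is present in the child
            *) echo "missing controller: $ctrl" >&2; return 1 ;;
        esac
    done
    return 0
}
```

After sourcing the functions, `inherits_controllers /htcondor && echo "controllers inherited"` checks the base Cgroup. Running `inherits_controllers "$(cgroup_v2_path)"` from inside a job script checks the job's own Cgroup.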
NOTE: we need to redeploy all workers to have them properly report the Cgroup stats to Galaxy; we have lost Cgroup stats for almost a year in the Galaxy DB table `job_metric_numeric`.
The above issue was briefly discussed here.