Change condor base cgroup path #309

sanjaysrikakulam · 2024-09-04T10:35:29Z

I hijacked a stuck VM from the old cloud, played around with it for a while, and tested various ways to resolve our issue with Galaxy not recording Cgroup stats.

I raised the issue with the HTCondor community; here is the email thread for reference:

Hey Matthias,

Thank you for sharing! I thought of something similar to your script as a "quick fix" to resolve the problem temporarily.

Clarification:

The "cgroup.subtree_control" under "/sys/fs/cgroup/" and "/sys/fs/cgroup/system.slice" are created correctly.

Our BASE_CGROUP = system.slice/condor.service

Basically:

/sys/fs/cgroup/
    ├── cgroup.controllers 
    ├── cgroup.subtree_control
    ├── system.slice/
        ├── cgroup.controllers
        ├── cgroup.subtree_control
        ├── condor.service/
            ├── cgroup.controllers
            ├── cgroup.subtree_control (empty)
            └── <HTCondor jobs/subgroups>/
                ├── cgroup.controllers (empty)
                └── cgroup.subtree_control (empty)

I hope this adds more clarity to my question. Not sure why HTCondor is not inheriting the parent "cgroup.subtree_control" correctly from the "system.slice" and probably this is the reason why the job/subgroup specific dirs are not getting configured properly. I will set up a test instance and see if the "quick fix" works for me. I hope someone has a fix to our problem.

On 8/15/2024 5:18 PM, Matthias Schnepf wrote:
> Hi,
>
> I'm not sure why your "cgroup.subtree_control" file is empty at point 6 and what manages it (condor or systemd, I think).
> We have a similar problem: the cgroup controllers do not get set correctly.
> I hope someone else has an idea to fix your/our problem with the empty "cgroup.subtree_control" file.
>
> But here is the idea behind the "quick fix" we currently use.
> We use the development version of condor (23.7.2) and RHEL8.
> Our condor settings for cgroup v2 are:
>
> BASE_CGROUP = htcondor
> CGROUP_MEMORY_LIMIT_POLICY = custom
> CGROUP_HARD_MEMORY_LIMIT_EXPR = 2 * Target.RequestMemory
> CGROUP_LOW_MEMORY_LIMIT = 0.75 * Target.RequestMemory
>
> The job cgroups are created in /sys/fs/cgroup/htcondor. We set the cgroup.subtree_control file via a cronjob at boot time.
>
>
> #!/bin/bash
>
> echo +cpu +cpuset +memory +pids >> /sys/fs/cgroup/cgroup.subtree_control
> export cgroup_name="/sys/fs/cgroup/htcondor"
> if [ ! -d ${cgroup_name} ]; then
>     mkdir ${cgroup_name}
> fi
> echo +cpu +cpuset +memory +pids >> /sys/fs/cgroup/htcondor/cgroup.subtree_control
>
> With that, CPU, memory, and pids controller are set for the htcondor cgroup and its jobs/subgroups. With that, condor sets the correct memory limits, CPU weights, and monitors the memory.
>
> Best regards,
>
> Matthias
>
>
> On 8/15/24 4:47 PM, Sanjay Kumar Srikakulam wrote:
>> Hi,
>>
>> We run an HTCondor cluster and recently noticed we are missing the Cgroups accounting. Our setup,
>>
>> HTCondor:
>>
>> $CondorVersion: 23.0.6 2024-03-14 BuildID: 720565 PackageID: 23.0.6-1 $
>> $CondorPlatform: x86_64_AlmaLinux9 $
>>
>> 1. We are using Rocky 9 on workers
>> 2. CgroupV2 is mounted on the workers
>> 3. The CgroupV2 controllers file has the list: "cpuset cpu io memory hugetlb pids rdma misc"
>> 4. HTCondor is configured to use CGroups:
>>
>> BASE_CGROUP = system.slice/condor.service
>> CGROUP_MEMORY_LIMIT_POLICY = hard
>> RESERVED_MEMORY = 2048
>>
>> 5. I can see the "condor.service" directory under "/sys/fs/cgroup/system.slice"
>> 6. HTCondor is inheriting the parent controllers properly: I see the "cgroup.controllers" file, and it has the same list of controllers as the parent (above). However, the "cgroup.subtree_control" file is empty (the parent's file lists the controller names, so this is not getting created or inherited properly).
>> 7. According to the HTCondor doc (https://htcondor.readthedocs.io/en/latest/admin-manual/ep-policy-configuration.html#cgroup-based-process-tracking), once BASE_CGROUP is defined, every condor job gets a dedicated dir under the BASE_CGROUP path for cgroup accounting. When jobs are submitted, I see subdirectories such as "condor_var_lib_condor_execute_slot1_7@hostname". However, the "cgroup.controllers" file is empty in these sub-directories and is somehow not inheriting from the parent. Similarly, the "cgroup.subtree_control" file is also empty.
>>
>> 8. We also added the "CREATE_CGROUP_WITHOUT_ROOT = True" to our HTCondor config and restarted the condor services without luck.
>> 9. Also, from the starter log: "StarterLog.slot1_1:08/15/24 14:21:09 (pid:3758318) ProcFamilyDirectCgroupV2::track_family_via_cgroup error writing to /sys/fs/cgroup/system.slice/condor.service/cgroup.subtree_control: Device or resource busy". HTCondor seems to be hitting the "no internal processes" rule (https://unix.stackexchange.com/questions/680167/ebusy-when-trying-to-add-process-to-cgroup-v2; https://manpath.be/f35/7/cgroups#L557).
>>
>> Any help on resolving this is much appreciated!

I tried several ways to update Condor's Cgroup configuration so that it inherits the root Cgroup's controllers and subtree_control, but nothing helped.
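
For anyone who wants to reproduce the diagnosis, here is a minimal sketch that walks a Cgroup v2 path level by level and prints what each level offers and delegates. The function names (print_level, show_cgroup_chain) are made up for this illustration and are not part of HTCondor or Galaxy. The process count is included because, under the "no internal processes" rule, a non-root Cgroup that has processes directly in it cannot enable controllers in its own cgroup.subtree_control.

```shell
#!/bin/bash
# Diagnostic sketch (hypothetical helper names): for each level of a cgroup v2
# path, show which controllers are available (cgroup.controllers), which are
# delegated to children (cgroup.subtree_control), and how many processes sit
# directly in that cgroup.

print_level() {
    local dir="$1" f content
    echo "$dir"
    for f in cgroup.controllers cgroup.subtree_control; do
        if [ -r "$dir/$f" ]; then
            content=$(cat "$dir/$f")
            [ -n "$content" ] || content="(empty)"
        else
            content="(missing)"
        fi
        echo "  $f: $content"
    done
    if [ -r "$dir/cgroup.procs" ]; then
        echo "  processes directly in this cgroup: $(( $(wc -l < "$dir/cgroup.procs") ))"
    fi
}

# show_cgroup_chain <cgroup root> <relative path>
show_cgroup_chain() {
    local dir="$1" rel="$2" part
    local -a parts
    IFS='/' read -r -a parts <<< "$rel"
    print_level "$dir"
    for part in "${parts[@]}"; do
        [ -n "$part" ] || continue
        dir="$dir/$part"
        print_level "$dir"
    done
}
```

On a worker this could be called as: show_cgroup_chain /sys/fs/cgroup system.slice/condor.service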

The simple solution is to change BASE_CGROUP to htcondor, which places the base Cgroup at /sys/fs/cgroup/htcondor instead of /sys/fs/cgroup/system.slice/condor.service. The Cgroups under /sys/fs/cgroup/system.slice are controlled by systemd and are not easy to change or tweak. With the base Cgroup directly under the Cgroup root (/sys/fs/cgroup), the htcondor Cgroup, as a child of the root, inherits the controllers from its parent, and its subtree_control is populated:

root@vgcnbwc-worker-c120m225-test-0000:/sys/fs/cgroup$ cat htcondor/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc

root@vgcnbwc-worker-c120m225-test-0000:/sys/fs/cgroup$ cat htcondor/cgroup.subtree_control
cpu io memory pids
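
A check like this could be scripted when validating redeployed workers. The sketch below is only an illustration: missing_controllers is a hypothetical helper, and which controllers count as "required" is an assumption left to the caller.

```shell
#!/bin/bash
# missing_controllers <subtree_control file> <controller>...
# Hypothetical helper: prints each requested controller that is NOT listed
# in the given cgroup.subtree_control file (one per line).
missing_controllers() {
    local file="$1" ctrl enabled=""
    shift
    if [ -r "$file" ]; then
        enabled=$(tr '\n' ' ' < "$file")
    fi
    for ctrl in "$@"; do
        # Pad with spaces so that "cpu" does not match "cpuset".
        case " $enabled " in
            *" $ctrl "*) ;;
            *) echo "$ctrl" ;;
        esac
    done
}
```

For the listing above, missing_controllers /sys/fs/cgroup/htcondor/cgroup.subtree_control cpu memory would print nothing, since both controllers are enabled.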

I also ran a test job as Galaxy and submitted the below job to this test machine.

Universe = vanilla
Executable = test_job_cgroup.sh
Log = test_job_cgroup.log
Output = test_job_cgroup.out
Error = test_job_cgroup.err
Request_cpus = 1
requirements = (Machine == "vgcnbwc-worker-c120m225-test-0000.novalocal")
Queue

The test_job_cgroup.sh script (the Cgroup snippet comes from Galaxy, which adds it to every job script):

#!/bin/bash
echo "Hello World"
echo $(hostname)

sleep 60

for ((i=1; i<=10000000; i++)); do
    :
done

# Cgroup stuff added by Galaxy to each job script
if [ -e "/proc/$$/cgroup" -a -d "/sys/fs/cgroup" -a ! -f "/sys/fs/cgroup/cgroup.controllers" ]; then
    cgroup_path=$(cat "/proc/$$/cgroup" | awk -F':' '($2=="cpuacct,cpu") || ($2=="cpu,cpuacct") {print $3}');

    if [ ! -e "/sys/fs/cgroup/cpu$cgroup_path/cpuacct.usage" ]; then
        cgroup_path="";
    fi;

    for f in /sys/fs/cgroup/{cpu\,cpuacct,cpuacct\,cpu}$cgroup_path/{cpu,cpuacct}.*; do
        if [ -f "$f" ]; then
            echo "__$(basename $f)__" >> /data/jwd05e/main/test_condor_submit_cgroup/__instrument_cgroup__metrics;
            cat "$f" >> /data/jwd05e/main/test_condor_submit_cgroup/__instrument_cgroup__metrics 2>/dev/null;
        fi;
    done;

    cgroup_path=$(cat "/proc/$$/cgroup" | awk -F':' '$2=="memory"{print $3}');

    if [ ! -e "/sys/fs/cgroup/memory$cgroup_path/memory.max_usage_in_bytes" ]; then
        cgroup_path="";
    fi;

    for f in /sys/fs/cgroup/memory$cgroup_path/memory.*; do
        echo "__$(basename $f)__" >> /data/jwd05e/main/test_condor_submit_cgroup/__instrument_cgroup__metrics;
        cat "$f" >> /data/jwd05e/main/test_condor_submit_cgroup/__instrument_cgroup__metrics 2>/dev/null;
    done;
fi

if [ -e "/proc/$$/cgroup" -a -f "/sys/fs/cgroup/cgroup.controllers" ]; then
    cgroup_path=$(cat "/proc/$$/cgroup" | awk -F':' '($1=="0") {print $3}');

    echo "$cgroup_path"
    ls -la /sys/fs/cgroup/${cgroup_path}/
    for f in /sys/fs/cgroup/${cgroup_path}/{cpu,memory}.*; do
        echo "__$(basename $f)__" >> /data/jwd05e/main/test_condor_submit_cgroup/__instrument_cgroup__metrics;
        cat "$f" >> /data/jwd05e/main/test_condor_submit_cgroup/__instrument_cgroup__metrics 2>/dev/null;
    done;
fi

sleep 10
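
The Cgroup v2 branch of the snippet hinges on the entry with hierarchy ID 0 in /proc/<pid>/cgroup, which has the form 0::<path>. As a small standalone sketch of just that step (cgroup_v2_path is a hypothetical name, not something Galaxy defines):

```shell
#!/bin/bash
# cgroup_v2_path (hypothetical helper): read /proc/<pid>/cgroup-formatted
# lines on stdin and print the path of the unified (v2) hierarchy, i.e. the
# third field of the entry whose hierarchy ID is 0.
cgroup_v2_path() {
    awk -F':' '$1 == "0" { print $3; exit }'
}
```

Inside a job it could be used as cgroup_v2_path < "/proc/$$/cgroup"; the job's metrics files then live under /sys/fs/cgroup followed by the printed path.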

Upon checking the job-specific Cgroup on the test host, we can see that the child Cgroup is being created, and it successfully inherits the controllers from the parent.

root@vgcnbwc-worker-c120m225-test-0000:/sys/fs/cgroup$ ll htcondor/
total 0
-r--r--r--. 1 root root 0 Sep  4 11:56 cgroup.controllers
-r--r--r--. 1 root root 0 Sep  4 11:56 cgroup.events
-rw-r--r--. 1 root root 0 Sep  4 11:56 cgroup.freeze
--w-------. 1 root root 0 Sep  4 11:56 cgroup.kill
-rw-r--r--. 1 root root 0 Sep  4 11:56 cgroup.max.depth
-rw-r--r--. 1 root root 0 Sep  4 11:56 cgroup.max.descendants
-rw-r--r--. 1 root root 0 Sep  4 11:56 cgroup.procs
-r--r--r--. 1 root root 0 Sep  4 11:56 cgroup.stat
-rw-r--r--. 1 root root 0 Sep  4 11:56 cgroup.subtree_control
-rw-r--r--. 1 root root 0 Sep  4 11:56 cgroup.threads
-rw-r--r--. 1 root root 0 Sep  4 11:56 cgroup.type
drwxr-xr-x. 2 root root 0 Sep  4 11:56 condor_var_lib_condor_execute_slot1_1@vgcnbwc-worker-c120m225-test-0000.novalocal
-rw-r--r--. 1 root root 0 Sep  4 11:56 cpu.idle
.....

root@vgcnbwc-worker-c120m225-test-0000:/sys/fs/cgroup$ cat htcondor/condor_var_lib_condor_execute_slot1_1@vgcnbwc-worker-c120m225-test-0000.novalocal/cgroup.controllers
cpu io memory pids

Here is the Cgroups output from the test job

__instrument_cgroup__metrics

__cpu.idle__
0
__cpu.max__
max 100000
__cpu.max.burst__
0
__cpu.stat__
usage_usec 31291908
user_usec 25429593
system_usec 5862315
core_sched.force_idle_usec 0
nr_periods 0
nr_throttled 0
throttled_usec 0
nr_bursts 0
burst_usec 0
__cpu.weight__
100
__cpu.weight.nice__
0
__memory.current__
2408448
__memory.events__
low 0
high 0
max 0
oom 0
oom_kill 0
oom_group_kill 0
__memory.events.local__
low 0
high 0
max 0
oom 0
oom_kill 0
oom_group_kill 0
__memory.high__
max
__memory.low__
0
__memory.max__
134217728
__memory.min__
0
__memory.numa_stat__
anon N0=241664
file N0=45056
kernel_stack N0=16384
pagetables N0=36864
sec_pagetables N0=0
shmem N0=0
file_mapped N0=0
file_dirty N0=0
file_writeback N0=0
swapcached N0=0
anon_thp N0=0
file_thp N0=0
shmem_thp N0=0
inactive_anon N0=221184
active_anon N0=4096
inactive_file N0=40960
active_file N0=4096
unevictable N0=0
slab_reclaimable N0=34936
slab_unreclaimable N0=129288
workingset_refault_anon N0=0
workingset_refault_file N0=0
workingset_activate_anon N0=0
workingset_activate_file N0=0
workingset_restore_anon N0=0
workingset_restore_file N0=0
workingset_nodereclaim N0=0
__memory.oom.group__
1
__memory.peak__
3530752
__memory.reclaim__
__memory.stat__
anon 237568
file 45056
kernel 761856
kernel_stack 16384
pagetables 32768
sec_pagetables 0
percpu 0
sock 0
vmalloc 0
shmem 0
zswap 0
zswapped 0
file_mapped 0
file_dirty 0
file_writeback 0
swapcached 0
anon_thp 0
file_thp 0
shmem_thp 0
inactive_anon 217088
active_anon 4096
inactive_file 40960
active_file 4096
unevictable 0
slab_reclaimable 34936
slab_unreclaimable 136120
slab 171056
workingset_refault_anon 0
workingset_refault_file 0
workingset_activate_anon 0
workingset_activate_file 0
workingset_restore_anon 0
workingset_restore_file 0
workingset_nodereclaim 0
pgscan 0
pgsteal 0
pgscan_kswapd 0
pgscan_direct 0
pgsteal_kswapd 0
pgsteal_direct 0
pgfault 5852
pgmajfault 1
pgrefill 0
pgactivate 1
pgdeactivate 0
pglazyfree 0
pglazyfreed 0
zswpin 0
zswpout 0
thp_fault_alloc 0
thp_collapse_alloc 0
__memory.swap.current__
0
__memory.swap.events__
high 0
max 0
fail 0
__memory.swap.high__
max
__memory.swap.max__
max
__memory.zswap.current__
0
__memory.zswap.max__
max
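
To spot-check a single value in a file like this without scrolling, a small helper can print only one section. This is a sketch that assumes the layout shown above (a __name__ header line followed by that file's contents); metric_section is a hypothetical name.

```shell
#!/bin/bash
# metric_section <metrics file> <metric name>
# Hypothetical helper: prints the lines that follow the "__<name>__" header,
# up to the next "__...__" header.
metric_section() {
    local file="$1" name="$2"
    awk -v header="__${name}__" '
        /^__.*__$/ { in_section = ($0 == header); next }
        in_section { print }
    ' "$file"
}
```

For example, metric_section __instrument_cgroup__metrics memory.peak would print the peak memory usage recorded for the job, and metric_section __instrument_cgroup__metrics cpu.stat would print the CPU accounting lines.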

NOTE: we need to redeploy all workers so that they properly report the Cgroup stats to Galaxy; we have lost almost a year of Cgroup stats in the Galaxy DB table job_metric_numeric.

The above issue was briefly discussed here.

bgruening · 2024-09-04T10:43:42Z

I can not say that I understand everything, but it sounds all clear. +1 from my side.

Please talk to Manuel; he will also redeploy new images to the cloud hosts in the next few days, so we can piggyback on that and deploy new VM images as well.

Thanks @sanjaysrikakulam

sanjaysrikakulam · 2024-09-04T11:45:30Z

> I can not say that I understand everything, but it sounds all clear. +1 from my side.
>
> Please talk to Manuel, he will also redeploy new images to the cloud hosts in the next days, so we can piggy back on that and deploy now VM images as well.
>
> Thanks @sanjaysrikakulam

Yup, that's my plan as well.

mira-miracoli

Thank you! I think BASE_CGROUP=htcondor should be a safe option; it is also the default value according to the documentation.

Change condor base cgroup path

10a2ab1

sanjaysrikakulam requested review from bgruening, sj213 and mira-miracoli September 4, 2024 10:35

mira-miracoli approved these changes Sep 5, 2024

sanjaysrikakulam merged commit 25198b7 into usegalaxy-eu:main Sep 5, 2024
5 checks passed
