How does CPU affinity interact with cgroups in Linux?

From the cpusets documentation:

Calls to sched_setaffinity are filtered to just those CPUs allowed in that task's cpuset.

This implies that CPU affinity masks are intersected with the cpus in the cgroup that the process is a member of.

For example, if the affinity mask of a process includes cores {0, 1, 3} and the process is in the system cgroup, which is restricted to cores {1, 2}, then the process is forced to run on core 1.
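
To make the intersection concrete, here is a minimal sketch in Python. The function name `effective_cpus` is my own invention for illustration; the kernel does this filtering internally, and this only mimics the set arithmetic described in the documentation.

```python
def effective_cpus(requested, cpuset):
    """Return the CPUs a task may actually run on.

    requested: CPUs named in the affinity mask passed to sched_setaffinity
    cpuset:    CPUs allowed by the cpuset (cgroup) the task belongs to
    """
    return set(requested) & set(cpuset)


# The example from the text: mask {0, 1, 3} in a cpuset restricted to {1, 2}.
print(effective_cpus({0, 1, 3}, {1, 2}))  # {1}
```

On a live Linux system, `os.sched_getaffinity(0)` returns the current process's affinity mask, which should already reflect this filtering.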

I'm 99% certain that the htop output is "wrong" because the processes have not woken up since the cgroups were created, so the display is showing the last core each process ran on.

If I start vim before making my shield, vim forks twice (for some reason), and the deepest child is running on core 2. If I then make the shield, then sleep vim (ctrl+z) and wake it, both processes have moved to core 0. I think this confirms the hypothesis that htop is showing stale information.
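
One way to check the "last core" value yourself is to read it from /proc/<pid>/stat, where field 39 (`processor`) is documented in proc(5) as the CPU number the task last executed on. The sketch below assumes htop's CPU column is derived from this field, which I have not verified in its source; the helper names and the sample line are made up for illustration.

```python
def last_cpu(stat_text):
    """Extract field 39 (CPU last executed on) from /proc/<pid>/stat text."""
    # Field 2 (comm) may contain spaces or parentheses, so split only
    # what follows the final ')'. The first item of `rest` is field 3.
    rest = stat_text[stat_text.rindex(")") + 2:].split()
    return int(rest[39 - 3])


def read_last_cpu(pid):
    """Read the value for a live process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return last_cpu(f.read())


# Synthetic stat line (not real output): fields 4-38 are zeroed, field 39 is 3.
sample = " ".join(["857", "(console-kit-dae)", "S"] + ["0"] * 35 + ["3", "0", "0"])
print(last_cpu(sample))  # 3
```

If the value printed for a sleeping process lies outside its allowed CPUs, that is consistent with the stale-information hypothesis: the number only changes once the process is scheduled again.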

You can also inspect /proc/<pid>/status and look at the Cpus_allowed and Cpus_allowed_list fields.

E.g. I have a console-kit-daemon process (pid 857) here showing in htop as running on core 3, but in /proc/857/status:

Cpus_allowed:   1
Cpus_allowed_list:      0

I think this is saying that the affinity mask is 0x1, which allows running only on core 0 because of the cgroups: i.e. intersect({0,1,2,3}, {0}) = {0}.
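
Cpus_allowed is a hexadecimal bitmap in which bit n stands for CPU n, so it can be decoded mechanically. The following is a rough sketch; the function names are hypothetical, and the sample text just reproduces the two fields shown above.

```python
def mask_to_cpus(mask_hex):
    """Convert a Cpus_allowed hex mask (e.g. "1" or "ff,ffffffff") to CPU numbers."""
    # Large masks are printed as comma-separated 32-bit groups.
    value = int(mask_hex.replace(",", ""), 16)
    return {bit for bit in range(value.bit_length()) if (value >> bit) & 1}


def allowed_cpus(status_text):
    """Find the Cpus_allowed field in /proc/<pid>/status text and decode it."""
    for line in status_text.splitlines():
        if line.startswith("Cpus_allowed:"):
            return mask_to_cpus(line.split(":", 1)[1].strip())
    raise ValueError("no Cpus_allowed field found")


sample = "Cpus_allowed:\t1\nCpus_allowed_list:\t0\n"
print(allowed_cpus(sample))  # {0}
```

For a live process you would pass in the contents of /proc/<pid>/status instead of the sample string.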

If I can, I'll leave the question open a while to see if any better answer comes up.

Thanks to @davmac for helping with this (on irc).

Author by

Edd Barrett

Computer scientist, programmer.

Updated on September 18, 2022

Comments

  • Edd Barrett
    Edd Barrett almost 2 years

    I'm trying to run multi-threaded benchmarks on a set of isolated CPUs. To cut a long story short, I initially tried with isolcpus and taskset, but hit problems. Now I'm playing with cgroups/csets.

    I think the "simple" cset shield use-case should work nicely. I have 4 cores, so I would like to use cores 1-3 for benchmarking (I've also configured these cores to be in adaptive ticks mode), then core 0 can be used for everything else.

    Following the tutorial here, it should be as simple as:

    $ sudo cset shield -c 1-3
    cset: --> shielding modified with:
    cset: "system" cpuset of CPUSPEC(0) with 105 tasks running
    cset: "user" cpuset of CPUSPEC(1-3) with 0 tasks running
    

    So now we have a "shield" which is isolated (the user cset) and core 0 is for everything else (the system cset).

    Alright, looks good so far. Now let's look at htop. The processes should all have been migrated onto CPU 0:

    [Image: csets]

    Huh? Some of the processes are shown as running on the shielded cores. To rule out the case that htop has a bug, I also tried using taskset to inspect the affinity mask of a process shown as being in the shield.

    Maybe those tasks were unmovable? Let's pluck an arbitrary process shown as running on CPU3 (which should be in the shield) in htop and see if it appears in the system cgroup according to cset:

    $ cset shield -u -v | grep 864
       root       864     1 Soth [gmain]
       vext01    2412  2274 Soth grep 864 
    

    Yep, that's in the system cgroup according to cset. So htop and cset disagree.

    So what's going on here? Which do I trust: CPU affinities (htop/taskset) or cset?

    I suspect that you are not supposed to use cset and affinities together. Perhaps the shield is working fine, and I should ignore the affinity masks and htop output. Either way, I find this confusing. Can someone shed some light?

    • ewwhite
      ewwhite about 8 years
      Which distribution are you using? I ask because the tools and workflows are different, depending on OS and version.
    • ewwhite
      ewwhite about 8 years
      Oh, okay. In the Redhat world, we have numactl and the cgconfig and cgrules/cgred to streamline what you're doing. These may be available for Debian with some work.
  • Austin Hemmelgarn
    Austin Hemmelgarn about 6 years
    You are correct, the info shown in HTOP is not what core the process is currently on, but the last core it was scheduled on (same goes for anything which uses the same interface for gathering information).