Set default NUMA policy to "interleave" system-wide


If using RHEL/CentOS/Fedora, I'd suggest using the numad daemon. (Red Hat paywall link).
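To get it running, something like the following should be enough (a sketch only - package and service names assume the stock numad package, so adjust for your release):

    # Install and enable the numad daemon (assumes the distro "numad" package)
    yum install numad
    systemctl enable numad && systemctl start numad   # RHEL/CentOS 7+
    # On RHEL/CentOS 6: chkconfig numad on && service numad start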

While I don't have much use for the numactl --interleave directive myself, it seems you've determined that your workload requires it. Can you explain why that's the case, to provide some better context?
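For reference, the per-process form being discussed looks like this (the application name and arguments below are placeholders):

    # Interleave this process's page allocations across all NUMA nodes
    numactl --interleave=all myapp --my-args

    # Or interleave across a specific set of nodes only
    numactl --interleave=0,1 myapp --my-args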

Edit:

It seems that most applications that recommend explicit numactl definition either make a libnuma library call or incorporate numactl in a wrapper script.
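The wrapper approach is usually just a shell stub that re-execs the real binary under the desired policy, roughly like this (the binary path here is a placeholder):

    #!/bin/sh
    # Hypothetical launcher: run the real application with interleaved allocations
    exec numactl --interleave=all /opt/myapp/bin/myapp "$@"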

For the numad side, there's a configuration option that can be specified on the command line or in /etc/numad.conf...

-K <0|1>
   This option controls whether numad keeps interleaved memory spread across NUMA nodes, or
   attempts to merge interleaved memory to local NUMA nodes. The default is to merge interleaved
   memory. This is the appropriate setting to localize processes in a subset of the system's
   NUMA nodes. If you are running a large, single-instance application that allocates
   interleaved memory because the workload will have continuous unpredictable memory access
   patterns (e.g. a large in-memory database), you might get better results by specifying -K 1
   to instruct numad to keep interleaved memory distributed.

Some say that running something like numad -K 1 -u X, where X is 100 x the core count, may help here. Try it.
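Spelled out literally, that suggestion looks like the following (the nproc arithmetic is just my rendering of "100 x core count" - check the -u semantics in your numad man page and test before relying on it):

    # Keep interleaved memory spread across NUMA nodes (-K 1),
    # with the utilization target set to 100 x the number of cores
    CORES=$(nproc)
    numad -K 1 -u $((100 * CORES))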

Also see HP's ProLiant Whitepaper on Linux and NUMA.

Author: BeeOnRope (updated on September 18, 2022)

Comments

  • BeeOnRope
    BeeOnRope over 1 year

    I know it is possible to set the NUMA mode to "interleave" (see NB below) for a specific process using numactl --interleave, but I'd like to know if it is possible to make this the system-wide default (aka change the "system policy"). For example, is there a kernel boot flag to achieve this?

    NB: here I'm talking about the kernel behavior, which interleaves allocated pages across NUMA nodes - not the memory controller setting at the BIOS level, which interleaves cache lines across nodes.

    • Admin
      Admin over 9 years
      Which specific OS and version are you using?
    • Admin
      Admin over 9 years
      I've heard of premature optimization, but this sounds like premature un-optimization! I'm very curious as to what the use case is for this.
    • Admin
      Admin over 9 years
      @MichaelHampton Some databases and large-memory applications recommend this (here, here and here).
    • Admin
      Admin over 9 years
      At this point, curiosity and the need to test different configurations. One of the favorite responses to any question on Stack Exchange sites seems to be "why would you even want to do that?!". Well, another common response is "You need to test that (configuration, idea, optimization, etc.)". So to test things, you need to be able to configure them in different ways...
    • Admin
      Admin over 9 years
      @ewwhite We are largely using RHEL, although I'm especially interested in options available on all modern Linux distros.
    • Admin
      Admin about 7 years
      Not that it matters, but the "interleave" feature makes sure you get access to all the memory. If you don't interleave, then when you malloc, you get a block close to the core the thread is running on. In some situations, you might exhaust one NUMA node while the other is free - and I believe malloc() won't try the other node by default. Thus, some database developers think interleave is better. Whether they are right or not - the answer, as stated here, is "test, test".
    • Admin
      Admin about 7 years
      @BrianBulkowski - I think that's mostly not the case. Based on an inspection of the source, malloc isn't even NUMA-aware, so the underlying behavior comes largely from the OS allocation policy (i.e., where sbrk and mmap pages get allocated). The details are available, but even there none of the NUMA policies are "allocate on the local node or else fail" - they always fall back. Of course, the admin could bind the process using a NUMA policy or cpusets, or the programmer could use NUMA-specific calls.
  • BeeOnRope
    BeeOnRope over 9 years
    Basically I have a situation where it may be difficult to use numactl explicitly to launch the process, so I was curious if there was a way to set the default. The various policies, such as interleave, do seem to exist (the kernel uses interleave at startup, for example), so it seemed that there would be some way to set the default.
  • ewwhite
    ewwhite over 9 years
    I understand. Use numad instead.
  • BeeOnRope
    BeeOnRope over 9 years
    Based on my limited understanding of numad, it doesn't seem like it can do what I want. It mostly moves memory around after the fact, trying to consolidate working sets that have become spread across nodes - but it doesn't seem to affect the initial allocation node. So it can only help me "de-interleave", but never increase the interleaving.
  • ewwhite
    ewwhite over 9 years
    Please give your hardware and OS specifics.
  • BeeOnRope
    BeeOnRope over 9 years
    Let's say RHEL and x86 commodity servers (e.g., Dell PowerEdge stuff).
  • ewwhite
    ewwhite over 9 years
    @BeeOnRope 2-socket machines? Not 4-socket? See my edit above.
  • BeeOnRope
    BeeOnRope over 9 years
    Let's say mostly 2-socket, but does it matter?
  • ewwhite
    ewwhite over 9 years
    @BeeOnRope Yeah, it matters a little. I've worked with a lot of 4-socket machines and have had to modify policy with some knowledge of the changes to the underlying architecture. Granted, this was about achieving the best locality, but I just wanted to check if you were dealing with an edge case.