Set default NUMA policy to "interleave" system-wide
If using RHEL/CentOS/Fedora, I'd suggest using the `numad` daemon (Red Hat paywall link).
While I don't have much use for the `numactl --interleave` directive myself, it seems you've determined that your workload requires it. Can you explain why that's the case, to provide some better context?
Edit:
It seems that most applications that recommend an explicit `numactl` policy either make a libnuma library call or incorporate `numactl` in a wrapper script.
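As an illustration, a minimal wrapper of the kind those applications ship might look like this (a sketch only; `/usr/sbin/mydaemon` is a placeholder, not a real binary):

```shell
#!/bin/sh
# Hypothetical wrapper: start a daemon with its pages interleaved
# across all NUMA nodes, falling back to a plain start when numactl
# is not installed. /usr/sbin/mydaemon stands in for the real binary.
DAEMON=/usr/sbin/mydaemon

if command -v numactl >/dev/null 2>&1; then
    exec numactl --interleave=all "$DAEMON" "$@"
fi
exec "$DAEMON" "$@"
```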
On the `numad` side, there's a configuration option that can be specified on the command line or in `/etc/numad.conf`:
...
-K <0|1>
This option controls whether numad keeps interleaved memory spread across NUMA nodes, or
attempts to merge interleaved memory to local NUMA nodes. The default is to merge interleaved
memory. This is the appropriate setting to localize processes in a subset of the system's
NUMA nodes. If you are running a large, single-instance application that allocates
interleaved memory because the workload will have continuous unpredictable memory access
patterns (e.g. a large in-memory database), you might get better results by specifying -K 1
to instruct numad to keep interleaved memory distributed.
Some say that running something like `numad -K 1 -u X`, where X is 100 × the core count, may help for this. It's worth testing.
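A quick sketch of that suggestion, assuming `nproc` is available; the guard means nothing runs on machines without `numad` installed:

```shell
# Compute the suggested -u value as 100 x the core count, then run
# numad with -K 1 (keep interleaved memory spread across nodes) if
# the daemon is actually installed.
CORES=$(nproc)
TARGET=$((CORES * 100))
echo "suggested invocation: numad -K 1 -u $TARGET"
if command -v numad >/dev/null 2>&1; then
    numad -K 1 -u "$TARGET"
fi
```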
Also see HP's ProLiant Whitepaper on Linux and NUMA.
BeeOnRope
Updated on September 18, 2022

Comments
-
BeeOnRope over 1 year
I know it is possible to set the NUMA mode to "interleave" (see NB below) for a specific process using `numactl --interleave`, but I'd like to know if it is possible to make this the system-wide default (aka change the "system policy"). For example, is there a kernel boot flag to achieve this?
NB: here I'm talking about the kernel behavior which interleaves allocated pages across NUMA nodes - not the memory-controller setting at the BIOS level which interleaves cache lines across nodes.
-
Admin over 9 years
Which specific OS and version are you using?
-
Admin over 9 years
I've heard of premature optimization, but this sounds like premature un-optimization! I'm very curious as to what the use case is for this.
-
Admin over 9 years
At this point, curiosity and the need to test different configurations. One of the favorite responses to any question on Stack Exchange sites seems to be "why would you even want to do that?!". Another common response is "You need to test that (configuration, idea, optimization, etc.)". So to test things, you need to be able to configure them in different ways...
-
Admin over 9 years
@ewwhite We are largely using RHEL, although I'm especially interested in options available on all modern Linux distros.
-
Admin about 7 years
Not that it matters, but the "interleave" feature makes sure you get access to all the memory. If you don't interleave, then when you malloc, you get a block close to the core that thread is running on. In some situations, you might deplete one NUMA block while the other is free - and I believe malloc() won't try the other NUMA block by default. Thus, some database developers think interleave is better. Whether they are right or not - the answer is "test, test", as stated here.
-
Admin about 7 years
@BrianBulkowski - I think that's mostly not the case. Based on an inspection of the source, `malloc` isn't even NUMA-aware, so the underlying `malloc` behavior comes largely from the OS allocation policy (i.e., where `sbrk` and `mmap` pages get allocated). The details are available, but even there, none of the NUMA policies are "allocate on local node or else fail" - they always fall back. Of course, the admin could bind the process using a NUMA policy or cpusets, or the programmer could use NUMA-specific calls.
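To illustrate the "NUMA-specific calls" route: a process can set its own interleave policy without `numactl` by invoking `set_mempolicy(2)` directly. This is a hedged sketch, not production code - syscall number 238 is x86-64 specific, and the constants are copied from `<linux/mempolicy.h>`:

```python
# Sketch: set MPOL_INTERLEAVE for the current process via the raw
# set_mempolicy(2) syscall, roughly what numactl/libnuma do under
# the hood. x86-64 Linux only.
import ctypes
import platform

MPOL_INTERLEAVE = 3       # from <linux/mempolicy.h>
SYS_set_mempolicy = 238   # syscall number on x86-64

def set_interleave(nodemask_bits: int) -> int:
    """Interleave this process's future page allocations across the
    nodes set in nodemask_bits (bit 0 = node 0). Returns 0 on success,
    -1 on failure."""
    libc = ctypes.CDLL(None, use_errno=True)
    mask = ctypes.c_ulong(nodemask_bits)
    maxnode = ctypes.c_ulong(64)  # size of the mask in bits
    return libc.syscall(SYS_set_mempolicy, MPOL_INTERLEAVE,
                        ctypes.byref(mask), maxnode)

if platform.system() == "Linux" and platform.machine() == "x86_64":
    # On a single-node machine, interleaving over node 0 alone is legal.
    print(set_interleave(0b1))
```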
-
BeeOnRope over 9 years
Basically I have a situation where it may be difficult to use `numactl` explicitly to launch the process, so I was curious if there was a way to set the default. The various policies, such as interleave, do seem to exist (the kernel uses interleave at startup, for example), so it seemed that there would be some way to set the default.
-
ewwhite over 9 years
I understand. Use `numad` instead.
-
BeeOnRope over 9 years
Based on my limited understanding of numad, it doesn't seem like it can do what I want. It mostly moves memory around after the fact, trying to consolidate working sets that have become spread across nodes - but it doesn't seem like it affects the initial allocation node. So it can only help me "de-interleave", never increase the interleaving.
-
ewwhite over 9 years
Please give your hardware and OS specifics.
-
BeeOnRope over 9 years
Let's say RHEL and commodity x86 servers (e.g., Dell PowerEdge gear).
-
ewwhite over 9 years
@BeeOnRope 2-socket machines? Not 4-socket? See my edit above.
-
BeeOnRope over 9 years
Let's say mostly 2-socket, but does it matter?
-
ewwhite over 9 years
@BeeOnRope Yeah, it matters a little. I've worked with a lot of 4-socket machines and have had to modify policy with some knowledge of the changes to the underlying architecture. Granted, that was about achieving the best locality, but I just wanted to check whether you were dealing with an edge case.