How to change the length of time-slices used by the Linux CPU scheduler?


Solution 1

For most RHEL7 servers, Red Hat suggests increasing sched_min_granularity_ns to 10 ms and sched_wakeup_granularity_ns to 15 ms. (Source. Technically this link says 10 μs, which would be 1000 times smaller. It is a mistake.)

We can try to understand this suggestion in more detail.

Increasing sched_min_granularity_ns

On current Linux kernels, CPU time slices are allocated to tasks by CFS, the Completely Fair Scheduler. CFS can be tuned using a few sysctl settings.

  • kernel.sched_min_granularity_ns
  • kernel.sched_latency_ns
  • kernel.sched_wakeup_granularity_ns

You can set sysctls temporarily until the next reboot, or permanently in a configuration file which is applied on each boot. To learn how to apply this type of setting, look up "sysctl" or read the short introduction here.
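
For example, here is a minimal sketch, run as root (the values are just the RHEL7 suggestion above converted to nanoseconds, and the drop-in file name is only an example; these sysctls appear under /proc/sys/kernel/ but may not be exposed on every kernel build):

    # temporary, lost at the next reboot
    sysctl -w kernel.sched_min_granularity_ns=10000000      # 10 ms
    sysctl -w kernel.sched_wakeup_granularity_ns=15000000   # 15 ms

    # persistent: put the same settings (without "sysctl -w") in a drop-in file,
    # e.g. /etc/sysctl.d/90-sched-tuning.conf, containing:
    #     kernel.sched_min_granularity_ns = 10000000
    #     kernel.sched_wakeup_granularity_ns = 15000000
    # then apply it now; it is re-applied automatically on each boot
    sysctl -p /etc/sysctl.d/90-sched-tuning.conf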

sched_min_granularity_ns is the most prominent setting. In the original sched-design-CFS.txt this was described as the only "tunable" setting, "to tune the scheduler from 'desktop' (low latencies) to 'server' (good batching) workloads."

In other words, we can change this setting to reduce overheads from context-switching, and therefore improve throughput at the cost of responsiveness ("latency").

I think of this CFS setting as mimicking the previous build-time setting, CONFIG_HZ. In the first version of the CFS code, the default value was 1 ms, equivalent to 1000 Hz for "desktop" usage. Other supported values of CONFIG_HZ were 250 Hz (the default), and 100 Hz for the "server" end. 100 Hz was also useful when running Linux on very slow CPUs; that was one of the reasons given when CONFIG_HZ was first added as a build setting on x86.

It sounds reasonable to try changing this value up to 10 ms (i.e. 100 Hz), and measure the results. Remember the sysctls are measured in ns. 1 ms = 1,000,000 ns.
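
One rough way to see whether it makes a difference is to watch the context-switch rate before and after the change; "./my_benchmark" below is just a placeholder for whatever workload you are measuring:

    # system-wide context-switch rate: watch the "cs" column
    vmstat 1

    # or count context switches for a single run of your workload
    perf stat -e context-switches,cpu-migrations ./my_benchmark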

We can see this old-school tuning for 'server' was still very relevant in 2011, for throughput in some high-load benchmark tests: https://events.static.linuxfound.org/slides/2011/linuxcon/lcna2011_rajan.pdf

And perhaps a couple of other settings

The default values of the three settings above look relatively close to each other. It makes me want to keep things simple and multiply them all by the same factor :-). But I tried to look into this and it seems some more specific tuning might also be relevant, since you are tuning for throughput.

sched_wakeup_granularity_ns concerns "wake-up pre-emption". I.e. it controls when a task woken by an event is able to immediately pre-empt the currently running process. The 2011 slides showed performance differences for this setting as well.

See also "Disable WAKEUP_PREEMPT" in this 2010 reference by IBM, which suggests that "for some workloads" this default-on feature "can cost a few percent of CPU utilization".

SUSE Linux has a doc which suggests that setting this larger than half of sched_latency_ns will effectively disable wake-up pre-emption, and then "short duty cycle tasks will be unable to compete with CPU hogs effectively".
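
If you want to see where your current values sit relative to that rule of thumb, here is a quick sketch (assuming your kernel exposes these files under /proc/sys/kernel/):

    lat=$(cat /proc/sys/kernel/sched_latency_ns)
    wake=$(cat /proc/sys/kernel/sched_wakeup_granularity_ns)
    if [ "$wake" -gt $((lat / 2)) ]; then
        echo "wake-up pre-emption effectively disabled (per the SUSE rule of thumb)"
    else
        echo "wake-up pre-emption still active"
    fi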

The SUSE document also gives some more detailed descriptions of the other settings. You should definitely check what the current default values are on your own systems, though. For example, the default values on my system seem slightly different from what the SUSE doc says.

https://www.suse.com/documentation/opensuse121/book_tuning/data/sec_tuning_taskscheduler_cfs.html

If you experiment with any of these scheduling variables, I think you should also be aware that all three are scaled (multiplied) by 1 + log2 of the number of CPUs. This scaling can be disabled using kernel.sched_tunable_scaling. I could be missing something, but this seems surprising, e.g. if you are considering the responsiveness of servers providing interactive apps and running at/near full load, and how that responsiveness will vary with the number of CPUs per server.
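
To make the scaling concrete, here is a rough sketch; the 3 ms figure is purely illustrative, not a documented default:

    # with 8 online CPUs: factor = 1 + log2(8) = 4,
    # so a nominal 3 ms granularity shows up as roughly 12 ms

    # current mode: 0 = no scaling, 1 = logarithmic (the default), 2 = linear
    cat /proc/sys/kernel/sched_tunable_scaling
    sysctl -w kernel.sched_tunable_scaling=0   # disable the scaling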

Suggestion if your workload has large numbers of threads / processes

I also came across a 2013 suggestion, for a couple of other settings, which may gain significant throughput if your workload has large numbers of threads. (Or perhaps more accurately, it re-gains the throughput which those workloads had obtained on pre-CFS kernels.)

Ignore CONFIG_HZ

I think you don't need to worry about what CONFIG_HZ is set to. My understanding is it is not relevant on current kernels, assuming you have reasonable timer hardware. See also commit 8f4d37ec073c, "sched: high-res preemption tick", found via this comment in a thread about the change: https://lwn.net/Articles/549754/ .

(If you look at the commit, I wouldn't worry that SCHED_HRTICK depends on X86. That requirement seems to have been dropped in some more recent commit).
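
If you do want to confirm what your own kernel was built with, the usual place to look is the config file shipped alongside the kernel (the path varies by distribution):

    grep -E 'CONFIG_HZ=|CONFIG_HIGH_RES_TIMERS|CONFIG_SCHED_HRTICK' /boot/config-"$(uname -r)"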

Solution 2

It looks like you need the batch scheduler: use schedtool to run processes under different scheduling policies, e.g. schedtool -B «Command to be run in batch mode»
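
For what it's worth, a rough sketch of how that might look ("my_batch_job" and «PID» are placeholders; in the schedtool versions I have seen, a command to execute goes after -e):

    # start a new process under SCHED_BATCH
    schedtool -B -e my_batch_job
    chrt --batch 0 my_batch_job        # roughly equivalent, using chrt from util-linux

    # move an already-running process, then check its policy
    schedtool -B «PID»
    chrt -p «PID»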

Author by user2948306

Pronouns: he/him (they/them is fine too). If I link to my bug reports (or patches), I'm the one known as Alan Jenkins.

Updated on September 18, 2022

Comments

  • user2948306 about 1 year

    Is it possible to increase the length of time-slices, which the Linux CPU scheduler allows a process to run for? How could I do this?

    Background knowledge

    This question asks how to reduce how frequently the kernel will force a switch between different processes running on the same CPU. This is the kernel feature described as "pre-emptive multi-tasking". This feature is generally good, because it stops an individual process hogging the CPU and making the system completely non-responsive. However, switching between processes has a cost, so there is a tradeoff.

    If you have one process which uses all the CPU time it can get, and another process which interacts with the user, then switching more frequently can reduce delayed responses.

    If you have two processes which use all the CPU time they can get, then switching less frequently can allow them to get more work done in the same time.

    Motivation

    I am posting this based on my initial reaction to the question How to change Linux context-switch frequency?

    I do not personally want to change the timeslice. However I vaguely remember this being a thing, with the CONFIG_HZ build-time option. So I want to know what the current situation is. Is the CPU scheduler time-slice still based on CONFIG_HZ?

    Also, in practice build-time tuning is very limiting. For Linux distributions, it is much more practical if they can have a single kernel per CPU architecture, and allow configuring it at runtime or at least at boot-time. If tuning the time-slice is still relevant, is there a new method which does not lock it down at build-time?

  • user2948306 about 5 years
    Hi! I created this question as a home for my answer. We've all explicitly acknowledged this is a tradeoff vs. responsiveness. I know that I'm asking about CPU-bound tasks, because that's the only reason fiddling with the scheduler parameters can reduce the context switch frequency in the first place. Do you have a preferred alternative phrasing of the question, that would make it more clear to others?
  • chrishollinworth about 5 years
    (Really my response should have been a comment rather than an answer). Yes, I think it should have a caveat, maybe prefixed with.... "Given a situation where a system is spending a lot of time with high CPU usage and high load....."?
  • user2948306 about 5 years
    I think you had a valid point. I ended up spelling it out a bit differently, sorry. Your formulation makes sense to me, but that is based on my understanding of "load average" as a specific technical term :-). I also added the best link I could find for "context switch overhead", to cite why the cost can be significant. (The number of cycles the kernel spends switching between the tasks is not really what I'm concerned about here).
  • user2948306 almost 5 years
    It's not really what I was aiming at, though it is related. I just updated my answer to note that tuning CFS like this remains a recommendation for most general-purpose RHEL7 servers. sched_batch is something a bit different. sched_batch runs at idle priority, so it will be starved if the system is already CPU-bound.
  • user2948306 almost 5 years
    IMO this makes sched_batch not widely useful on its own. I suppose one example is you could copy ktask, split your work into one normal thread, and N-1 background threads, so you can still make progress, but you don't compete for all the available CPUs (except with other batch tasks).
  • user2948306 almost 5 years
    I've given in :-) and added a motivation section to explain why I thought about this in the first place. I.e. people used to talk about tuning CONFIG_HZ, so what is the situation nowadays?
  • ctrl-alt-delor almost 5 years
    batch only works if all batch tasks are put into the batch scheduler.