Can I configure my Linux system for more aggressive file system caching?

Solution 1

Improving disk cache performance in general is about more than just increasing the file system cache size. The exception is when your whole system fits in RAM: in that case you should use a RAM drive (tmpfs is good because it allows falling back to disk if you need the RAM for something else) for runtime storage, and perhaps an initrd script that copies the system from storage to the RAM drive at startup.
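For illustration, serving a runtime directory from RAM only takes one fstab line; the path and the 2G size below are placeholders, not part of my setup:

# example /etc/fstab entry for a tmpfs-backed runtime directory (placeholders only)
tmpfs  /srv/runtime  tmpfs  size=2G,mode=0755  0  0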

You didn't say whether your storage device is an SSD or an HDD. Here's what I've found to work for me (in my case sda is an HDD mounted at /home and sdb is an SSD mounted at /).

First optimize the load-stuff-from-storage-to-cache part:

Here's my setup for HDD (make sure AHCI+NCQ is enabled in BIOS if you have toggles):

echo cfq > /sys/block/sda/queue/scheduler
echo 10000 > /sys/block/sda/queue/iosched/fifo_expire_async
echo 250 > /sys/block/sda/queue/iosched/fifo_expire_sync
echo 80 > /sys/block/sda/queue/iosched/slice_async
echo 1 > /sys/block/sda/queue/iosched/low_latency
echo 6 > /sys/block/sda/queue/iosched/quantum
echo 5 > /sys/block/sda/queue/iosched/slice_async_rq
echo 3 > /sys/block/sda/queue/iosched/slice_idle
echo 100 > /sys/block/sda/queue/iosched/slice_sync
hdparm -q -M 254 /dev/sda

Worth noting for the HDD case are the high fifo_expire_async (usually writes) and the long slice_sync, which allows a single process to get high throughput (set slice_sync to a lower number if you hit situations where multiple processes are waiting for data from the disk in parallel). The slice_idle is always a compromise for HDDs, but setting it somewhere in the range 3-20 should be okay depending on disk usage and disk firmware. I prefer to target low values, but setting it too low will destroy your throughput.

The quantum setting seems to affect throughput a lot, but try to keep it as low as possible to keep latency at a sensible level. Setting quantum too low will destroy throughput. Values in the range 3-8 seem to work well with HDDs. The worst-case latency for a read is (quantum * slice_sync) + (slice_async_rq * slice_async) ms, if I've understood the kernel behavior correctly.

The async queue is mostly used by writes, and since you're willing to delay writing to disk, set both slice_async_rq and slice_async to very low numbers. However, setting slice_async_rq too low may stall reads, because writes can then no longer be delayed after reads. My config will write data to disk at most 10 seconds after it has been passed to the kernel, but since you can tolerate loss of data on power loss, you can also set fifo_expire_async to 3600000 to say that a 1 hour delay to disk is okay. Just keep slice_async low, though, because otherwise you can get high read latency.
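To make that worst-case formula concrete with the values above (quantum=6, slice_sync=100, slice_async_rq=5, slice_async=80), the arithmetic is simple:

# worst-case read latency in ms: (quantum * slice_sync) + (slice_async_rq * slice_async)
echo $(( (6 * 100) + (5 * 80) ))   # prints 1000, i.e. about 1 second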

The hdparm command is required to prevent AAM (Automatic Acoustic Management) from killing much of the performance that AHCI+NCQ allows. If your disk makes too much noise, then skip this.

Here's my setup for SSD (Intel 320 series):

echo cfq > /sys/block/sdb/queue/scheduler
echo 1 > /sys/block/sdb/queue/iosched/back_seek_penalty
echo 10000 > /sys/block/sdb/queue/iosched/fifo_expire_async
echo 20 > /sys/block/sdb/queue/iosched/fifo_expire_sync
echo 1 > /sys/block/sdb/queue/iosched/low_latency
echo 6 > /sys/block/sdb/queue/iosched/quantum
echo 2 > /sys/block/sdb/queue/iosched/slice_async
echo 10 > /sys/block/sdb/queue/iosched/slice_async_rq
echo 1 > /sys/block/sdb/queue/iosched/slice_idle
echo 20 > /sys/block/sdb/queue/iosched/slice_sync

Here it's worth noting the low values for the different slice settings. The most important setting for an SSD is slice_idle, which must be set to 0-1. Setting it to zero moves all ordering decisions to native NCQ, while setting it to 1 allows the kernel to order requests (but if NCQ is active, the hardware may partially override the kernel ordering). Test both values to see if you can notice the difference. For the Intel 320 series, it seems that setting slice_idle to 0 gives the best throughput while setting it to 1 gives the best (lowest) overall latency.

If you have a recent enough kernel, you can use slice_idle_us to set the value in microseconds instead of milliseconds, e.g. echo 14 > slice_idle_us. A suitable value seems to be close to 700000 divided by the maximum practical IOPS your storage device can support, so 14 is okay for pretty fast SSD devices.
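If you want a quick way to estimate slice_idle_us from that rule of thumb, it's just a division; the 50000 IOPS figure below is only an example, not a measurement of any particular drive:

# slice_idle_us ~= 700000 / max practical IOPS (rule of thumb from above)
echo $(( 700000 / 50000 ))   # a ~50K IOPS SSD -> 14 microseconds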

For more information about these tunables, see https://www.kernel.org/doc/Documentation/block/cfq-iosched.txt .

Update in year 2020 and kernel version 5.3 (cfq is dead):

modprobe bfq
for d in /sys/block/sd?
do
        # HDD (tuned for Seagate SMR drive)
        echo bfq > "$d/queue/scheduler"
        echo 4 > "$d/queue/nr_requests"
        echo 32000 > "$d/queue/iosched/back_seek_max"
        echo 3 > "$d/queue/iosched/back_seek_penalty"
        echo 80 > "$d/queue/iosched/fifo_expire_sync"
        echo 1000 > "$d/queue/iosched/fifo_expire_async"
        echo 5300 > "$d/queue/iosched/slice_idle_us"
        echo 1 > "$d/queue/iosched/low_latency"
        echo 200 > "$d/queue/iosched/timeout_sync"
        echo 0 > "$d/queue/iosched/max_budget"
        echo 1 > "$d/queue/iosched/strict_guarantees"

        # additional tweaks for SSD (tuned for Samsung EVO 850):
        if test $(cat "$d/queue/rotational") = "0"
        then
                echo 36 > "$d/queue/nr_requests"
                echo 1 > "$d/queue/iosched/back_seek_penalty"
                # slice_idle_us should be ~ 0.7/IOPS in µs
                echo 16 > "$d/queue/iosched/slice_idle_us"
                echo 10 > "$d/queue/iosched/fifo_expire_sync"
                echo 250 > "$d/queue/iosched/fifo_expire_async"
                echo 10 > "$d/queue/iosched/timeout_sync"
                echo 0 > "$d/queue/iosched/strict_guarantees"
        fi
done

The setup is pretty similar, but I now use bfq instead of cfq because the latter is not available with modern kernels. I try to keep nr_requests as low as possible to allow bfq to control the scheduling more accurately. At least Samsung SSD drives seem to require a pretty deep queue to be able to run with high IOPS.

Update: many Samsung SSDs have a firmware bug and can hang the whole device if nr_requests is too high and the OS submits lots of requests rapidly. I've seen a random freeze about once every 2 months with a high nr_requests (e.g. 32 or 36), but the value 6 has been stable so far. The official fix is to set it to 1, but that hurts the performance a lot! For more details, see https://bugzilla.kernel.org/show_bug.cgi?id=203475 and https://bugzilla.kernel.org/show_bug.cgi?id=201693 – basically, if you have a Samsung SSD device and see failed command: WRITE FPDMA QUEUED in the kernel log, you've been bitten by this bug.

I'm using Ubuntu 18.04 with the kernel package linux-lowlatency-hwe-18.04-edge, which has bfq only as a module, so I need to load it before being able to switch to it.
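If you want the module load and scheduler choice to survive a reboot without /etc/rc.local, something like the following should work; the file names are arbitrary, and note that the udev rule only selects the scheduler, so the per-device tunables above still need a script or extra rules:

# /etc/modules-load.d/bfq.conf (load the bfq module at boot)
bfq

# /etc/udev/rules.d/60-ioscheduler.rules (select bfq for SCSI/SATA disks)
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="bfq"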

Nowadays I also use zram, but I only use 5% of RAM for it. This allows the Linux kernel to use its swapping-related logic without touching the disks. However, if you decide to go with zero disk swap, make sure your apps do not leak RAM or you're wasting money.
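For reference, a hand-rolled zram swap looks roughly like this (distributions usually ship their own scripts for it, and the 512M size is just an example standing in for about 5% of RAM):

modprobe zram
zramctl --find --size 512M --algorithm zstd   # prints the device it set up, e.g. /dev/zram0
mkswap /dev/zram0
swapon -p 100 /dev/zram0   # high priority so zram is used before any disk swap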

Now that we have configured the kernel to load stuff from disk to cache with sensible performance, it's time to adjust the cache behavior:

According to the benchmarks I've done, I wouldn't bother setting read-ahead via blockdev at all. The kernel default settings are fine.

Set the system to prefer reclaiming file data over swapping out application code (this does not matter if you have enough RAM to keep the whole filesystem, all the application code and all the virtual memory allocated by applications in RAM). This prioritizes low latency when switching between different applications over the latency of accessing big files from a single application:

echo 15 > /proc/sys/vm/swappiness

If you prefer to keep applications nearly always in RAM, you could set this to 1. If you set it to zero, the kernel will not swap at all unless absolutely necessary to avoid OOM. If you are memory limited and working with big files (e.g. HD video editing), then it might make sense to set this close to 100.

Nowadays (2017) I prefer to have no swap at all if you have enough RAM. Having no swap will usually lose 200-1000 MB of RAM on a long-running desktop machine. I'm willing to sacrifice that much to avoid the worst-case latency (swapping application code back in when RAM is full). In practice, this means that I prefer the OOM Killer to swapping.

If you allow/need swapping, you might want to increase /proc/sys/vm/watermark_scale_factor too, to avoid some latency. I would suggest values between 100 and 500. You can consider this setting as trading CPU usage for lower swap latency. The default is 10 and the maximum possible is 1000. A higher value should (according to the kernel documentation) result in higher CPU usage for the kswapd processes and lower overall swapping latency.
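To make these settings persistent, you can put them into a sysctl configuration file instead of echoing them from /etc/rc.local; the file name below is arbitrary and the watermark_scale_factor value is just one pick from the 100-500 range suggested above:

# /etc/sysctl.d/99-cache-tuning.conf (example file name)
vm.swappiness = 15
vm.watermark_scale_factor = 200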

Next, tell the kernel to prefer keeping the directory hierarchy in memory over file contents and the rest of the page cache when some RAM needs to be freed (again, if everything fits in RAM, this setting does nothing):

echo 10 > /proc/sys/vm/vfs_cache_pressure

Setting vfs_cache_pressure to a low value makes sense because in most cases the kernel needs to know the directory structure and other filesystem metadata before it can use file contents from the cache, and flushing the directory cache too soon makes the file cache next to worthless. However, the page cache contains other data than just file contents, so this setting should be considered as the overall importance of metadata caching versus the rest of the system.

Consider going all the way down to 1 with this setting if you have lots of small files (my system has around 150K 10-megapixel photos and counts as a "lots of small files" system). Never set it to zero, or the directory structure is always kept in memory even if the system is running out of memory. Setting this to a big value is sensible only if you have just a few big files that are constantly being re-read (again, HD video editing without enough RAM would be an example case). The official kernel documentation says that "increasing vfs_cache_pressure significantly beyond 100 may have negative performance impact".

Year 2021 update: After running with kernel version 5.4 for long enough, I've come to the conclusion that a very low vfs_cache_pressure setting (I used to run with 1 for years) may now cause long stalls / bad latency when memory pressure gets high enough. However, I never noticed such behavior with kernel version 5.3 or older.

Year 2022 update: I've been running the kernel 5.4.x series for another year and I've come to the conclusion that the behavior of vfs_cache_pressure has changed permanently. The kernel memory manager behavior that I used to get with kernel version 5.3 or older and values in the range 1..5 seems to match the real-world behavior of 5.4 with values in the range 100..120. The newer kernels make this adjustment matter more, so I'd recommend vfs_cache_pressure=120 nowadays for low overall latency. Kernel version 5.3 or older should use a very low but non-zero value here, in my opinion.

Exception: if you have a truly massive amount of files and directories and you rarely touch/read/list all of them, setting vfs_cache_pressure higher than 100 may be wise. This only applies if you do not have enough RAM to keep the whole directory structure in RAM while still having enough RAM for the normal file cache and processes (e.g. a company-wide file server with lots of archival content). If you feel that you need to increase vfs_cache_pressure above 100, you're running without enough RAM. Increasing vfs_cache_pressure may help, but the only real fix is to get more RAM. Setting vfs_cache_pressure to a high number sacrifices average performance for more stable performance overall (that is, you can avoid really bad worst-case behavior but have to deal with worse overall performance).
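If you want to see how much RAM the metadata caches actually use on your system before tuning this, the slab statistics give a rough idea (these commands need root):

grep -E 'dentry|inode_cache' /proc/slabinfo
slabtop -o -s c | head -20   # one-shot view, sorted by cache size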

Finally, tell the kernel to allow up to 99% of the RAM to be used as a cache for writes before the process that's writing gets throttled, and to start writing dirty data to disk in the background only once 50% of the RAM is used for the write cache (the default for dirty_background_ratio is 10). Warning: I personally would not do this, but you claimed to have enough RAM and are willing to lose the data.

echo 99 > /proc/sys/vm/dirty_ratio
echo 50 > /proc/sys/vm/dirty_background_ratio

And tell the kernel that a 1 hour delay is okay before it even starts writing stuff to the disk (again, I would not do this):

echo 360000 > /proc/sys/vm/dirty_expire_centisecs
echo 360000 > /proc/sys/vm/dirty_writeback_centisecs
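With settings like these you can watch how much dirty data is actually waiting to be written back; /proc/meminfo shows it directly:

grep -E '^(Dirty|Writeback):' /proc/meminfo   # amount of data pending writeback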

For more information about these tunables, see https://www.kernel.org/doc/Documentation/sysctl/vm.txt

If you put all of those in /etc/rc.local and include the following at the end, everything will be in the cache as soon as possible after boot (only do this if your filesystem really fits in the RAM):

(nice find / -type f -and -not -path '/sys/*' -and -not -path '/proc/*' -print0 2>/dev/null | nice ionice -c 3 wc -l --files0-from - > /dev/null)&

Or a bit simpler alternative which might work better (cache only /home and /usr, only do this if your /home and /usr really fit in RAM):

(nice find /home /usr -type f -print0 | nice ionice -c 3 wc -l --files0-from - > /dev/null)&
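If you happen to have the third-party tool vmtouch installed, it can do the same cache warming a bit more directly; this is just an alternative sketch, not something from my own setup:

nice ionice -c 3 vmtouch -t /home /usr &   # read everything under /home and /usr into the page cache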

Solution 2

Firstly, I DO NOT recommend that you continue using NTFS, as the NTFS implementation in Linux can be a performance and security problem at any time.

There are several things you can do:

  • use some newer fs such as ext4 or btrfs
  • try to change your io scheduler, for example bfq
  • turn off swap
  • use some automatic preloader like preload
  • use something like systemd to preload while booting
  • ... and something more

Maybe you want to give it a try :-)

Solution 3

Read ahead:

On 32 bit systems:

blockdev --setra 8388607 /dev/sda

On 64 bit systems:

blockdev --setra 4294967295 /dev/sda

Write behind cache:

echo 90 > /proc/sys/vm/dirty_ratio # too high rates can cause crash

This will use up to 90% of your free memory as write cache.

Or you can go all out and use tmpfs. This is only relevant if you have enough RAM. Put this in /etc/fstab and replace 50% with the amount of your choice. It is always a percentage of your whole RAM. Also, 8G works as 8 GB and 3M as 3 MB.

# You can do this with more things, such as the /tmp folder. The 50% can be replaced
# with 4G, 5G, ...; 50% is half of the whole RAM. Higher values also work because the
# excess will go into swap.
tmpfs /mnt/tmpfs tmpfs size=50%,rw,nosuid,nodev 0 0

Then:

mkdir /mnt/tmpfs; mount -a

Then use /mnt/tmpfs.

Solution 4

You can set the read-ahead size with blockdev --setra sectors /dev/sda1, where sectors is the size you want in 512 byte sectors.
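For example, to check the current value before changing it (the 1024 below is only an illustration, not a recommendation):

blockdev --getra /dev/sda1          # show current readahead in 512-byte sectors
blockdev --setra 1024 /dev/sda1     # 1024 sectors = 512 KiB of readahead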

Solution 5

Not related to write caching, but related to writes:

  • For an ext4 system, you could disable journaling entirely

    This will reduce the number of disk writes for any particular update, but may leave the filesystem in an inconsistent state after an unexpected shutdown, requiring an fsck or worse. A rough sketch of how to do it follows below.
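The sketch uses tune2fs; /dev/sdXN is a placeholder, and the filesystem must be unmounted (or mounted read-only) while the journal is removed:

umount /dev/sdXN                    # or remount it read-only first
tune2fs -O ^has_journal /dev/sdXN   # drop the journal feature from the ext4 filesystem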

To stop disk reads from triggering disk writes:

  • Mount with the relatime or the noatime option

    When you read a file, the "last accessed time" (atime) metadata for that file is usually updated. The noatime option disables that behaviour. This reduces unnecessary disk writes, but you will no longer have that metadata. Some distributions (e.g. Manjaro) have adopted this as the default on all partitions (probably to increase the lifespan of earlier-model SSDs). See the example fstab line after this list.

    relatime updates the access time less frequently, according to heuristics that help to support applications which do use the atime. This is the default on Red Hat Enterprise Linux.
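A hypothetical fstab line with noatime would look like this (the UUID and mount point are placeholders):

UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /home  ext4  defaults,noatime  0  2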

Other options:

  • In the comments above, Mikko shared the possibility of mounting with the nobarrier option. But Ivailo quoted RedHat who caution against it. How badly do you want that extra 3%?


Updated on September 18, 2022

Comments

  • Ivan
    Ivan over 1 year

    I am neither concerned about RAM usage (as I've got enough) nor about losing data in case of an accidental shut-down (as my power is backed, the system is reliable and the data are not critical). But I do a lot of file processing and could use some performance boost.

    That's why I'd like to set the system up to use more RAM for file system read and write caching, to prefetch files aggressively (e.g. read ahead the whole file accessed by an application if the file is of sane size, or at least read ahead a big chunk of it otherwise) and to flush writing buffers less frequently. How can I achieve this (if it is possible)?

    I use ext3 and ntfs (I use ntfs a lot!) file systems with XUbuntu 11.10 x86.

    • Nils
      Nils about 12 years
      Do you have a raid-controller or a "normal" disc controller capable of doing write-ahead?
    • Batfan
      Batfan about 12 years
      If you have lots of RAM, care a lot about performance and don't care about data loss, just copy all your data to a RAM disk and serve it from there, discarding all updates on crash/shutdown. If that won't work for you, you may need to qualify "enough" for RAM or how critical the data isn't.
    • Ivan
      Ivan about 12 years
      @Nils, the computer is a laptop, so, I believe, the controller is pretty ordinary.
    • Nils
      Nils about 12 years
      So the controller won't help you here. Can you comment on the answers about how the corresponding settings improved your throughput?
    • Mikko Rantalainen
      Mikko Rantalainen about 10 years
      One way to improve performance a lot is to skip durability of the data. Simply disable syncing to disk even if some app requests a sync. This will cause data loss if your storage device ever suffers a loss of electricity. If you want to do it anyway, simply execute sudo mount -o remount,nobarrier /path/to/mountpoint or adjust /etc/fstab to include nobarrier for any filesystem that you're willing to sacrifice for improved performance. However, if your storage device has an internal battery, such as the Intel 320 SSD series, using nobarrier causes no data loss.
    • DocSalvager
      DocSalvager over 5 years
      Two points -- 1) There are Linux distros based on Debian or Ubuntu, like Puppy Linux and AntiX Linux, and many others, that put the whole operating system in layered ramdisk partitions (i.e. AUFS or overlayfs) and manage it transparently. Very fast! -- 2) We discovered in real-world design of a very large system that throwing more cache at it can REDUCE PERFORMANCE. As storage speeds increase (i.e. SSD), the optimal cache size needed decreases. There is no way to know what that size is without experimentation on your particular system, though. If increasing it is not working, try reducing it.
    • peterh
      peterh over 5 years
      @IvailoBardarov I suspect, if also the virtual machines are using nobarrier, it is not so bad.
  • Ivan
    Ivan over 12 years
    I've already moved entirely away from NTFS to ext4 once, leaving the only NTFS partition to be the Windows system partition. But it resulted in many inconveniences for me and I have turned back to NTFS as the file system for my main data partition (where I store all my documents, downloads, projects, source code etc.). I haven't given up on rethinking my partition structure and my workflow (to use less Windows), but right now giving up NTFS doesn't seem a realistic option.
  • Felix Yan
    Felix Yan over 12 years
    If you have to use your data inside Windows too, NTFS may be the only option. (many other options available if you can use your Windows just as a VM inside linux)
  • Vladimir Panteleev
    Vladimir Panteleev over 11 years
    A well-informed and overall much better answer than the accepted one! This one is underrated... I guess most people just want simple instructions without bothering to understand what they really do...
  • Mikko Rantalainen
    Mikko Rantalainen over 11 years
    @Phpdevpad: In addition, the question said "I am neither concerned about RAM usage [...]"--I don't think any Maemo device qualifies.
  • rep_movsd
    rep_movsd almost 11 years
    Isn't noop or deadline a better scheduler for SSDs?
  • Mikko Rantalainen
    Mikko Rantalainen almost 11 years
    @rep_movsd I've been using only intel SSD drives but at least these drives are still slow enough to have better overall performance with more intelligent schedulers such as CFQ. I'd guess that if your SSD drive can deal with more than 100K random IOPS, using noop or deadline would make sense even with fast CPU. With "fast CPU" I mean something that has at least multiple 3GHz cores available for IO only.
  • Cobra_Fast
    Cobra_Fast over 10 years
    3GB or 2TB readahead? really? Do you even know what these options do?
  • Mikko Rantalainen
    Mikko Rantalainen over 9 years
    Setting vfs_cache_pressure too high (I would consider 2000 too high) will cause unnecessary disk access even for simple stuff such as directory listings which should easily fit in cache. How much RAM do you have and what are you doing with the system? As I wrote in my answer, using high value for this setting makes sense for e.g. HD video editing with limited RAM.
  • syss
    syss almost 9 years
    @Cobra_Fast Do you know what it means? I really have no idea and I am interested now.
  • Cobra_Fast
    Cobra_Fast almost 9 years
    @syss the readahead settings are saved as number of memory "blocks", not bytes or bits. The size of one block is determined at kernel compilation time (since readahead-blocks are memory blocks) or filesystem creation time in some cases. Normally though, 1 block contains 512 or 4096 bytes. See linux.die.net/man/8/blockdev
  • underscore_d
    underscore_d over 8 years
    A summary of what these supposed problems are of NTFS would have been useful.
  • Mikko Rantalainen
    Mikko Rantalainen about 6 years
    Note that the referenced documentation continues: "Increasing vfs_cache_pressure significantly beyond 100 may have negative performance impact. Reclaim code needs to take various locks to find freeable directory and inode objects. With vfs_cache_pressure=1000, it will look for ten times more freeable objects than there are."
  • Mikko Rantalainen
    Mikko Rantalainen about 6 years
    NTFS on Linux is pretty much acceptable except for the performance. Considering that the question was specifically about improving file system performance, NTFS should be the first thing to go.
  • Mikko Rantalainen
    Mikko Rantalainen about 6 years
    Even though btrfs is a recently designed file system, I would avoid it if performance is needed. We've been running otherwise identical systems with btrfs and ext4 file systems and ext4 wins in the real world by a big margin (btrfs seems to require about 4x the CPU time ext4 needs for the same performance level and causes more disk operations for a single logical command). Depending on the workload, I would suggest ext4, jfs or xfs for any performance-demanding work.
  • elpie89
    elpie89 over 5 years
    You can also read about these vm tunables from the vm kernel docs.
  • Andreus
    Andreus over 5 years
    I am violating the specific instructions in the text box when I say: This has to be one of the best-written answers I have ever seen on this site.
  • peterh
    peterh about 5 years
    Linux NTFS works in userspace. The worst thing that could happen is a crash of the daemon, resulting in an fs failure and a minor corruption. But I have never seen it. Another problem is that it can't follow the NTFS permissions and ACLs perfectly, and we have very few non-m$ tools that handle them.
  • clerksx
    clerksx over 4 years
    I'm afraid that disabling swap may actually end up making application latency worse overall, since it reduces reclaim efficiency by denying an entire class of memory from reclaim, resulting in reclaiming even the hottest file cache page over the coldest, most stagnant anonymous page. You're still going to have to do I/O in critical path (just now for hot file pages, instead of cold(er) anon pages...): chrisdown.name/2018/01/02/in-defence-of-swap.html
  • Mikko Rantalainen
    Mikko Rantalainen about 4 years
    @ChrisDown Yeah, you're correct in theory. However, in reality if your system is running out of memory and you have any swap, the system will spend extra time playing with the swap partition before giving up and running OOM Killer. Basically, if your system has adequate amount of RAM the only case when swap is needed is some process getting out of control. Getting OOM Killer to run sooner is the preferred way. After saying that, I have to admit that I'm nowadays running zram but I have modified its implementation to reserve only 5% of the RAM in file /usr/bin/init-zram-swapping.
  • clerksx
    clerksx about 4 years
    @MikkoRantalainen I'm correct in practice, not just in theory ;-) I literally work on memory management and kernel swap behaviour at the scale of millions of machines as my day to day work, and the above article is based on that work.
  • Mikko Rantalainen
    Mikko Rantalainen about 4 years
    @ChrisDown: I agree that if you don't want to waste some RAM you should have at least some swap. I also understand that kernel will start to page in executables if no swap and no RAM. However, for absolutely minimum latency one should do without any swap and start killing processes long before disk cache gets too small. BTW, do you know any trick to get OOM Killer to activate when sum of Cached and MemAvailable is below e.g. 2 GB? I know about earlyoom but I would need a kernel fix to combat a process gone seriously wrong and any pure user mode solution is simply too slow to react.
  • clerksx
    clerksx about 4 years
    "However, for absolutely minimum latency one should do without any swap and start killing processes long before disk cache gets too small." -- not so, even for systems with extreme latency requirements, we successfully use oomd + swap without negative effects. // "any trick to get OOM Killer to activate when sum of Cached and MemAvailable is below e.g. 2 GB?" -- this is a poor approximation of actual memory pressure, but you can make a script triggering sysrq-based oom. You can see why they don't work here: youtu.be/beefUhRH5lU?t=1505
  • NothingsImpossible
    NothingsImpossible almost 3 years
    This is awesome. I've got a linux VM with 768MB ram on a very slow hard drive plus 2GB swap on a slow SSD. It was absolutely unusable because I/O heavy processes starved other processes and the system became unresponsive. This actually made it usable. I think people that are against this answer are misunderstanding the problem.
  • Mikko Rantalainen
    Mikko Rantalainen almost 3 years
    @NothingsImpossible: do you mean that adjusting bfq and other settings in this answer did help or did you implement the things Chris Down referenced? (I'm the author of this answer and I think you should implement both for best results. The idea to split different tasks to memory cgroups and only set minimum required for each group is genius. I'm still trying to figure out how to implement that in practice with Ubuntu and systemd.)
  • nnunes
    nnunes over 2 years
    On the use of relatime, mount(8) reads: "Since Linux 2.6.30, the kernel defaults to the behavior provided by this option (unless noatime was specified), and the strictatime option is required to obtain traditional semantics. In addition, since Linux 2.6.30, the file’s last access time is always updated if it is more than 1 day old."
  • NothingsImpossible
    NothingsImpossible about 2 years
    @MikkoRantalainen I just applied the settings from your answer, I didn't touch Chris Down's recommendations. I've been away for some months... gonna try Down's tips some time.