Unusually high dentry cache usage
Solution 1
Am I correct in thinking that the Slab memory is always physical RAM, and the number is already subtracted from the MemFree value?
Yes.
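As a rough illustration of how the relevant fields relate, the following Python sketch parses /proc/meminfo-style text and checks that Slab is the sum of SReclaimable and SUnreclaim. The sample values are copied from the question; the helper name parse_meminfo is made up for this example.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number} (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


# Values taken from the /proc/meminfo output posted in the question.
SAMPLE = """\
MemTotal: 132145324 kB
MemFree: 53620068 kB
Slab: 46240380 kB
SReclaimable: 44561644 kB
SUnreclaim: 1678736 kB
"""

info = parse_meminfo(SAMPLE)
slab_share = float(info["Slab"]) / info["MemTotal"]
print("Slab is %.1f%% of MemTotal" % (slab_share * 100))
print("Slab == SReclaimable + SUnreclaim:",
      info["Slab"] == info["SReclaimable"] + info["SUnreclaim"])
```

On a live system, the same function can be fed the contents of /proc/meminfo, which is readable without root.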
Is such a high number of dentry entries normal? The PHP application has access to around 1.5 M files, however most of them are archives and not being accessed at all for regular web traffic.
Yes, if the system isn't under memory pressure. It has to use the memory for something, and it's possible that in your particular pattern of usage, this is the best way to use that memory.
What could be an explanation for the fact that the number of cached inodes is much lower than the number of cached dentries, should they not be related somehow?
Lots of directory operations would be the most likely explanation.
If the system runs into memory trouble, should the kernel not free some of the dentries automatically? What could be a reason that this does not happen?
It should, and I can't think of any reason it wouldn't. I'm not convinced that this is what actually went wrong. I'd strongly suggest upgrading your kernel or increasing vfs_cache_pressure further.
Is there any way to "look into" the dentry cache to see what all this memory is (i.e. what are the paths that are being cached)? Perhaps this points to some kind of memory leak, symlink loop, or indeed to something the PHP application is doing wrong.
I don't believe there is. I'd look for any directories with absurdly large numbers of entries or very deep directory structures that are searched or traversed.
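One way to look for such directories as a regular user is to count the immediate entries of every directory under a given root. This is only a sketch; the function names are made up for illustration, and walking a large tree is itself a source of dentry lookups, so it may be wise to run it on a subtree first.

```python
import os


def count_entries(root):
    """Return a list of (entry_count, directory_path) for each directory under root."""
    counts = []
    # followlinks=False avoids descending into symlinked directories (and loops).
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        counts.append((len(dirnames) + len(filenames), dirpath))
    return counts


def largest_directories(root, top=10):
    """Return the `top` directories under root with the most immediate entries."""
    return sorted(count_entries(root), reverse=True)[:top]
```

For example, largest_directories("/var/www", top=20) would list the 20 fullest directories below /var/www, with /var/www being one of the mount points from the question.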
The PHP application code as well as all asset files are mounted via GlusterFS network file system, could that have something to do with it?
It could definitely be a filesystem issue. A filesystem bug that prevents dentries from being released, for example, is a possibility.
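To see which network or FUSE filesystems are involved, the mount table can be filtered by filesystem type. The sketch below assumes the usual /proc/self/mounts layout (device, mount point, type, options); the function name is hypothetical, and the sample lines are taken from the question with the NFS options shortened.

```python
def network_or_fuse_mounts(mounts_text):
    """Return (mount_point, fs_type) pairs for FUSE and NFS mounts."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        if fs_type.startswith("fuse") or fs_type.startswith("nfs"):
            found.append((mount_point, fs_type))
    return found


# Lines from the question's mount table (NFS options shortened for brevity).
SAMPLE = """\
/dev/mapper/sysvg-lv_root / ext4 rw,relatime,barrier=1,data=ordered 0 0
/etc/glusterfs/glusterfs-www.vol /var/www fuse.glusterfs rw,relatime,user_id=0,group_id=0 0 0
172.17.39.78:/www /data/www nfs rw,relatime,vers=3 0 0
"""

for mount_point, fs_type in network_or_fuse_mounts(SAMPLE):
    print(mount_point, fs_type)
```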
Solution 2
Confirmed Solution
For anyone who might run into the same problem: the data center guys finally figured it out today. The culprit was an NSS (Network Security Services) library bundled with libcurl. An upgrade to the newest version solved the problem.
A bug report that describes the details is here:
https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=1044666
Apparently, in order to determine whether some path is local or on a network drive, NSS was looking up a nonexistent file and measuring the time it took for the file system to report back! If you have a large enough number of curl requests and enough memory, these lookups are all cached and stack up.
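The effect can be approximated with a small experiment: on Linux, a failed lookup can leave a "negative" dentry in the cache, and /proc/sys/fs/dentry-state reports the total and unused dentry counts in its first two fields. The sketch below is an assumption-laden illustration, not what NSS does internally: it uses os.access as a stand-in for the NSS check, and all function names are invented.

```python
import os
import uuid


def parse_dentry_state(text):
    """Return (nr_dentry, nr_unused) from /proc/sys/fs/dentry-state text."""
    fields = text.split()
    return int(fields[0]), int(fields[1])


def read_dentry_state(path="/proc/sys/fs/dentry-state"):
    """Read the current dentry counters (Linux only)."""
    with open(path) as f:
        return parse_dentry_state(f.read())


def probe_missing_files(directory, count):
    """Check `count` random names that should not exist in `directory`.

    Returns how many of the checks reported the file as missing.
    """
    missing = 0
    for _ in range(count):
        name = os.path.join(directory, ".probe-" + uuid.uuid4().hex)
        if not os.access(name, os.F_OK):
            missing += 1
    return missing
```

Comparing read_dentry_state() before and after a call such as probe_missing_files("/tmp", 100000) gives a rough idea of how quickly such lookups add up; other activity on the host changes the counters too, so the difference is only indicative.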
Solution 3
I ran into this exact issue, and while Wolfgang is correct about the cause, there's some important detail missing.
This issue impacts SSL requests done with curl or libcurl, or any other software that happens to use Mozilla NSS for secure connections. Non-secure requests do not trigger the issue.
The problem does not require concurrent curl requests. The accumulation of dentries will occur as long as curl calls are frequent enough to outpace the OS's efforts to reclaim RAM.
The newer version of NSS, 3.16.0, does include a fix for this. However, you don't get the fix for free by upgrading NSS, and you don't have to upgrade all of NSS. At a minimum, you only have to upgrade nss-softokn (which has a required dependency on nss-util). To get the benefit, you also need to set the environment variable NSS_SDB_USE_CACHE for the process that is using libcurl. The presence of that environment variable is what allows the costly nonexistent-file checks to be skipped.
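As a minimal sketch of setting the variable for a single process, the Python helpers below start a command with NSS_SDB_USE_CACHE added to its environment. The value YES is the one quoted in a comment further down this page; the function names and the example URL are hypothetical. For a long-running service such as PHP-FPM the variable would have to be set in the service's own environment, and how to do that depends on the setup.

```python
import os
import subprocess


def env_with_nss_cache(base_env=None):
    """Return a copy of the environment with NSS_SDB_USE_CACHE set."""
    env = dict(os.environ if base_env is None else base_env)
    env["NSS_SDB_USE_CACHE"] = "YES"  # value as quoted in the comments
    return env


def run_with_nss_cache(command):
    """Run `command` (a list of arguments) with the variable set; return its exit code."""
    return subprocess.call(command, env=env_with_nss_cache())
```

A call such as run_with_nss_cache(["curl", "-sS", "-o", "/dev/null", "https://example.com/"]) would then run curl with the variable in place.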
FWIW, I wrote a blog entry with a little more background/details, in case anyone needs it.
Solution 4
There are numbers showing that you can expect noticeable dentry memory reclaim when vfs_cache_pressure is set way higher than 100. So 125 may be too low for that to happen in your case.
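The current setting can be read without root. The sketch below reads it and describes it relative to the kernel default of 100; the function names are made up, and changing the value (for example via sysctl) still requires root.

```python
def read_vfs_cache_pressure(path="/proc/sys/vm/vfs_cache_pressure"):
    """Return the current vm.vfs_cache_pressure value as an int."""
    with open(path) as f:
        return int(f.read().strip())


def describe_pressure(value):
    """Describe the setting relative to the kernel default of 100."""
    if value > 100:
        return "above default: kernel prefers reclaiming dentry/inode caches"
    if value < 100:
        return "below default: kernel prefers keeping dentry/inode caches"
    return "default"
```

For the value 125 from the question, describe_pressure(125) reports a setting only slightly above the default.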
Solution 5
Not really an answer to your question, but as a user of this system, the information you provided:
cat /proc/meminfo
MemTotal: 132145324 kB
...
SReclaimable: 44561644 kB
SUnreclaim: 1678736 kB
is enough to tell me that this is not your problem, and it's the responsibility of the sysadmin to provide an adequate explanation.
I don't want to sound too rude here, but:
- You lack specific information on the role of this host.
- How the host is supposed to prioritize resources is out of your scope.
- You are not familiar with, and had no part in, the design and deployment of the storage on this host.
- You are unable to offer certain system output as you are not root.
It is your sysadmin's responsibility to justify or resolve the slab allocation anomaly. Either you haven't given us a complete picture of the whole saga that led you up to this (which frankly I am not interested in) or your sysadmin is behaving irresponsibly and/or incompetently in the way he considers handling this problem.
Feel free to tell him some random stranger on the internet thinks he isn't taking his responsibilities seriously.
Wolfgang Stengel
Updated on September 18, 2022
Comments
-
Wolfgang Stengel over 1 year
Problem
A CentOS machine with kernel 2.6.32 and 128 GB physical RAM ran into trouble a few days ago. The responsible system administrator tells me that the PHP-FPM application was not responding to requests in a timely manner anymore due to swapping, and having seen in free that almost no memory was left, he chose to reboot the machine.
I know that free memory can be a confusing concept on Linux and a reboot perhaps was the wrong thing to do. However, the mentioned administrator blames the PHP application (which I am responsible for) and refuses to investigate further.
What I could find out on my own is this:
- Before the restart, the free memory (incl. buffers and cache) was only a couple of hundred MB.
- Before the restart, /proc/meminfo reported a Slab memory usage of around 90 GB (yes, GB).
- After the restart, the free memory was 119 GB, going down to around 100 GB within an hour, as the PHP-FPM workers (about 600 of them) were coming back to life, each of them showing between 30 and 40 MB in the RES column in top (which has been this way for months and is perfectly reasonable given the nature of the PHP application). There is nothing else in the process list that consumes an unusual or noteworthy amount of RAM.
- After the restart, Slab memory was around 300 MB.
I have been monitoring the system ever since, and most notably the Slab memory is increasing in a straight line at a rate of about 5 GB per day. Free memory as reported by free and /proc/meminfo decreases at the same rate. Slab is currently at 46 GB. According to slabtop, most of it is used for dentry entries:
Free memory:
free -m
             total       used       free     shared    buffers     cached
Mem:        129048      76435      52612          0        144       7675
-/+ buffers/cache:      68615      60432
Swap:         8191          0       8191
Meminfo:
cat /proc/meminfo
MemTotal: 132145324 kB
MemFree: 53620068 kB
Buffers: 147760 kB
Cached: 8239072 kB
SwapCached: 0 kB
Active: 20300940 kB
Inactive: 6512716 kB
Active(anon): 18408460 kB
Inactive(anon): 24736 kB
Active(file): 1892480 kB
Inactive(file): 6487980 kB
Unevictable: 8608 kB
Mlocked: 8608 kB
SwapTotal: 8388600 kB
SwapFree: 8388600 kB
Dirty: 11416 kB
Writeback: 0 kB
AnonPages: 18436224 kB
Mapped: 94536 kB
Shmem: 6364 kB
Slab: 46240380 kB
SReclaimable: 44561644 kB
SUnreclaim: 1678736 kB
KernelStack: 9336 kB
PageTables: 457516 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 72364108 kB
Committed_AS: 22305444 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 480164 kB
VmallocChunk: 34290830848 kB
HardwareCorrupted: 0 kB
AnonHugePages: 12216320 kB
HugePages_Total: 2048
HugePages_Free: 2048
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 5604 kB
DirectMap2M: 2078720 kB
DirectMap1G: 132120576 kB
Slabtop:
slabtop --once
Active / Total Objects (% used)    : 225920064 / 226193412 (99.9%)
Active / Total Slabs (% used)      : 11556364 / 11556415 (100.0%)
Active / Total Caches (% used)     : 110 / 194 (56.7%)
Active / Total Size (% used)       : 43278793.73K / 43315465.42K (99.9%)
Minimum / Average / Maximum Object : 0.02K / 0.19K / 4096.00K

     OBJS    ACTIVE  USE OBJ SIZE    SLABS OBJ/SLAB CACHE SIZE NAME
221416340 221416039   3%    0.19K 11070817       20  44283268K dentry
  1123443   1122739  99%    0.41K   124827        9    499308K fuse_request
  1122320   1122180  99%    0.75K   224464        5    897856K fuse_inode
   761539    754272  99%    0.20K    40081       19    160324K vm_area_struct
   437858    223259  50%    0.10K    11834       37     47336K buffer_head
   353353    347519  98%    0.05K     4589       77     18356K anon_vma_chain
   325090    324190  99%    0.06K     5510       59     22040K size-64
   146272    145422  99%    0.03K     1306      112      5224K size-32
   137625    137614  99%    1.02K    45875        3    183500K nfs_inode_cache
   128800    118407  91%    0.04K     1400       92      5600K anon_vma
    59101     46853  79%    0.55K     8443        7     33772K radix_tree_node
    52620     52009  98%    0.12K     1754       30      7016K size-128
    19359     19253  99%    0.14K      717       27      2868K sysfs_dir_cache
    10240      7746  75%    0.19K      512       20      2048K filp
VFS cache pressure:
cat /proc/sys/vm/vfs_cache_pressure
125
Swappiness:
cat /proc/sys/vm/swappiness
0
I know that unused memory is wasted memory, so this should not necessarily be a bad thing (especially given that 44 GB are shown as SReclaimable). However, apparently the machine experienced problems nonetheless, and I'm afraid the same will happen again in a few days when Slab surpasses 90 GB.
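As a back-of-the-envelope check on that timeline, the figures above can be extrapolated linearly. This is only arithmetic under the assumption that growth stays constant at about 5 GB per day; the function name is made up.

```python
def days_until(current_gb, limit_gb, growth_gb_per_day):
    """Days until `limit_gb` is reached, assuming constant linear growth."""
    if growth_gb_per_day <= 0:
        raise ValueError("growth rate must be positive")
    return max(0.0, float(limit_gb - current_gb) / growth_gb_per_day)


# Figures from the question: Slab at 46 GB, growing about 5 GB per day,
# with trouble observed at around 90 GB.
print(days_until(46, 90, 5))  # 8.8
```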
Questions
I have these questions:
- Am I correct in thinking that the Slab memory is always physical RAM, and the number is already subtracted from the MemFree value?
- Is such a high number of dentry entries normal? The PHP application has access to around 1.5 M files, however most of them are archives and not being accessed at all for regular web traffic.
- What could be an explanation for the fact that the number of cached inodes is much lower than the number of cached dentries, should they not be related somehow?
- If the system runs into memory trouble, should the kernel not free some of the dentries automatically? What could be a reason that this does not happen?
- Is there any way to "look into" the dentry cache to see what all this memory is (i.e. what are the paths that are being cached)? Perhaps this points to some kind of memory leak, symlink loop, or indeed to something the PHP application is doing wrong.
- The PHP application code as well as all asset files are mounted via GlusterFS network file system, could that have something to do with it?
Please keep in mind that I cannot investigate as root, only as a regular user, and that the administrator refuses to help. He won't even run the typical echo 2 > /proc/sys/vm/drop_caches test to see if the Slab memory is indeed reclaimable.
Any insights into what could be going on and how I can investigate any further would be greatly appreciated.
Updates
Some further diagnostic information:
Mounts:
cat /proc/self/mounts
rootfs / rootfs rw 0 0
proc /proc proc rw,relatime 0 0
sysfs /sys sysfs rw,relatime 0 0
devtmpfs /dev devtmpfs rw,relatime,size=66063000k,nr_inodes=16515750,mode=755 0 0
devpts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /dev/shm tmpfs rw,relatime 0 0
/dev/mapper/sysvg-lv_root / ext4 rw,relatime,barrier=1,data=ordered 0 0
/proc/bus/usb /proc/bus/usb usbfs rw,relatime 0 0
/dev/sda1 /boot ext4 rw,relatime,barrier=1,data=ordered 0 0
tmpfs /phptmp tmpfs rw,noatime,size=1048576k,nr_inodes=15728640,mode=777 0 0
tmpfs /wsdltmp tmpfs rw,noatime,size=1048576k,nr_inodes=15728640,mode=777 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
cgroup /cgroup/cpuset cgroup rw,relatime,cpuset 0 0
cgroup /cgroup/cpu cgroup rw,relatime,cpu 0 0
cgroup /cgroup/cpuacct cgroup rw,relatime,cpuacct 0 0
cgroup /cgroup/memory cgroup rw,relatime,memory 0 0
cgroup /cgroup/devices cgroup rw,relatime,devices 0 0
cgroup /cgroup/freezer cgroup rw,relatime,freezer 0 0
cgroup /cgroup/net_cls cgroup rw,relatime,net_cls 0 0
cgroup /cgroup/blkio cgroup rw,relatime,blkio 0 0
/etc/glusterfs/glusterfs-www.vol /var/www fuse.glusterfs rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072 0 0
/etc/glusterfs/glusterfs-upload.vol /var/upload fuse.glusterfs rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
172.17.39.78:/www /data/www nfs rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,port=38467,timeo=600,retrans=2,sec=sys,mountaddr=172.17.39.78,mountvers=3,mountport=38465,mountproto=tcp,local_lock=none,addr=172.17.39.78 0 0
Mount info:
cat /proc/self/mountinfo
16 21 0:3 / /proc rw,relatime - proc proc rw
17 21 0:0 / /sys rw,relatime - sysfs sysfs rw
18 21 0:5 / /dev rw,relatime - devtmpfs devtmpfs rw,size=66063000k,nr_inodes=16515750,mode=755
19 18 0:11 / /dev/pts rw,relatime - devpts devpts rw,gid=5,mode=620,ptmxmode=000
20 18 0:16 / /dev/shm rw,relatime - tmpfs tmpfs rw
21 1 253:1 / / rw,relatime - ext4 /dev/mapper/sysvg-lv_root rw,barrier=1,data=ordered
22 16 0:15 / /proc/bus/usb rw,relatime - usbfs /proc/bus/usb rw
23 21 8:1 / /boot rw,relatime - ext4 /dev/sda1 rw,barrier=1,data=ordered
24 21 0:17 / /phptmp rw,noatime - tmpfs tmpfs rw,size=1048576k,nr_inodes=15728640,mode=777
25 21 0:18 / /wsdltmp rw,noatime - tmpfs tmpfs rw,size=1048576k,nr_inodes=15728640,mode=777
26 16 0:19 / /proc/sys/fs/binfmt_misc rw,relatime - binfmt_misc none rw
27 21 0:20 / /cgroup/cpuset rw,relatime - cgroup cgroup rw,cpuset
28 21 0:21 / /cgroup/cpu rw,relatime - cgroup cgroup rw,cpu
29 21 0:22 / /cgroup/cpuacct rw,relatime - cgroup cgroup rw,cpuacct
30 21 0:23 / /cgroup/memory rw,relatime - cgroup cgroup rw,memory
31 21 0:24 / /cgroup/devices rw,relatime - cgroup cgroup rw,devices
32 21 0:25 / /cgroup/freezer rw,relatime - cgroup cgroup rw,freezer
33 21 0:26 / /cgroup/net_cls rw,relatime - cgroup cgroup rw,net_cls
34 21 0:27 / /cgroup/blkio rw,relatime - cgroup cgroup rw,blkio
35 21 0:28 / /var/www rw,relatime - fuse.glusterfs /etc/glusterfs/glusterfs-www.vol rw,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072
36 21 0:29 / /var/upload rw,relatime - fuse.glusterfs /etc/glusterfs/glusterfs-upload.vol rw,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072
37 21 0:30 / /var/lib/nfs/rpc_pipefs rw,relatime - rpc_pipefs sunrpc rw
39 21 0:31 / /data/www rw,relatime - nfs 172.17.39.78:/www rw,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,port=38467,timeo=600,retrans=2,sec=sys,mountaddr=172.17.39.78,mountvers=3,mountport=38465,mountproto=tcp,local_lock=none,addr=172.17.39.78
GlusterFS config:
cat /etc/glusterfs/glusterfs-www.vol
volume remote1
  type protocol/client
  option transport-type tcp
  option remote-host 172.17.39.71
  option ping-timeout 10
  option transport.socket.nodelay on # undocumented option for speed
  # http://gluster.org/pipermail/gluster-users/2009-September/003158.html
  option remote-subvolume /data/www
end-volume

volume remote2
  type protocol/client
  option transport-type tcp
  option remote-host 172.17.39.72
  option ping-timeout 10
  option transport.socket.nodelay on # undocumented option for speed
  # http://gluster.org/pipermail/gluster-users/2009-September/003158.html
  option remote-subvolume /data/www
end-volume

volume remote3
  type protocol/client
  option transport-type tcp
  option remote-host 172.17.39.73
  option ping-timeout 10
  option transport.socket.nodelay on # undocumented option for speed
  # http://gluster.org/pipermail/gluster-users/2009-September/003158.html
  option remote-subvolume /data/www
end-volume

volume remote4
  type protocol/client
  option transport-type tcp
  option remote-host 172.17.39.74
  option ping-timeout 10
  option transport.socket.nodelay on # undocumented option for speed
  # http://gluster.org/pipermail/gluster-users/2009-September/003158.html
  option remote-subvolume /data/www
end-volume

volume replicate1
  type cluster/replicate
  option lookup-unhashed off # off will reduce cpu usage, and network
  option local-volume-name 'hostname'
  subvolumes remote1 remote2
end-volume

volume replicate2
  type cluster/replicate
  option lookup-unhashed off # off will reduce cpu usage, and network
  option local-volume-name 'hostname'
  subvolumes remote3 remote4
end-volume

volume distribute
  type cluster/distribute
  subvolumes replicate1 replicate2
end-volume

volume iocache
  type performance/io-cache
  option cache-size 8192MB # default is 32MB
  subvolumes distribute
end-volume

volume writeback
  type performance/write-behind
  option cache-size 1024MB
  option window-size 1MB
  subvolumes iocache
end-volume

### Add io-threads for parallel requisitions
volume iothreads
  type performance/io-threads
  option thread-count 64 # default is 16
  subvolumes writeback
end-volume

volume ra
  type performance/read-ahead
  option page-size 2MB
  option page-count 16
  option force-atime-update no
  subvolumes iothreads
end-volume
-
Matthew Ife over 10 years
Please provide the output of cat /proc/self/mounts and (maybe quite long) cat /proc/self/mountinfo.
-
Wolfgang Stengel over 10 years
@MIfe I've updated the question, both outputs are appended.
-
Matthew Ife over 10 years
My feeling here is that it's probably to do with NFS dentry caching. Out of interest, can you run cat /etc/nfsmount.conf? Also, do you have any directories that contain many files in their immediate directory?
-
poige over 10 years
Well, since vfs_cache_pressure > 100, the kernel should prefer to reclaim dentry cache memory. This can easily be a bug; 2.6.32 is a rather old kernel, even with Red Hat backport patches. BTW, what is the exact kernel version?
-
Matthew Ife over 10 years
Oh, and you also have a gluster volume. What is that used for, and how are the bricks set up? cat /etc/glusterfs/glusterfs-www.vol
-
Wolfgang Stengel over 10 years
@poige It's 2.6.32-431.el6.x86_64. I've been googling around for this problem, though; some people claim this is a bug, but it apparently always turned out not to be.
-
Matthew Ife over 10 years
The top three slab lines being to do with fuse might indicate it's to do with gluster. Do you know if this host is used in the gluster system as a storage node at all? gluster.org/community/documentation/index.php/… this link suggests servers can become incredibly dentry heavy, especially when handling large portions of small files. Check ps -Ao args | grep glust for any gluster storage/client instances and their arguments.
-
ewwhite over 10 years
(Your sysadmin sounds terrible. It gives us a bad name)
-
Wolfgang Stengel over 10 years
For anyone that is still interested, the problem has been solved. Check out my answer below. Thanks everyone for taking the time to help me.
-
Wolfgang Stengel over 10 years
Thank you for answering my questions individually. The cache pressure was finally increased further and the dentry cache increase stopped.
-
Wolfgang Stengel over 10 years
I could not track down the responsible program yet. If I find out more, I'll report back for anyone else having this problem.
-
Strahinja Kustudic over 9 years
Thanks for a nice blog post, but I would like to mention that nss-softokn has still not been updated to version 3.16 on CentOS/RHEL. It will probably be fixed in version 6.6.
-
J. Paulding over 9 years
Thanks for the note. Perhaps Amazon got out ahead of this one (maybe even at our request?) for their managed repos. On older versions (3.14-3.15), you still get half the benefit by setting the appropriate environment variables. If you have the know-how, you might be able to build v3.16 directly. Otherwise, increasing the cache pressure and taking the associated CPU hit might be your best bet for reliable performance.
-
Strahinja Kustudic over 9 years
This is fixed in CentOS 6.6 with nss-softokn-3.14.3-17
-
Steve Kehlet over 9 years
Just to be clear for people looking for a quick fix: you have to both update the nss-softokn RPM AND set the NSS_SDB_USE_CACHE=YES env var to have curl https calls stop flooding your dentry cache.
-
Some Linux Nerd almost 9 years
Thanks! A big directory (0.25 million files) was totally the cause of the problem in my case; any time something interacted with it, 2 GB of RAM would disappear into the cache.
-
Mikko Rantalainen over 6 years
From all I have read, increasing vfs_cache_pressure above 100 only makes sense if you do not have enough RAM for your workload. In that case, having a value way above 100 (e.g. 10000) will free some RAM. That will result in worse IO overall, though.
-
Mikko Rantalainen over 3 years
Note that libnss was fixed to not issue lots of access() calls with nonexistent filenames. However, I think any other program issuing lots of access() calls with nonexistent filenames may still cause the same problem if one uses affected kernel versions.
-
Mikko Rantalainen over 3 years
Do you have the blog entry available somewhere? The URL linked in the answer doesn't seem to work and the Wayback Machine doesn't have a copy.
-
J. Paulding over 3 years
Sorry, Mikko. That was on my old company's website. The company was purchased by Knetik several years ago, and at one point, the blog was still available at tools.knetik.io/blog/2014-05-16-optimizing-aws-nss-softoken, but even that seems gone now ... However, Steve's note above is valid if you are on an older version of nss-softokn, and it is also asserted above that 3.17 resolves the issue, although I don't personally know if you would need to set the env variable with that version; you may have to check the release notes.