Unusually high dentry cache usage

Solution 1

Am I correct in thinking that the Slab memory is always physical RAM, and the number is already subtracted from the MemFree value?

Yes.
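
As a quick sanity check of that accounting, here is a minimal sketch that parses /proc/meminfo-style text and confirms that Slab is the sum of SReclaimable and SUnreclaim, then reports Slab as a share of MemTotal. The sample values are copied from the question; the function names are made up for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB or counts)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def slab_summary(meminfo):
    """Return (slab_kb, reclaimable_kb, unreclaimable_kb, slab_share_of_total)."""
    slab = meminfo["Slab"]
    return (slab, meminfo["SReclaimable"], meminfo["SUnreclaim"],
            slab / meminfo["MemTotal"])


# Sample values copied from the question's /proc/meminfo output.
SAMPLE = """\
MemTotal:       132145324 kB
MemFree:        53620068 kB
Slab:           46240380 kB
SReclaimable:   44561644 kB
SUnreclaim:      1678736 kB
"""

if __name__ == "__main__":
    slab, reclaimable, unreclaimable, share = slab_summary(parse_meminfo(SAMPLE))
    print(f"Slab: {slab} kB ({share:.1%} of MemTotal)")
    print(f"SReclaimable + SUnreclaim = {reclaimable + unreclaimable} kB")
```

On a live system you could pass open("/proc/meminfo").read() instead of SAMPLE; reading that file does not require root.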

Is such a high number of dentry entries normal? The PHP application has access to around 1.5 M files, however most of them are archives and not being accessed at all for regular web traffic.

Yes, if the system isn't under memory pressure. It has to use the memory for something, and it's possible that in your particular pattern of usage, this is the best way to use that memory.

What could be an explanation for the fact that the number of cached inodes is much lower than the number of cached dentries, should they not be related somehow?

Lots of directory operations would be the most likely explanation.

If the system runs into memory trouble, should the kernel not free some of the dentries automatically? What could be a reason that this does not happen?

It should, and I can't think of any reason it wouldn't. I'm not convinced that this is what actually went wrong. I'd strongly suggest upgrading your kernel or increasing vfs_cache_pressure further.

Is there any way to "look into" the dentry cache to see what all this memory is (i.e. what are the paths that are being cached)? Perhaps this points to some kind of memory leak, symlink loop, or indeed to something the PHP application is doing wrong.

I don't believe there is. I'd look for any directories with absurdly large numbers of entries or very deep directory structures that are searched or traversed.
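
One way to hunt for such directories without root is sketched below. It walks a tree and lists the directories with the most immediate entries. The function name and the default of ten results are my own choices, not anything from the original setup, and the walk itself touches every directory under the root, so it is probably best run sparingly on a large network mount.

```python
import heapq
import os


def largest_directories(root, top_n=10):
    """Return up to top_n (entry_count, path) pairs for the fullest directories under root."""
    counts = []
    # os.walk does not descend into symlinked directories by default,
    # so a symlink loop will not trap it.
    for dirpath, dirnames, filenames in os.walk(root):
        counts.append((len(dirnames) + len(filenames), dirpath))
    return heapq.nlargest(top_n, counts)


if __name__ == "__main__":
    import sys
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for count, path in largest_directories(root):
        print(f"{count:10d}  {path}")
```

Directories the current user cannot read are silently skipped, so the result is only a lower bound when run as a regular user.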

The PHP application code as well as all asset files are mounted via GlusterFS network file system, could that have something to do with it?

Definitely it could be a filesystem issue. A filesystem bug causing dentries not to be released, for example, is a possibility.

Solution 2

Confirmed Solution

For anyone who might run into the same problem: the data center guys finally figured it out today. The culprit was an NSS (Network Security Services) library bundled with libcurl. An upgrade to the newest version solved the problem.

A bug report that describes the details is here:

https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=1044666

Apparently, in order to determine whether some path is local or on a network drive, NSS was looking up a nonexistent file and measuring the time it took for the file system to report back! If you have a large enough number of curl requests and enough memory, these failed lookups are all cached as dentries and stack up.
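
To see the effect for yourself, the sketch below roughly imitates that pattern: it checks a batch of unique nonexistent file names and reads the kernel's dentry counters from /proc/sys/fs/dentry-state before and after. This is only an approximation of what NSS did, the helper names are hypothetical, and whether the counters visibly move depends on the kernel version and on the filesystem behind the temporary directory.

```python
import os
import tempfile
import uuid


def parse_dentry_state(text):
    """Return (nr_dentry, nr_unused), the first two fields of dentry-state text."""
    fields = text.split()
    return int(fields[0]), int(fields[1])


def read_dentry_state(path="/proc/sys/fs/dentry-state"):
    """Read the kernel's dentry counters, or return None where the file is unavailable."""
    try:
        with open(path) as handle:
            return parse_dentry_state(handle.read())
    except OSError:
        return None


def probe_missing_files(directory, count):
    """Check `count` unique nonexistent names in `directory`; return how many were missing."""
    missing = 0
    for _ in range(count):
        name = os.path.join(directory, f"no-such-file-{uuid.uuid4().hex}")
        if not os.access(name, os.F_OK):
            missing += 1
    return missing


if __name__ == "__main__":
    before = read_dentry_state()
    with tempfile.TemporaryDirectory() as scratch:
        probe_missing_files(scratch, 10000)
        # Read the counters before the scratch directory is removed.
        after = read_dentry_state()
    print("dentry-state before:", before)
    print("dentry-state after: ", after)
```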

Solution 3

I ran into this exact issue, and while Wolfgang is correct about the cause, there are some important details missing.

  • This issue impacts SSL requests done with curl or libcurl, or any other software that happens to use Mozilla NSS for secure connections. Non-secure requests do not trigger the issue.

  • The problem does not require concurrent curl requests. Dentries will accumulate as long as curl calls are frequent enough to outpace the OS's efforts to reclaim RAM.

  • The newer version of NSS, 3.16.0, does include a fix for this. However, you don't get the fix for free by upgrading NSS, and you don't have to upgrade all of NSS. At a minimum, you only have to upgrade nss-softokn (which has a required dependency on nss-util). To get the benefit, you also need to set the environment variable NSS_SDB_USE_CACHE for the process that is using libcurl. The presence of that environment variable is what allows the costly nonexistent-file checks to be skipped.
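
For illustration, here is a minimal sketch of launching a child process with that variable set. The value YES is taken from a comment in this thread, the helper names are hypothetical, and the child used here is only a stand-in that echoes the variable. In a real deployment the variable would have to reach whichever process loads libcurl (PHP-FPM in the question), and how to do that depends on how the service is started.

```python
import os
import subprocess
import sys


def env_with_nss_cache(base_env=None, value="YES"):
    """Return a copy of the environment with NSS_SDB_USE_CACHE set."""
    env = dict(os.environ if base_env is None else base_env)
    env["NSS_SDB_USE_CACHE"] = value
    return env


def run_with_nss_cache(command):
    """Run `command` with NSS_SDB_USE_CACHE set; return the completed process."""
    return subprocess.run(command, env=env_with_nss_cache(),
                          capture_output=True, text=True, check=False)


if __name__ == "__main__":
    # Stand-in child that only echoes the variable; a real deployment would
    # launch the libcurl-using program instead.
    child = [sys.executable, "-c",
             "import os; print(os.environ.get('NSS_SDB_USE_CACHE'))"]
    print(run_with_nss_cache(child).stdout.strip())
```

Copying the environment rather than mutating os.environ keeps the setting scoped to the child process.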

FWIW, I wrote a blog entry with a little more background/details, in case anyone needs it.

Solution 4

See https://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.7/2.6.7-mm1/broken-out/vfs-shrinkage-tuning.patch

There are numbers there showing that you can expect some noticeable dentry memory reclaim when vfs_cache_pressure is set well above 100. So 125 may be too low for that to happen in your case.
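
A small sketch for checking the current setting as a regular user follows. The wording of the descriptions paraphrases the kernel's sysctl documentation, and the function names are my own. Changing the value still requires root.

```python
def read_vfs_cache_pressure(path="/proc/sys/vm/vfs_cache_pressure"):
    """Return the current vfs_cache_pressure as an int, or None if it cannot be read."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def describe_pressure(value):
    """Summarise how a vfs_cache_pressure value biases dentry/inode reclaim."""
    if value is None:
        return "unknown"
    if value == 0:
        return "never reclaims dentries and inodes due to memory pressure"
    if value < 100:
        return "prefers to retain dentry and inode caches"
    if value == 100:
        return "default: reclaims at a fair rate relative to page cache"
    return "prefers to reclaim dentry and inode caches"


if __name__ == "__main__":
    current = read_vfs_cache_pressure()
    print(f"vfs_cache_pressure = {current}: {describe_pressure(current)}")
```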

Solution 5

Not really an answer to your question, but as a user of this system, this information you provided:

cat /proc/meminfo
MemTotal:       132145324 kB
...
SReclaimable:   44561644 kB
SUnreclaim:      1678736 kB

is enough to tell me that this is not your problem, and it's the responsibility of the sysadmin to provide an adequate explanation.

I don't want to sound too rude here, but:

  • You lack specific information on the role of this host.
  • How the host is supposed to prioritize resources is out of your scope.
  • You are not familiar with, and had no part in, the design and deployment of the storage on this host.
  • You are unable to offer certain system output as you are not root.

It is your sysadmin's responsibility to justify or resolve the slab allocation anomaly. Either you haven't given us a complete picture of the whole saga that led you up to this (which frankly I am not interested in) or your sysadmin is behaving irresponsibly and/or incompetently in the way he considers handling this problem.

Feel free to tell him some random stranger on the internet thinks he isn't taking his responsibilities seriously.

Author: Wolfgang Stengel

Updated on September 18, 2022

Comments

  • Wolfgang Stengel
    Wolfgang Stengel over 1 year

    Problem

    A CentOS machine with kernel 2.6.32 and 128 GB physical RAM ran into trouble a few days ago. The responsible system administrator tells me that the PHP-FPM application was not responding to requests in a timely manner anymore due to swapping, and having seen in free that almost no memory was left, he chose to reboot the machine.

    I know that free memory can be a confusing concept on Linux and a reboot perhaps was the wrong thing to do. However, the mentioned administrator blames the PHP application (which I am responsible for) and refuses to investigate further.

    What I could find out on my own is this:

    • Before the restart, the free memory (incl. buffers and cache) was only a couple of hundred MB.
    • Before the restart, /proc/meminfo reported a Slab memory usage of around 90 GB (yes, GB).
    • After the restart, the free memory was 119 GB, going down to around 100 GB within an hour, as the PHP-FPM workers (about 600 of them) were coming back to life, each of them showing between 30 and 40 MB in the RES column in top (which has been this way for months and is perfectly reasonable given the nature of the PHP application). There is nothing else in the process list that consumes an unusual or noteworthy amount of RAM.
    • After the restart, Slab memory was around 300 MB

    I have been monitoring the system ever since, and most notably the Slab memory is increasing in a straight line at a rate of about 5 GB per day. Free memory as reported by free and /proc/meminfo decreases at the same rate. Slab is currently at 46 GB. According to slabtop most of it is used for dentry entries:

    Free memory:

    free -m
                 total       used       free     shared    buffers     cached
    Mem:        129048      76435      52612          0        144       7675
    -/+ buffers/cache:      68615      60432
    Swap:         8191          0       8191
    

    Meminfo:

    cat /proc/meminfo
    MemTotal:       132145324 kB
    MemFree:        53620068 kB
    Buffers:          147760 kB
    Cached:          8239072 kB
    SwapCached:            0 kB
    Active:         20300940 kB
    Inactive:        6512716 kB
    Active(anon):   18408460 kB
    Inactive(anon):    24736 kB
    Active(file):    1892480 kB
    Inactive(file):  6487980 kB
    Unevictable:        8608 kB
    Mlocked:            8608 kB
    SwapTotal:       8388600 kB
    SwapFree:        8388600 kB
    Dirty:             11416 kB
    Writeback:             0 kB
    AnonPages:      18436224 kB
    Mapped:            94536 kB
    Shmem:              6364 kB
    Slab:           46240380 kB
    SReclaimable:   44561644 kB
    SUnreclaim:      1678736 kB
    KernelStack:        9336 kB
    PageTables:       457516 kB
    NFS_Unstable:          0 kB
    Bounce:                0 kB
    WritebackTmp:          0 kB
    CommitLimit:    72364108 kB
    Committed_AS:   22305444 kB
    VmallocTotal:   34359738367 kB
    VmallocUsed:      480164 kB
    VmallocChunk:   34290830848 kB
    HardwareCorrupted:     0 kB
    AnonHugePages:  12216320 kB
    HugePages_Total:    2048
    HugePages_Free:     2048
    HugePages_Rsvd:        0
    HugePages_Surp:        0
    Hugepagesize:       2048 kB
    DirectMap4k:        5604 kB
    DirectMap2M:     2078720 kB
    DirectMap1G:    132120576 kB
    

    Slabtop:

    slabtop --once
    Active / Total Objects (% used)    : 225920064 / 226193412 (99.9%)
     Active / Total Slabs (% used)      : 11556364 / 11556415 (100.0%)
     Active / Total Caches (% used)     : 110 / 194 (56.7%)
     Active / Total Size (% used)       : 43278793.73K / 43315465.42K (99.9%)
     Minimum / Average / Maximum Object : 0.02K / 0.19K / 4096.00K
    
      OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
    221416340 221416039   3%    0.19K 11070817       20  44283268K dentry                 
    1123443 1122739  99%    0.41K 124827        9    499308K fuse_request           
    1122320 1122180  99%    0.75K 224464        5    897856K fuse_inode             
    761539 754272  99%    0.20K  40081       19    160324K vm_area_struct         
    437858 223259  50%    0.10K  11834       37     47336K buffer_head            
    353353 347519  98%    0.05K   4589       77     18356K anon_vma_chain         
    325090 324190  99%    0.06K   5510       59     22040K size-64                
    146272 145422  99%    0.03K   1306      112      5224K size-32                
    137625 137614  99%    1.02K  45875        3    183500K nfs_inode_cache        
    128800 118407  91%    0.04K   1400       92      5600K anon_vma               
     59101  46853  79%    0.55K   8443        7     33772K radix_tree_node        
     52620  52009  98%    0.12K   1754       30      7016K size-128               
     19359  19253  99%    0.14K    717       27      2868K sysfs_dir_cache        
     10240   7746  75%    0.19K    512       20      2048K filp  
    

    VFS cache pressure:

    cat /proc/sys/vm/vfs_cache_pressure
    125
    

    Swappiness:

    cat /proc/sys/vm/swappiness
    0
    

    I know that unused memory is wasted memory, so this should not necessarily be a bad thing (especially given that 44 GB are shown as SReclaimable). However, apparently the machine experienced problems nonetheless, and I'm afraid the same will happen again in a few days when Slab surpasses 90 GB.
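
As a rough projection from the figures above, the sketch below estimates how long it takes Slab to reach 90 GB at a constant 5 GB per day, and cross-checks the dentry row of the slabtop output. The one-page-per-slab, 4 kB page geometry is an assumption inferred from the slabtop numbers, and the function names are illustrative.

```python
def days_until(current_gb, limit_gb, growth_gb_per_day):
    """Days until current_gb reaches limit_gb at a constant daily growth rate."""
    if growth_gb_per_day <= 0:
        raise ValueError("growth rate must be positive")
    return max(0.0, (limit_gb - current_gb) / growth_gb_per_day)


def slab_cache_kb(slabs, pages_per_slab=1, page_kb=4):
    """Memory held by a slab cache, in kB (the page geometry is an assumption)."""
    return slabs * pages_per_slab * page_kb


if __name__ == "__main__":
    # Figures from the question: Slab at 46 GB, growing about 5 GB per day,
    # with trouble previously seen at around 90 GB.
    print(f"days until 90 GB: {days_until(46, 90, 5):.1f}")
    # dentry row from slabtop: 11070817 slabs, 20 objects per slab.
    print(f"dentry cache: {slab_cache_kb(11070817)} kB")
    print(f"dentry objects: {11070817 * 20}")
```

Both derived figures match the CACHE SIZE and OBJS columns that slabtop reports for dentry, which suggests the output itself is internally consistent.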

    Questions

    I have these questions:

    • Am I correct in thinking that the Slab memory is always physical RAM, and the number is already subtracted from the MemFree value?
    • Is such a high number of dentry entries normal? The PHP application has access to around 1.5 M files, however most of them are archives and not being accessed at all for regular web traffic.
    • What could be an explanation for the fact that the number of cached inodes is much lower than the number of cached dentries, should they not be related somehow?
    • If the system runs into memory trouble, should the kernel not free some of the dentries automatically? What could be a reason that this does not happen?
    • Is there any way to "look into" the dentry cache to see what all this memory is (i.e. what are the paths that are being cached)? Perhaps this points to some kind of memory leak, symlink loop, or indeed to something the PHP application is doing wrong.
    • The PHP application code as well as all asset files are mounted via GlusterFS network file system, could that have something to do with it?

    Please keep in mind that I cannot investigate as root, only as a regular user, and that the administrator refuses to help. He won't even run the typical echo 2 > /proc/sys/vm/drop_caches test to see if the Slab memory is indeed reclaimable.

    Any insights into what could be going on and how I can investigate any further would be greatly appreciated.

    Updates

    Some further diagnostic information:

    Mounts:

    cat /proc/self/mounts
    rootfs / rootfs rw 0 0
    proc /proc proc rw,relatime 0 0
    sysfs /sys sysfs rw,relatime 0 0
    devtmpfs /dev devtmpfs rw,relatime,size=66063000k,nr_inodes=16515750,mode=755 0 0
    devpts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0
    tmpfs /dev/shm tmpfs rw,relatime 0 0
    /dev/mapper/sysvg-lv_root / ext4 rw,relatime,barrier=1,data=ordered 0 0
    /proc/bus/usb /proc/bus/usb usbfs rw,relatime 0 0
    /dev/sda1 /boot ext4 rw,relatime,barrier=1,data=ordered 0 0
    tmpfs /phptmp tmpfs rw,noatime,size=1048576k,nr_inodes=15728640,mode=777 0 0
    tmpfs /wsdltmp tmpfs rw,noatime,size=1048576k,nr_inodes=15728640,mode=777 0 0
    none /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
    cgroup /cgroup/cpuset cgroup rw,relatime,cpuset 0 0
    cgroup /cgroup/cpu cgroup rw,relatime,cpu 0 0
    cgroup /cgroup/cpuacct cgroup rw,relatime,cpuacct 0 0
    cgroup /cgroup/memory cgroup rw,relatime,memory 0 0
    cgroup /cgroup/devices cgroup rw,relatime,devices 0 0
    cgroup /cgroup/freezer cgroup rw,relatime,freezer 0 0
    cgroup /cgroup/net_cls cgroup rw,relatime,net_cls 0 0
    cgroup /cgroup/blkio cgroup rw,relatime,blkio 0 0
    /etc/glusterfs/glusterfs-www.vol /var/www fuse.glusterfs rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072 0 0
    /etc/glusterfs/glusterfs-upload.vol /var/upload fuse.glusterfs rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072 0 0
    sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
    172.17.39.78:/www /data/www nfs rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,port=38467,timeo=600,retrans=2,sec=sys,mountaddr=172.17.39.78,mountvers=3,mountport=38465,mountproto=tcp,local_lock=none,addr=172.17.39.78 0 0
    

    Mount info:

    cat /proc/self/mountinfo
    16 21 0:3 / /proc rw,relatime - proc proc rw
    17 21 0:0 / /sys rw,relatime - sysfs sysfs rw
    18 21 0:5 / /dev rw,relatime - devtmpfs devtmpfs rw,size=66063000k,nr_inodes=16515750,mode=755
    19 18 0:11 / /dev/pts rw,relatime - devpts devpts rw,gid=5,mode=620,ptmxmode=000
    20 18 0:16 / /dev/shm rw,relatime - tmpfs tmpfs rw
    21 1 253:1 / / rw,relatime - ext4 /dev/mapper/sysvg-lv_root rw,barrier=1,data=ordered
    22 16 0:15 / /proc/bus/usb rw,relatime - usbfs /proc/bus/usb rw
    23 21 8:1 / /boot rw,relatime - ext4 /dev/sda1 rw,barrier=1,data=ordered
    24 21 0:17 / /phptmp rw,noatime - tmpfs tmpfs rw,size=1048576k,nr_inodes=15728640,mode=777
    25 21 0:18 / /wsdltmp rw,noatime - tmpfs tmpfs rw,size=1048576k,nr_inodes=15728640,mode=777
    26 16 0:19 / /proc/sys/fs/binfmt_misc rw,relatime - binfmt_misc none rw
    27 21 0:20 / /cgroup/cpuset rw,relatime - cgroup cgroup rw,cpuset
    28 21 0:21 / /cgroup/cpu rw,relatime - cgroup cgroup rw,cpu
    29 21 0:22 / /cgroup/cpuacct rw,relatime - cgroup cgroup rw,cpuacct
    30 21 0:23 / /cgroup/memory rw,relatime - cgroup cgroup rw,memory
    31 21 0:24 / /cgroup/devices rw,relatime - cgroup cgroup rw,devices
    32 21 0:25 / /cgroup/freezer rw,relatime - cgroup cgroup rw,freezer
    33 21 0:26 / /cgroup/net_cls rw,relatime - cgroup cgroup rw,net_cls
    34 21 0:27 / /cgroup/blkio rw,relatime - cgroup cgroup rw,blkio
    35 21 0:28 / /var/www rw,relatime - fuse.glusterfs /etc/glusterfs/glusterfs-www.vol rw,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072
    36 21 0:29 / /var/upload rw,relatime - fuse.glusterfs /etc/glusterfs/glusterfs-upload.vol rw,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072
    37 21 0:30 / /var/lib/nfs/rpc_pipefs rw,relatime - rpc_pipefs sunrpc rw
    39 21 0:31 / /data/www rw,relatime - nfs 172.17.39.78:/www rw,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,port=38467,timeo=600,retrans=2,sec=sys,mountaddr=172.17.39.78,mountvers=3,mountport=38465,mountproto=tcp,local_lock=none,addr=172.17.39.78
    

    GlusterFS config:

    cat /etc/glusterfs/glusterfs-www.vol
    volume remote1
      type protocol/client
      option transport-type tcp
      option remote-host 172.17.39.71
       option ping-timeout 10
       option transport.socket.nodelay on # undocumented option for speed
        # http://gluster.org/pipermail/gluster-users/2009-September/003158.html
      option remote-subvolume /data/www
    end-volume
    
    volume remote2
      type protocol/client
      option transport-type tcp
      option remote-host 172.17.39.72
       option ping-timeout 10
       option transport.socket.nodelay on # undocumented option for speed
            # http://gluster.org/pipermail/gluster-users/2009-September/003158.html
      option remote-subvolume /data/www
    end-volume
    
    volume remote3
      type protocol/client
      option transport-type tcp
      option remote-host 172.17.39.73
       option ping-timeout 10
       option transport.socket.nodelay on # undocumented option for speed
            # http://gluster.org/pipermail/gluster-users/2009-September/003158.html
      option remote-subvolume /data/www
    end-volume
    
    volume remote4
      type protocol/client
      option transport-type tcp
      option remote-host 172.17.39.74
       option ping-timeout 10
       option transport.socket.nodelay on # undocumented option for speed
            # http://gluster.org/pipermail/gluster-users/2009-September/003158.html
      option remote-subvolume /data/www
    end-volume
    
    volume replicate1
      type cluster/replicate
       option lookup-unhashed off    # off will reduce cpu usage, and network
       option local-volume-name 'hostname'
      subvolumes remote1 remote2
    end-volume
    
    volume replicate2
      type cluster/replicate
       option lookup-unhashed off    # off will reduce cpu usage, and network
       option local-volume-name 'hostname'
      subvolumes remote3 remote4
    end-volume
    
    volume distribute
      type cluster/distribute
      subvolumes replicate1 replicate2
    end-volume
    
    volume iocache
      type performance/io-cache
       option cache-size 8192MB        # default is 32MB
       subvolumes distribute
    end-volume
    
    volume writeback
      type performance/write-behind
      option cache-size 1024MB
      option window-size 1MB
      subvolumes iocache
    end-volume
    
    ### Add io-threads for parallel requisitions
    volume iothreads
      type performance/io-threads
      option thread-count 64 # default is 16
      subvolumes writeback
    end-volume
    
    volume ra
      type performance/read-ahead
      option page-size 2MB
      option page-count 16
      option force-atime-update no
      subvolumes iothreads
    end-volume
    
    • Matthew Ife
      Matthew Ife over 10 years
      Please provide the output of cat /proc/self/mounts and (maybe quite long) cat /proc/self/mountinfo.
    • Wolfgang Stengel
      Wolfgang Stengel over 10 years
      @MIfe I've updated the question, both outputs are appended.
    • Matthew Ife
      Matthew Ife over 10 years
      My feeling here is it's probably to do with NFS dentry caching. Out of interest, can you run cat /etc/nfsmount.conf? Also, do you have any directories that contain many files directly inside them?
    • poige
      poige over 10 years
      Well, since vfs_cache_pressure > 100, the kernel should prefer to reclaim dentry cache memory. This can easily be a bug; 2.6.32 is a rather old kernel, even with RedHat backport patches. BTW, what is the exact kernel version?
    • Matthew Ife
      Matthew Ife over 10 years
      Oh -- and you also have a gluster volume. What is that used for, and how are the bricks set up? cat /etc/glusterfs/glusterfs-www.vol
    • Wolfgang Stengel
      Wolfgang Stengel over 10 years
      @poige It's 2.6.32-431.el6.x86_64. I've been googling around for this problem though, some people claim this is a bug, but it apparently always turned out not to be.
    • Matthew Ife
      Matthew Ife over 10 years
      The top three slab lines being to do with fuse might indicate it's to do with gluster. Do you know if this host is used in the gluster system as a storage node at all? gluster.org/community/documentation/index.php/… this link suggests servers can become incredibly dentry heavy, especially when handling large numbers of small files. Check ps -Ao args | grep glust for any gluster storage/client instances and their arguments.
    • ewwhite
      ewwhite over 10 years
      (Your sysadmin sounds terrible. It gives us a bad name)
    • Wolfgang Stengel
      Wolfgang Stengel over 10 years
      For anyone that is still interested, the problem has been solved. Check out my answer below. Thanks everyone for taking the time to help me.
  • Wolfgang Stengel
    Wolfgang Stengel over 10 years
    Thank you for answering my questions individually. The cache pressure was finally increased further and the dentry cache increase stopped.
  • Wolfgang Stengel
    Wolfgang Stengel over 10 years
    I could not track down the responsible program yet. If I find out more, I'll report back for anyone else having this problem.
  • Strahinja Kustudic
    Strahinja Kustudic over 9 years
    Thanks for a nice blog post, but I would like to mention that nss-softokn has still not been updated to version 3.16 on CentOS/RHEL. It will probably be fixed in version 6.6.
  • J. Paulding
    J. Paulding over 9 years
    Thanks for the note. Perhaps Amazon got out ahead of this one (maybe even at our request?) for their managed repos. On older versions (3.14-3.15), you still get half the benefit by setting the appropriate environment variables. If you have the know-how, you might be able to build v3.16 directly. Otherwise, increasing the cache pressure and taking the associated CPU hit might be your best bet for reliable performance.
  • Strahinja Kustudic
    Strahinja Kustudic over 9 years
    This is fixed in CentOS 6.6 with nss-softokn-3.14.3-17
  • Steve Kehlet
    Steve Kehlet over 9 years
    Just to be clear for people looking for a quick fix: you have to both update the nss-softokn RPM AND set the NSS_SDB_USE_CACHE=YES env var to have curl https calls stop flooding your dentry cache.
  • Some Linux Nerd
    Some Linux Nerd almost 9 years
    Thanks! Big directory (0.25 mil files) was totally the cause of the problem in my case, anytime something interacted with it 2GB of ram would disappear into the cache.
  • Mikko Rantalainen
    Mikko Rantalainen over 6 years
    From all I have read, increasing vfs_cache_pressure above 100 only makes sense if you do not have enough RAM for your workload. In that case, having a value way above 100 (e.g. 10000) will free some RAM. That will result in worse IO overall, though.
  • Mikko Rantalainen
    Mikko Rantalainen over 3 years
    Note that libnss was fixed to not issue lots of access() calls with nonexistent filenames. However, I think any other program issuing lots of access() calls with nonexistent filenames may still cause the same problem if one uses affected kernel versions.
  • Mikko Rantalainen
    Mikko Rantalainen over 3 years
    Do you have the blog entry available somewhere? The URL linked in the answer doesn't seem to work and Wayback machine doesn't have a copy.
  • J. Paulding
    J. Paulding over 3 years
    Sorry, Mikko. That was on my old company's web site. The company was purchased by Knetik several years ago, and at one point, the blog was still available at tools.knetik.io/blog/2014-05-16-optimizing-aws-nss-softoken, but even that seems gone now ... However, Steve's note above is valid if you are on an older version of nss-softokn --- and it is also asserted above that 3.17 resolves the issue, although I don't personally know if you would need to set the env variable w/ that version - you may have to check the release notes