How do we find out what is causing high cache usage on our server?


Solution 1

I see bpbkar is also active; I would look at that process. It is part of Symantec NetBackup, so check whether a backup is running at the time of the problems. Disable it and see if the problem occurs again when no backup is scheduled.

If bpbkar is the guilty process, enable full logging on it to find out where it is causing this problem. Also make sure you have the latest updates installed for it, as they fix a long list of known problems.
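
If you want to confirm the overlap first, a rough sketch along the following lines can help. The /usr/openv paths and the VERBOSE setting are my assumptions about a default NetBackup client install, so verify them against your own installation and documentation.

    # Is a bpbkar backup stream running right now, and how much memory does it hold?
    ps -eo pid,etime,rss,args | grep '[b]pbkar'

    # Legacy NetBackup client logging: bpbkar only writes a log if its log
    # directory exists (path assumes a default install under /usr/openv).
    mkdir -p /usr/openv/netbackup/logs/bpbkar

    # Raise client verbosity; adjust or remove if your bp.conf already sets it.
    echo 'VERBOSE = 5' >> /usr/openv/netbackup/bp.conf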

Solution 2

Check out the slabtop utility. Also, have you made sure there isn't an I/O-intensive cron job scheduled on your system around that time frame (e.g. mlocate/updatedb)?

slabtop --sort=c
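
To rule out scheduled jobs, a quick sweep of the usual cron locations is usually enough. This is only a sketch; the mlocate.cron path is what RHEL 5 normally ships, so adjust it if your layout differs.

    # System-wide crontab and the drop-in directories
    cat /etc/crontab
    ls -l /etc/cron.d /etc/cron.hourly /etc/cron.daily /etc/cron.weekly

    # Per-user crontabs (needs root)
    for u in $(cut -d: -f1 /etc/passwd); do crontab -l -u "$u" 2>/dev/null; done

    # updatedb is usually driven from here on RHEL 5
    cat /etc/cron.daily/mlocate.cron 2>/dev/null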

Comments

  • Tony Stark
    Tony Stark almost 2 years

    Every Friday from about 22:00 our server starts using large amounts of cache and then dies about two hours later. Please look at the following Cacti graph.

    Memory Usage

    We tried hunting for processes that use large amounts of memory with https://raw.githubusercontent.com/pixelb/ps_mem/master/ps_mem.py, but all it shows is the following.

    ...
    438.0 MiB +   1.1 MiB = 439.1 MiB       XXXEngine XXX 961f4dbc-3b01-0000-0080-ff115176831d xxx
    520.2 MiB +   1.7 MiB = 521.9 MiB       XXXEngine XXX f2ac330c-3a01-0000-0080-a2adb5561889 xxx
     10.4 GiB + 829.0 KiB =  10.4 GiB       java -server -Xms1G -Xmx5G -Djava.net.preferIPv4Stack=true -cp ../lib/hazelcast-3.2.2.jar:../lib/xxx.cache.jar com.hazelcast.examples.StartServer (2)
    ---------------------------------
                             28.1 GiB
    =================================
    

    This is nowhere near the 100G of cache. We were thinking that Linux may be using that much memory for caching disk I/O, so we used atop to measure it. This is what we get when we run atop -r atop-20140919-230002-062979000.bin -d -D (and, further below, with -c).

      PID   TID    RDDSK    WRDSK   WCANCL  DSK  CMD        1/405
        1     -   907.9G    17.0T     2.8T  97%  init
     6513     -   175.1G    46.1G     5.9G   1%  crond
     8842     -       8K   110.3G     128K   1%  xxxzmuc0
     6296     -     6.5G    25.1G    15.9G   0%  sshd
     4463     -    4668K    23.2G       0K   0%  kjournald
    19681     -    1835K    22.5G    22.4G   0%  xxxtroker
     4469     -    4728K    15.2G       0K   0%  kjournald
     4475     -    4716K    14.9G       0K   0%  kjournald
     2401     -     588K    11.4G       0K   0%  kjournald
     8652     -     7.0G     2.6G     1.3G   0%  k6gagent
    26093     -     9.5G       0K       0K   0%  bpbkar
    ...
    

    And atop with option -c.

      PID   TID S  DSK COMMAND-LINE (horizontal scroll with <- and -> keys)                                                                                                                                                                            1/405
        1     - S  97% init [3]
     6513     - S   1% crond
     8842     - S   1% xxzmuc0 -m XXX
     6296     - S   0% /usr/sbin/sshd
     4463     - S   0% kjournald
    19681     - S   0% xxxtroker XXX
     4469     - S   0% kjournald
     4475     - S   0% kjournald
     2401     - S   0% kjournald
     8652     - S   0% /opt/IBM/ITM/lx8266/6g/bin/k6gagent
    26093     - S   0% bpbkar -r 2678400 -ru root -dt 0 -to 0 -clnt ...
    ...
    

    So what I can see is that init has written 17 terabytes of data to disk, which seems like a lot. However, I have no idea how to find out what is causing this. I was of the opinion that Linux uses cache to speed up disk operations but gives it back when processes need it, and that it is not possible to kill a server with cache alone.

    We are on "Red Hat Enterprise Linux Server release 5.5 (Tikanga)" Linux delirium 2.6.18-194.26.1.el5 #1 SMP Fri Oct 29 14:21:16 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux.

    What should we do (next) to find out what's wrong?

    • Michael Hampton
      Michael Hampton almost 10 years
      Cache being in use isn't a problem by itself. It's generally a good thing. Though a graph like that suggests some sort of runaway process to me.
    • Dan Pritts
      Dan Pritts almost 10 years
      That's an awfully old kernel. The current(-ish) release is 5.11 with kernel-2.6.18-398.el5.x86_64. You might try running a full yum update. If you don't have an RHEL subscription anymore, you should be able to update against CentOS servers. Clearly, though, make and test a backup of the full system before you run such an update; with so many changes all at once it is possible something will go kaput. To directly troubleshoot your problem, I would also try logging what processes are running for a couple of hours before and after the crash (a rough sketch of such a snapshot job is at the end of this thread).
    • David Schwartz
      David Schwartz almost 10 years
      You need to understand why the machine is dying, not why it is using memory. Using memory is normal, dying is not. Can you describe the problem a bit better? When it "dies", what exactly happens? Does it crash? Reboot? Respond to pings? Log something?
    • Brian
      Brian almost 10 years
      Looks like something is pinning cache pages, and a lot of them. Otherwise the cache would be freed to satisfy other memory allocations instead of the server dying. Look for tmpfs or ramfs being used to hold a lot of data on a Friday, etc. (the sketch at the end of this thread includes a check for this).
    • Marcel
      Marcel almost 10 years
      Look in your Java servlet container logs for frequent garbage collections that take longer than a full GC cycle. If your GC cycles don't complete in time, or the collector isn't configured to be non-blocking, the system can't recover when a GC hits: it starts thrashing the OS, swap usage climbs to 100%, and that kind of volume ends up in the filesystem cache. IMHO your application is causing this memory allocation madness, not the backup.
  • Tony Stark
    Tony Stark over 9 years
    What a coincidence, we were thinking the same thing, and it was Symantec.
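
Following the suggestions in the comments above (log what is running around the Friday 22:00 window, and check whether tmpfs/ramfs or unreclaimable slab is pinning the cache), a minimal snapshot script could look like the sketch below. The names memwatch.sh and /var/log/memwatch.log are arbitrary examples, not anything prescribed in the thread.

    #!/bin/sh
    # /usr/local/sbin/memwatch.sh -- hypothetical snapshot script; run it from
    # cron every few minutes around the crash window, e.g. in /etc/cron.d/memwatch:
    #   */5 20-23 * * 5 root /usr/local/sbin/memwatch.sh
    {
        date
        # Overall memory and the biggest resident processes
        free -m
        ps aux --sort=-rss | head -30
        # Data held in tmpfs/ramfs pins page cache that cannot be reclaimed
        df -h -t tmpfs -t ramfs
        # Page cache, dirty pages and slab; SReclaimable/SUnreclaim may not
        # exist on a 2.6.18 kernel, in which case the grep simply omits them
        grep -E '^(Cached|Dirty|Writeback|Slab|SReclaimable|SUnreclaim):' /proc/meminfo
        echo '---- end of snapshot ----'
    } >> /var/log/memwatch.log 2>&1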