Very high load, apparently caused by pdflush
Comment converted to answer:

Apache PIDs 24801 and 25281 are doing by far the most I/O: 5.84 M/s and 5.75 M/s, respectively. I use `iotop -o` to exclude processes not doing I/O.
Updated on September 18, 2022

Comments
-
Dean Harding over 1 year
I have a server that's running CentOS 5, and it will periodically (a couple of times a day) have enormous spikes in load, and the whole server will grind to a halt. After a couple of minutes, the load will go back down and everything returns to normal.
I suspect it has something to do with I/O and maybe a bad disk, but as the disks use hardware RAID, I'm not sure how to find out what's going wrong (smartctl just says "Device does not support SMART").
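Hardware RAID usually hides the member disks from `smartctl`, but smartmontools has controller-specific passthrough modes that can often still reach them. The controller type here is unknown, so every device name and `-d` type below is an assumption to adapt:

```shell
# Some controllers expose the raw drives as SCSI generic devices:
ls /dev/sg* 2>/dev/null || true

# LSI/MegaRAID: query physical disk 0 behind the array device
#   smartctl -d megaraid,0 -a /dev/sda
# 3ware:
#   smartctl -d 3ware,0 -a /dev/twe0
# HP Smart Array:
#   smartctl -d cciss,0 -a /dev/sg0
```

If one of these matches the controller, SMART error logs from the member disks would confirm or rule out the bad-disk theory.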
So anyway, what I see from `top` is:

```
top - 08:51:03 up 73 days, 7:45, 1 user, load average: 69.00, 58.31, 46.89
Tasks: 316 total, 2 running, 314 sleeping, 0 stopped, 0 zombie
Cpu(s): 11.0%us, 1.3%sy, 0.0%ni, 15.2%id, 72.0%wa, 0.0%hi, 0.5%si, 0.0%st
Mem:   8299364k total,  7998520k used,   300844k free,    15480k buffers
Swap: 16779884k total,     4788k used, 16775096k free,  6547860k cached
```
As you can see, the load is ridiculously high. And `vmstat` shows:

```
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff   cache   si   so    bi    bo    in    cs us sy id wa st
24 16   5632 296080  23392 6317688    0    0     3    28     0     0  7  1 89  3  0
 0 22   5632 292644  23600 6325372    0    0    69 18781  1985  2318  9  2 14 75  0
 1 23   5656 299472  23756 6299140    0    0    44 18667  2075  3382 14  2 13 71  0
 0 23   5656 304756  24152 6295696    0    0    88 17002  1880  1445  4  1 16 78  0
 0 24   5656 296736  24488 6356564    0    0    60 17967  1841   990  2  1 20 76  0
 0 21   5672 302248  24764 6388424    0    0    66 17216  1820   749  2  1 24 73  0
```
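Those interval rows can be condensed with a one-line awk to put a number on the I/O wait; a small sketch, where `vmstat_sample` is just the first three sample rows pasted from the run above ("wa" is the 16th field):

```shell
# Average the I/O-wait (wa) column from a captured vmstat run.
vmstat_sample='24 16 5632 296080 23392 6317688 0 0 3 28 0 0 7 1 89 3 0
0 22 5632 292644 23600 6325372 0 0 69 18781 1985 2318 9 2 14 75 0
1 23 5656 299472 23756 6299140 0 0 44 18667 2075 3382 14 2 13 71 0'

echo "$vmstat_sample" | awk '{ sum += $16; n++ } END { printf "avg wa: %.0f%%\n", sum / n }'
```

On these samples that prints `avg wa: 50%` — the box is spending roughly half its time stalled on disk.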
It's the really high "wa" value that looks way off to me. Also, `iotop` gives:

```
Total DISK READ: 77.37 K/s | Total DISK WRITE: 15.81 M/s
  TID  PRIO  USER      DISK READ   DISK WRITE  SWAPIN      IO>    COMMAND
25647  be/4  apache    73.50 K/s     0.00 B/s   0.00 %  99.99 %  httpd
24387  be/4  root       0.00 B/s     0.00 B/s  99.99 %  99.99 %  [pdflush]
23813  be/4  root       0.00 B/s     0.00 B/s   0.00 %  99.99 %  [pdflush]
25094  be/4  root       0.00 B/s     0.00 B/s  96.72 %  99.99 %  [pdflush]
25093  be/4  root       0.00 B/s     0.00 B/s  99.99 %  99.99 %  [pdflush]
25095  be/4  root       0.00 B/s     0.00 B/s  99.99 %  99.99 %  [pdflush]
25091  be/4  root       0.00 B/s     0.00 B/s   0.00 %  99.99 %  [pdflush]
24389  be/4  root       0.00 B/s     0.00 B/s  99.99 %  99.99 %  [pdflush]
24563  be/4  root       0.00 B/s     0.00 B/s  99.99 %  99.99 %  [pdflush]
24390  be/4  apache     0.00 B/s    23.21 K/s  96.71 %  99.99 %  httpd
24148  be/4  apache     0.00 B/s     0.00 B/s  96.71 %  99.99 %  httpd
24699  be/4  apache     0.00 B/s     0.00 B/s  99.99 %  99.99 %  httpd
23973  be/4  apache     0.00 B/s     0.00 B/s  99.99 %  99.99 %  httpd
24270  be/4  apache     0.00 B/s     0.00 B/s  99.99 %  99.99 %  httpd
24298  be/4  apache     0.00 B/s  1918.82 K/s  96.71 %  99.02 %  httpd
  628  be/3  root       0.00 B/s     0.00 B/s   0.00 %  97.51 %  [kjournald]
25092  be/4  root       0.00 B/s     0.00 B/s   0.00 %  96.72 %  [pdflush]
24258  be/4  root       0.00 B/s     0.00 B/s  99.99 %  96.71 %  [pdflush]
23814  be/4  root       0.00 B/s     0.00 B/s   0.00 %  96.71 %  [pdflush]
24388  be/4  root       0.00 B/s     0.00 B/s  99.02 %  96.71 %  [pdflush]
25545  be/4  apache     0.00 B/s     0.00 B/s   0.19 %  92.73 %  httpd
25274  be/4  apache     0.00 B/s     0.00 B/s   0.00 %  92.38 %  httpd
24801  be/4  apache     0.00 B/s     5.84 M/s  99.99 %  91.63 %  httpd
25281  be/4  apache     0.00 B/s     5.75 M/s   0.00 %  91.33 %  httpd
26115  be/4  apache     0.00 B/s     0.00 B/s   9.60 %  19.26 %  httpd
25561  be/4  apache     0.00 B/s     3.87 K/s   0.00 %   9.66 %  httpd
26035  be/4  apache     0.00 B/s     0.00 B/s   0.00 %   9.63 %  httpd
```
Finally, I get the following from `sar -d 5 0`:

```
Linux 2.6.18-308.1.1.el5PAE (ausbt.com.au)   23/08/12

08:55:45     DEV      tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz   await  svctm  %util
08:55:50  dev8-0   877.25    103.79  29306.19     33.53    158.81  179.28   1.14  99.84
08:55:50  dev8-1     0.00      0.00      0.00      0.00      0.00    0.00   0.00   0.00
08:55:50  dev8-2     0.00      0.00      0.00      0.00      0.00    0.00   0.00   0.00
08:55:50  dev8-3   877.25    103.79  29306.19     33.53    158.81  179.28   1.14  99.84
```
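As a sanity check, `sar` counts throughput in 512-byte sectors, so the `wr_sec/s` figure above converts to bytes per second straightforwardly:

```shell
# wr_sec/s * 512 bytes/sector, expressed in MB/s.
# 29306.19 is the dev8-0 figure from the sar output above.
awk 'BEGIN { printf "%.1f MB/s\n", 29306.19 * 512 / 1000000 }'
```

That prints `15.0 MB/s`, which lines up with iotop's Total DISK WRITE of 15.81 M/s — the entire write load is landing on one device that is pegged at 99.84% utilization with 179 ms average waits.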
This has only recently started happening (at least, I only noticed it recently), and nothing on the server has changed, which is why I suspect some kind of hardware failure — but I'm not sure where to begin looking.
Update
Thanks to Mark Wagner's hint, I did an `strace` of one of the processes that was doing MB/s of I/O and discovered that it was writing to files called `/tmp/magick-XXXXXXX`. Here's the output of `ls -l /tmp/magick-XX*`:

```
-rw------- 1 apache apache 1854881318400 Aug 20 04:26 /tmp/magick-XXrQahSe
-rw------- 1 apache apache 1854881318400 Aug 20 04:26 /tmp/magick-XXTaXatz
-rw------- 1 apache apache 1854881318400 Aug 20 04:26 /tmp/magick-XXtf25pe
```
Wow! Those files are from a couple of days ago, but there are also files from today of a similar size. My code dynamically generates thumbnails of images using ImageMagick, so perhaps there's a corrupted image somewhere that's causing ImageMagick to freak out and write ~1.7-terabyte files to /tmp.
I'll do some more poking around and post an update when I find more. Thanks everybody for the hints so far.
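One belt-and-braces measure while investigating is to cap ImageMagick's resource usage: it honors `MAGICK_*` environment variables (and per-invocation `-limit` flags), so a runaway conversion fails fast instead of filling the disk. The specific values and the `/var/tmp/magick` path below are illustrative assumptions, not recommendations:

```shell
# Cap ImageMagick's scratch usage so one bad image fails fast instead of
# writing terabyte pixel caches to /tmp. Values are illustrative guesses.
export MAGICK_MEMORY_LIMIT=256MiB     # in-RAM pixel cache
export MAGICK_MAP_LIMIT=512MiB        # memory-mapped pixel cache
export MAGICK_DISK_LIMIT=1GiB         # on-disk cache (the /tmp/magick-* files)
export MAGICK_TMPDIR=/var/tmp/magick  # keep scratch files off /tmp
mkdir -p "$MAGICK_TMPDIR"

# Thumbnailing then runs under these caps, e.g.:
#   convert photo.jpg -resize '200x200>' thumb.jpg
```

With the disk limit in place, a conversion that would have needed a 1.7 TB pixel cache aborts with an error instead of hanging the box.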
-
DerfK over 11 years: Some hardware RAID controllers expose raw drive interfaces as SCSI generic devices (`sg#`) for the purpose of querying SMART information (see here), but I'm not sure how to tell which `sg#` devices, if any, correspond to which drives.
-
David Schwartz over 11 years: My bet is that you have some process calling `fsync`, forcing a massive writeout. The default tuning aims for good average behavior at the cost of the occasional explosion; you can tune for worse average behavior but better worst-case behavior by dropping `dirty_ratio` or `dirty_expire_centisecs`. (What RAID controller are you using?)
-
Dean Harding over 11 years: @MarkWagner Thanks for the hint, good spotting! I've updated the question with what I've discovered so far, but I think you've given me enough of a hint to continue the investigation :)
-
ewwhite over 11 years: Can you provide information on what hardware components you're using?
-
Dean Harding over 11 years: Thanks all for the input, but @MarkWagner's suggestion was the one. In the end it was a request for an image thumbnail at 300000x200000 pixels that was causing ImageMagick to go nuts. I don't know if it was some smart alec trying to DoS us or just a weird link somewhere, but either way I've fixed the code to limit the size of the thumbnails we generate :) Mark, if you want the rep, you can convert your comment to an answer and I'll mark it as correct.
-
David Schwartz over 11 years: Sounds like a DoS attack to me.
-
Dean Harding over 11 years: I'm not sure, but the two processes shown doing ~5 MB/s of I/O in my screengrab above were definitely doing a lot of I/O.
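The fix Dean describes — limiting the size of generated thumbnails — can be sketched as a simple pixel-count guard in shell. The `dims_ok` helper and the 25-megapixel cap are hypothetical choices; in a real handler the width and height would come from something like ImageMagick's `identify -format '%w %h'` before any `convert` runs:

```shell
# Reject absurd thumbnail requests before ImageMagick ever sees them.
# The 25-megapixel cap and the dims_ok name are illustrative choices.
max_pixels=25000000

dims_ok() {
    # usage: dims_ok WIDTH HEIGHT
    [ $(( $1 * $2 )) -le "$max_pixels" ]
}

if dims_ok 300000 200000; then    # the request from the incident
    echo "ok to thumbnail"
else
    echo "refused: 300000x200000 exceeds cap"
fi
```

The 300000x200000 request from the incident (60 billion pixels) fails the check immediately, so ImageMagick never gets the chance to allocate a terabyte-scale pixel cache.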