High server load - [jbd2/md1-8] using 99.99% IO

Solution 1

This is not really an answer, as there is not enough context to give the exact cause, but it is a description of how I managed to track this down when it happened to me.

I noticed my jbd2/md0-8 kept showing up at the top of iotop. I looked in /sys/kernel/debug/tracing/events/jbd2 to see what options there were to determine what jbd2 was doing.
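
A quick way to see which jbd2 tracepoints exist (the exact list depends on your kernel):

# ls /sys/kernel/debug/tracing/events/jbd2/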

NOTE-1: To see the output of enabled tracing events, run cat /sys/kernel/debug/tracing/trace_pipe - I had this running in a terminal while enabling/disabling traces.

NOTE-2: To enable an event for tracing, use e.g. echo 1 > /sys/kernel/debug/tracing/events/jbd2/jbd2_run_stats/enable. To disable it, use echo 0 > /sys/kernel/debug/tracing/events/jbd2/jbd2_run_stats/enable.

I started by enabling /sys/kernel/debug/tracing/events/jbd2/jbd2_run_stats/enable - but there was nothing that seemed particularly interesting in its output. I tried a few other events, and when I enabled /sys/kernel/debug/tracing/events/jbd2/jbd2_commit_flushing/enable I saw that a commit flush was occurring every second:

# cat /sys/kernel/debug/tracing/trace_pipe
...
jbd2/md0-8-2520  [004] .... 658660.216492: jbd2_commit_flushing: dev 9,0 transaction 32856413 sync 0
jbd2/md0-8-2520  [001] .... 658661.334900: jbd2_commit_flushing: dev 9,0 transaction 32856414 sync 0
jbd2/md0-8-2520  [001] .... 658661.394113: jbd2_commit_flushing: dev 9,0 transaction 32856415 sync 0
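
As an aside, the dev 9,0 in these lines is a major:minor pair; major 9 is the md driver, so it maps straight back to /dev/md0 here. You can confirm the numbers on the device node:

# ls -l /dev/md0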

This looked like it was related to sync(2)/fsync(2)/msync(2), so I looked for some way to link this to a process and found this:

# find /sys/kernel/debug/tracing/events/ | grep sync.*enable
...
/sys/kernel/debug/tracing/events/ext4/ext4_sync_file_enter/enable
...
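
Enabling it works the same way as the jbd2 events above:

# echo 1 > /sys/kernel/debug/tracing/events/ext4/ext4_sync_file_enter/enable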

When I enabled it I saw the following output:

# cat /sys/kernel/debug/tracing/trace_pipe
...
      nzbget-17367 [002] .... 658693.222288: ext4_sync_file_enter: dev 9,0 ino 301924373 parent 301924357 datasync 1 
  jbd2/md0-8-2520  [001] .... 658693.284080: jbd2_commit_flushing: dev 9,0 transaction 32856465 sync 0
      nzbget-17367 [000] .... 658693.334267: ext4_sync_file_enter: dev 9,0 ino 301924357 parent 301924353 datasync 1 
  jbd2/md0-8-2520  [002] .... 658693.334275: jbd2_commit_flushing: dev 9,0 transaction 32856466 sync 0
      nzbget-17367 [001] .... 658694.369514: ext4_sync_file_enter: dev 9,0 ino 301924367 parent 301924357 datasync 1 
  jbd2/md0-8-2520  [002] .... 658694.414861: jbd2_commit_flushing: dev 9,0 transaction 32856467 sync 0
      nzbget-17367 [001] .... 658694.470872: ext4_sync_file_enter: dev 9,0 ino 301924357 parent 301924353 datasync 1 
  jbd2/md0-8-2520  [002] .... 658694.470880: jbd2_commit_flushing: dev 9,0 transaction 32856468 sync 0

This gave me the process name/pid - and after doing some more debugging of this process (nzbget) I discovered it was calling fsync(2) every second. After I changed its config (FlushQueue=no - undocumented, I think; I found it in the source) to stop the per-second fsync(2), the problem went away.
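
If you want to confirm this kind of behaviour for a suspect process without the tracing setup, attaching strace to the pid from the trace output works too (assuming strace is installed; 17367 was the nzbget pid in my case):

# strace -f -tt -e trace=fsync,fdatasync -p 17367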

My kernel version is 4.4.6-gentoo. I think there were some options I enabled (either manually or with make oldconfig) at some point in the kernel config to get /sys/kernel/debug with these events - so if you don't have it, look around the internet for more information on enabling it.
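
The options involved are most likely CONFIG_DEBUG_FS and CONFIG_FTRACE (plus the event tracing options under it), and debugfs may simply not be mounted (the mount line below is only needed if /sys/kernel/debug is empty). A rough sanity check, assuming your kernel exposes its config via /proc/config.gz or a /boot/config-* file:

# zgrep -E 'CONFIG_DEBUG_FS|CONFIG_FTRACE' /proc/config.gz
# mount -t debugfs none /sys/kernel/debug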

Solution 2

This seems to be a journal-update-related thing. How many disks is the software RAID made up of? Can you show me the command used to create it?
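
If you don't have the original command handy, the array layout can be read back from the running system, for example (adjust the device name to whichever md device is busy):

cat /proc/mdstat
mdadm --detail /dev/md1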

Can you also pastebin the dumpe2fs output? First, identify the physical device where you see the load - df will tell you this. Then:

dumpe2fs /dev/sdaX > /tmp/dump

For your case, it is probably /dev/md1, since the busy thread is jbd2/md1-8.
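
If the full dump is too large to paste, the journal-related fields from the superblock alone would already help, for example:

dumpe2fs -h /dev/md1 | grep -i journal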

Also, run this at the time of the high IO issue:

iostat -xdk 1 25

I don't know CloudLinux, but is the tool blktrace available under it?
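
If it is, running something like this during a spike would show which processes are issuing the writes (a live trace piped straight into blkparse; adjust the device name):

blktrace -d /dev/md1 -o - | blkparse -i -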

Comments

  • Alex
    Alex almost 2 years

    I've been having spikes in load over the last week. This usually occurs once or twice a day. I've managed to identify from iotop that [jbd2/md1-8] is using 99.99% IO. During the high-load times there is no high traffic to the server.

    Server specs are:

    • AMD Opteron, 8 cores
    • 16 GB RAM
    • 2 × 2 TB 7,200 RPM HDDs in software RAID 1
    • CloudLinux + cPanel
    • MySQL is properly tuned

    Apart from the spikes, the load is usually around 0.80 at most.

    I've searched around but can't find what [jbd2/md1-8] does exactly. Has anyone had this problem or does anyone know a possible solution?

    Thank you.

    UPDATE:

    TIME        TID     PRIO     USER    DISK READ    DISK WRITE    SWAPIN  IO       COMMAND
    16:05:36     399     be/3    root    0.00 B/s      38.76 K/s    0.00 %  99.99 %  [jbd2/md1-8]
    
    • mbrownnyc
      mbrownnyc over 11 years
      en.wikipedia.org/wiki/Journaling_block_device & linux.die.net/man/4/md point to this being software RAID related.
    • Alex
      Alex over 11 years
      Thanks for your reply. After doing some digging I found that it's related to the software RAID. Do you know of any solution? The strange thing is that it started happening just a week ago, after almost 3 months of no problems.
    • slm
      slm over 11 years
      How did you determine the IO is 99.99%? Did you use iostat? Can you run a little of that (say iostat 5) for a bit and share the output?
    • Alex
      Alex over 11 years
      I enabled logging for iotop and looked at the log for the interval in which the load occurred. The load is low now, so there's no point running it, but I will do it the next time it occurs. Thanks for your reply.
    • Satanicpuppy
      Satanicpuppy over 9 years
      I just ran into this exact issue. What did your final solution end up being?
  • Alex
    Alex over 11 years
    Hi Soham, thanks for your reply. There are 2 disks in the array. As for dumpe2fs, can you please give me the full command you want me to run? Thanks for helping.
  • Soham Chakraborty
    Soham Chakraborty over 11 years
    Alex, edited the answer.
  • TomTom
    TomTom over 9 years
    Never forget that this is not really even a mid-performance setup as far as the disks go - "slow as a workstation" describes it better.
  • jdhildeb
    jdhildeb about 5 years
    Nice sleuthing. This is very helpful.
  • astrojuanlu
    astrojuanlu about 5 years
    Thanks a lot for detailing the whole process!