Massive, unpredictable I/O performance drop in Linux


Solution 1

I managed to reproduce the problem, and it turned out to be the result of a large disk cache. My disk cache can grow beyond 8GB, and it seems that some applications don't cope well with that, so I/O suffers.

Dropping the disk caches by running echo 3 > /proc/sys/vm/drop_caches as root remedies the problem. I don't yet know why a large disk cache causes this I/O degradation.
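
The workaround above can be sketched as a small script. This is a minimal sketch, not a definitive procedure: the function names cache_mib and drop_caches are made up for illustration, and the sync before the drop is an added precaution so that dirty pages are written out first. Writing 3 to drop_caches frees the page cache plus dentries and inodes.

```shell
#!/bin/sh
# Print the page-cache size in MiB from a meminfo-style file.
# Defaults to /proc/meminfo; the optional argument is only there for testing.
cache_mib() {
    awk '$1 == "Cached:" { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}

# Flush dirty pages, then drop page cache, dentries and inodes.
# Hypothetical helper: must be run as root, and is not called automatically.
drop_caches() {
    echo "page cache before: $(cache_mib) MiB"
    sync
    echo 3 > /proc/sys/vm/drop_caches
    echo "page cache after:  $(cache_mib) MiB"
}
```

Call drop_caches from a root shell. Note that sudo echo 3 > /proc/sys/vm/drop_caches does not work, because the redirection is performed by the unprivileged calling shell; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' does.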

Last Update: After more investigation, I found that the number of files in the cache was triggering the problem. The system was thrashing the disks while trying to commit many small files back to disk. Since I had been using the same installation for ten years, I took the plunge and reinstalled with 64-bit Debian. Now it's working smoothly. The problem was probably a side effect of ten years of upgrades combined with hitting the limits of a 32-bit operating system.
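
To check whether a system is in a similar state, something like the following could be used. It is only a sketch under assumptions: it treats the Dirty and Writeback counters in /proc/meminfo and the dentry/inode counts under /proc/sys/fs as rough proxies for "many small files waiting in the cache", which the original diagnosis does not spell out. The function names are hypothetical.

```shell
#!/bin/sh
# Print the value (in kB) of one /proc/meminfo field, e.g. Dirty or Writeback.
# The optional second argument is a meminfo-style file, used for testing.
meminfo_kb() {
    awk -v field="$1:" '$1 == field { print $2 }' "${2:-/proc/meminfo}"
}

# Print how much data is waiting to be written and how many
# dentries and inodes the kernel currently has allocated.
cache_pressure() {
    echo "dirty:     $(meminfo_kb Dirty) kB"
    echo "writeback: $(meminfo_kb Writeback) kB"
    echo "dentries:  $(awk '{ print $1 }' /proc/sys/fs/dentry-state)"
    echo "inodes:    $(awk '{ print $1 }' /proc/sys/fs/inode-nr)"
}
```

Running cache_pressure once while the system is healthy and again during a slowdown gives two snapshots to compare.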

Solution 2

Are there any suspicious messages in dmesg?

Some more tools you could try to gain some insights into your system's bottlenecks:

  • dstat
  • latencytop
  • sysprof
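
If none of these tools is installed, per-device throughput can also be read straight from /proc/diskstats. The following is a rough sketch, not a replacement for the tools above: the function names are made up, the device name (e.g. sda) is an assumption about your system, and the interval must be a whole number of seconds.

```shell
#!/bin/sh
# Print "sectors_read sectors_written" for one device.
# The optional second argument is a diskstats-style file, used for testing.
disk_sectors() {
    awk -v dev="$1" '$3 == dev { print $6, $10 }' "${2:-/proc/diskstats}"
}

# Print read/write throughput in kB/s for a device once per interval.
watch_disk() {
    dev=$1
    interval=${2:-1}
    set -- $(disk_sectors "$dev")
    if [ -z "$1" ]; then
        echo "device $dev not found" >&2
        return 1
    fi
    prev_r=$1
    prev_w=$2
    while sleep "$interval"; do
        set -- $(disk_sectors "$dev")
        # diskstats counts 512-byte sectors, so sectors / 2 = kB
        echo "$dev read $(( ($1 - prev_r) / 2 / interval )) kB/s," \
             "write $(( ($2 - prev_w) / 2 / interval )) kB/s"
        prev_r=$1
        prev_w=$2
    done
}
```

For example, watch_disk sda 2 prints one line every two seconds until interrupted; a sustained rate of around 5MB/sec on every disk would match the symptom described in the question.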

Author: bayindirh

Updated on September 18, 2022

Comments

  • bayindirh
    bayindirh over 1 year

    I've been using Debian testing without any problems for ~6 years (I just update it regularly), but recently it started to show random behaviour that can be summarized as "Low I/O performance which persists until reboot".

    The problem is that all disk reads and writes suddenly slow down to ~5MB/sec, which results in continuous reading and writing. Since the rate is so low, the disks are not mechanically challenged or stressed, but everything stays slow until I reboot.

    The I/O subsystem of the computer consists of one OCZ Vertex 3 SSD and two WD Caviar Black HDDs. The SSD holds the read-heavy part of the OS, and a partition on one of the HDDs holds the rest.

    To diagnose the problem I tried the following without success:

    • top doesn't show any runaway activity in either CPU or I/O usage.
    • hdparm returns normal performance ratings for the disks (I only checked -t though).
    • smartctl doesn't show any performance problems with the disks. Long tests showed that the disks are as good as new.

    The system has a Z77 chipset, 16GB of RAM and an Intel i7 3770K CPU, and the stats show no signs of saturation in RAM, I/O or CPU, but I'm not experienced enough to debug problems like this (esp. in kernel space). Any help will be appreciated.

    Update 1:

    • I ran (forced) fsck on every partition as a precaution. All FS are clean.
    • Incidentally I found a BIOS upgrade which came out a month ago & applied it.
    • No partition is filled more than 50%.

    Update 2:

    The problem has not resurfaced for two days. Either the fsck or the BIOS update cleared some clog in the system. I'm still monitoring the issue and will close the question with a post-mortem answer.

    Update 3:

    Problem just resurfaced and I did some more digging. Please see the answer.

    • Stéphane Chazelas
      Stéphane Chazelas over 10 years
      It could be a fragmentation issue. atop would tell you how busy the disks are (like when seeking all the time).
    • bayindirh
      bayindirh over 10 years
      @StephaneChazelas, I'll look at it when I go home, but the partitions are pretty much empty, and generally an event triggers the slowness. If that "event" doesn't happen, the system works like it should. Also, the disks don't sound "that busy" (I worked as an HPC cluster administrator for ~5 years, but never had a problem like this).
    • frostschutz
      frostschutz over 10 years
      Just to rule out some quirks, disable NCQ and set the I/O scheduler to noop.
    • msw
      msw over 10 years
      "Low I/O performance which persists until reboot" can be caused by a broken/buggy device that seizes the bus too often for too long, which is maddeningly hard to diagnose short of swapping out hardware.
    • bayindirh
      bayindirh over 10 years
      @msw I can get full speed from the disks with hdparm even in the low I/O situation. This makes the situation stranger.
    • chrishollinworth
      chrishollinworth over 10 years
      How are your filesystems and block devices configured? Are you using md? LVM?
    • bayindirh
      bayindirh over 10 years
      @symcbean It's vanilla GPT partitioning. Nothing advanced like md or LVM. One of the Caviars is a single-partition storage disk. The other has parts of the OS as partitions plus another storage partition. The SSD has the homes and the read-heavy parts of the OS.
    • chrishollinworth
      chrishollinworth over 10 years
      Then the next thing on my list would be to check the logs for errors and to check that there's plenty of memory allocated to buffers/cache (see the output of free).
    • Bratchley
      Bratchley over 10 years
      Just to localize it a bit more: is iowait shooting up, either on the whole or for a particular process? If you check the logs or schedule self-tests via smartctl, does that give you more information to work with?
    • bayindirh
      bayindirh over 10 years
      @JoelDavis smartctl tests are all clean. I ran them while diagnosing the problem. Currently I cannot reproduce the issue, but the monitoring hasn't ended yet.
    • Avio
      Avio over 7 years
      I'm having the exact same issue on an old laptop upgraded, version by version, up to 14.04 (kernel 4.4.0-38-generic from xenial-lts). For some time now, both read and write speeds have been very slow (see: dropbox.com/s/z1fab4fb563bhqn/…). hdparm -t --direct /dev/sda says: /dev/sda: Timing O_DIRECT disk reads: 78 MB in 3.07 seconds = 25.41 MB/sec. So it's not an HDD issue, but something about Linux. I suspect that this old ext3 filesystem being mounted by the ext4 subsystem is the cause, so I'll have to do the tedious job of rsyncing everything.
    • bayindirh
      bayindirh over 7 years
      @Avio, it might not be an ext3/ext4 issue. My problem was the number of files in the cache. Can you please try the solution in my answer and see whether it remedies the problem, at least temporarily? Also, please take a look at the output of free, the size of the disk cache, and the disk activity when things slow down. Is your disk constantly writing something small in that case?
  • bayindirh
    bayindirh over 10 years
    Nothing suspicious in any logs. TBH, there are no log entries related to this problem. I'll try the tools nevertheless. There shouldn't be a bottleneck in a high-end PC sitting idle without anything running on it. I think a cache or something related to the I/O subsystem goes awry.
  • chrishollinworth
    chrishollinworth over 10 years
    ....and iotop, fio