VMware ESXi 5 Slow performance, hundreds of I/O latency errors.

Solution 1

Honestly, you may have solved your own problem!

  • You've identified the effects of the issue... and a possible source.
  • You've verified that it can work on a similar setup.
  • You've observed bad behavior on a single machine.
  • You did NOT replace the chassis or backplane. Your issues probably lie there.
  • You bought Supermicro, which does not have the same level of polish or quality-control consistency as IBM, HP or Dell's offerings.

This happens. Replace the server and move on.

Solution 2

Not really a question, but...

It's possible that your RAID controller switched to write-through mode. One reason could be a faulty BBU (or its learn cycle). This can reduce performance greatly.

Author by

diomonogatari

Updated on September 18, 2022

Comments

  • diomonogatari
    diomonogatari almost 2 years

    We have a standalone ESXi5 server with the following hardware specs:

    - Supermicro X8DTL
    - Intel Xeon(R) CPU E5506 2.13GHz
    - 25 GB RAM
    - 1TB HD (mirrored RAID, local SATA)

    We have around 17 VMs running with ~512MB each, hosting web and DB servers.

    Around a month ago the server crashed. On investigation we found errors similar to these in /scratch/log/vobd.log:

    2013-02-21T23:30:14.054Z: [scsiCorrelator] 1657239493834us: [vob.scsi.device.io.latency.improved] Device mpx.vmhba2:C0:T0:L0 performance has improved. I/O latency reduced from 1310595 microseconds to 260642 microseconds.
    2013-02-21T23:30:17.888Z: [scsiCorrelator] 1657243328201us: [vob.scsi.device.io.latency.improved] Device mpx.vmhba2:C0:T0:L0 performance has improved. I/O latency reduced from 260642 microseconds to 85292 microseconds.
    2013-02-21T23:30:39.275Z: [scsiCorrelator] 1657264714482us: [vob.scsi.device.io.latency.high] Device mpx.vmhba2:C0:T0:L0 performance has deteriorated. I/O latency increased from average value of 43610 microseconds to 1310310 microseconds.
    2013-02-21T23:30:39.275Z: [scsiCorrelator] 1657263440772us: [esx.problem.scsi.device.io.latency.high] Device mpx.vmhba2:C0:T0:L0 performance has deteriorated. I/O latency increased from average value of 43610 microseconds to 1310310 microseconds.
    2013-02-21T23:30:42.796Z: [scsiCorrelator] 1657268235408us: [vob.scsi.device.io.latency.improved] Device mpx.vmhba2:C0:T0:L0 performance has improved. I/O latency reduced from 1310310 microseconds to 257850 microseconds.
    2013-02-21T23:30:44.392Z: [scsiCorrelator] 1657269831493us: [vob.scsi.device.io.latency.improved] Device mpx.vmhba2:C0:T0:L0 performance has improved. I/O latency reduced from 257850 microseconds to 86289 microseconds.
    2013-02-21T23:32:29.119Z: [scsiCorrelator] 1657374559512us: [vob.scsi.device.io.latency.high] Device mpx.vmhba2:C0:T0:L0 performance has deteriorated. I/O latency increased from average value of 43613 microseconds to 1405607 microseconds.
    2013-02-21T23:32:29.120Z: [scsiCorrelator] 1657373285533us: [esx.problem.scsi.device.io.latency.high] Device mpx.vmhba2:C0:T0:L0 performance has deteriorated. I/O latency increased from average value of 43613 microseconds to 1405607 microseconds.
    2013-02-21T23:32:35.673Z: [scsiCorrelator] 1657381113191us: [vob.scsi.device.io.latency.improved] Device mp
    

    On the day of the crash we had almost 5000 of these errors; since then we have had as few as 2 and as many as 500 per day (though no full server crashes). On the guest VMs we are experiencing slow disk reads and writes during normal use. Something as simple as a find command on / causes large spikes in the performance chart.

    We have replaced both HDs and the RAID controller. A server with an identical setup and a similar number of VMs does not have these issues. Before the first crash (the one with 5k errors) performance was fine, although the logs show the same error ~30-40 times a day. A few days before this crash we thin-provisioned a large (160GB) HD for a guest VM.

    The following table lists, for each day: the date, the number of times the error message appears, the average of the latencies logged before the error, and the average of the latencies logged after it. Latencies are in microseconds, as in the log.

    2012-10-24    16           976     138,666
    2012-10-28    12         1,020      40,421
    2012-11-05    16         1,167     273,223
    2012-11-06    20         1,226      89,181
    2012-11-07    40         1,314     224,957
    2012-11-08    48         1,378     165,349
    2012-11-09    42         1,441     174,061
    2012-11-10    26         1,519     218,381
    2012-11-11     8         1,567     112,229
    2012-11-12    24         1,593     233,350
    2012-11-13    54         1,641     193,695
    2012-11-14    80         1,692     222,456
    2012-11-15    32         1,738     243,640
    2012-11-16    66         1,776     325,366
    2012-11-17    30         1,816     176,468
    2012-11-18    38         1,850     264,176
    2012-11-20    12         1,846     117,589
    2012-11-21    34         1,868     252,732
    2012-11-22    44         1,895     166,636
    2012-11-23    12         1,926     123,632
    2012-11-26     4         1,892      98,791
    2012-11-27    14         1,899     184,382
    2012-11-28    20         1,916     178,908
    2012-11-29    10         1,923     134,338
    2012-11-30     6         1,923      69,203
    2012-12-01     2         1,924      60,052
    2012-12-02     4         1,919     122,631
    2012-12-03     8         1,898     126,051
    2012-12-04    54         1,909     199,758
    2012-12-05   462         2,109     394,950
    2012-12-06    36         2,228     191,166
    2012-12-07    64         2,245     204,348
    2012-12-08    32         2,271     294,890
    2012-12-10   140         2,290     302,435
    2012-12-11   314         2,386     311,973
    2012-12-12   150         2,475     261,258
    2012-12-13   160         2,532     236,761
    2012-12-14   114         2,585     206,043
    2012-12-15    84         2,618     211,221
    2012-12-16    52         2,640     256,677
    2012-12-17    18         2,637     180,975
    2012-12-18    62         2,649     228,785
    2012-12-19    92         2,669     199,357
    2012-12-20   160         2,707     275,119
    2012-12-21   124         2,749     245,460
    2012-12-22     2         2,763     102,838
    2012-12-26   144         2,736     302,383
    2012-12-27   140         2,776     292,725
    2012-12-28    64         2,813     274,609
    2012-12-30   106         2,811     231,112
    2012-12-31   148         2,853     295,416
    2013-01-01    12         2,881     204,615
    2013-01-04     4         2,860      90,300
    2013-01-09   246         2,849     279,765
    2013-01-10   278         2,909     301,014
    2013-01-11   242         2,966     294,417
    2013-01-12    92         3,006     308,232
    2013-01-14   248         3,036     271,435
    2013-01-15   426         3,172     233,094
    2013-01-16   388         3,313     276,185
    2013-01-17   342         3,423     282,632
    2013-01-18   298         3,517     255,919
    2013-01-19   232         3,579     287,905
    2013-01-20     8         3,611     128,877
    2013-01-21     2         3,614     121,942
    2013-01-22   142         3,667     265,338
    2013-01-23   402         3,738     281,091
    2013-01-24   332         3,826     280,295
    2013-01-25   178         3,892     270,747
    2013-01-26   280         4,018     319,368
    2013-01-27   106         4,075     293,760
    2013-01-28   610         4,187     213,410
    2013-01-29   784         4,700     222,077
    2013-01-30   386         5,236     258,133
    2013-01-31  4580         8,261   1,681,902
    2013-02-01     2        11,211     339,135
    2013-02-02    10        38,909   1,200,144
    2013-02-04    18        88,573   2,692,687
    2013-02-05   190        67,454   2,094,093
    2013-02-06   460        58,534   1,858,435
    2013-02-07    98        57,683   1,795,912
    2013-02-08    62        54,012   1,671,730
    2013-02-09    88        52,681   1,711,773
    2013-02-10    66        51,016   1,549,408
    2013-02-11    84        48,885   1,639,267
    2013-02-12   206        48,364   1,829,969
    2013-02-13   562        48,651   1,774,433
    2013-02-14   170        48,957   1,655,395
    2013-02-15   124        47,055   1,550,294
    2013-02-16   140        46,099   1,588,326
    2013-02-17   110        45,283   1,485,211
    2013-02-18    34        43,836   1,356,562
    2013-02-19   326        43,608   1,484,757
    2013-02-20   224        43,894   1,581,129
    2013-02-21   296        43,626   1,568,687
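
A per-day summary like the table above can be rebuilt from vobd.log with a short script. The following is only a sketch, not the script the poster used: the regular expression is derived from the log excerpt in this question, and it assumes the table counts the "deteriorated" messages, including both the vob.* and esx.problem.* copies of each event. The names `LATENCY_RE`, `summarize` and `SAMPLE` are made up for illustration.

```python
import re
from collections import defaultdict

# Pattern derived from the vobd.log excerpt in the question; other ESXi
# builds may word these messages differently (assumption).
LATENCY_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})T\S+ \[scsiCorrelator\] \d+us: "
    r"\[(?P<event>[\w.]+)\] Device (?P<device>\S+) performance has "
    r"(?:deteriorated|improved)\. I/O latency "
    r"(?:increased from average value of|reduced from) "
    r"(?P<before>\d+) microseconds to (?P<after>\d+) microseconds\."
)


def summarize(lines):
    """Return {date: (count, mean_before_us, mean_after_us)}.

    Only "latency.high" (deteriorated) messages are counted. Lines that do
    not match the pattern, such as truncated ones, are skipped.
    """
    totals = defaultdict(lambda: [0, 0, 0])  # count, sum before, sum after
    for line in lines:
        m = LATENCY_RE.match(line.strip())
        if m is None or not m["event"].endswith("io.latency.high"):
            continue
        entry = totals[m["date"]]
        entry[0] += 1
        entry[1] += int(m["before"])
        entry[2] += int(m["after"])
    return {d: (n, b / n, a / n) for d, (n, b, a) in sorted(totals.items())}


# Three lines copied from the excerpt: one "improved", two "deteriorated".
SAMPLE = [
    "2013-02-21T23:30:14.054Z: [scsiCorrelator] 1657239493834us: "
    "[vob.scsi.device.io.latency.improved] Device mpx.vmhba2:C0:T0:L0 "
    "performance has improved. I/O latency reduced from 1310595 "
    "microseconds to 260642 microseconds.",
    "2013-02-21T23:30:39.275Z: [scsiCorrelator] 1657264714482us: "
    "[vob.scsi.device.io.latency.high] Device mpx.vmhba2:C0:T0:L0 "
    "performance has deteriorated. I/O latency increased from average "
    "value of 43610 microseconds to 1310310 microseconds.",
    "2013-02-21T23:30:39.275Z: [scsiCorrelator] 1657263440772us: "
    "[esx.problem.scsi.device.io.latency.high] Device mpx.vmhba2:C0:T0:L0 "
    "performance has deteriorated. I/O latency increased from average "
    "value of 43610 microseconds to 1310310 microseconds.",
]

for date, (count, before_us, after_us) in summarize(SAMPLE).items():
    print(f"{date} {count:6d} {before_us:12,.0f} {after_us:12,.0f}")
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `summarize(open("vobd.log"))`. Counting only one copy of each event would halve the counts but leave the averages unchanged.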
    

    At this point we are pretty much at a loss. The best answer we have is that since we are using SATA drives (which is probably a terrible idea) we are hitting a big bottleneck. We are planning on moving to a SAN with SAS drives, but we want to make sure the problem doesn't follow us.

    Thanks

    • ewwhite
      ewwhite over 11 years
      Supermicro... SATA... bummer.
    • Chopper3
      Chopper3 over 11 years
      @ewwhite seconded, like using a motorbike to transport tanks
    • funkaoshi
      funkaoshi over 11 years
      I think this question might be related to the problems you are having: serverfault.com/questions/231496/…. They are running multiple VMs on a single server, with two SAS drives, and seeing problems.