VMware ESXi 5 Slow performance, hundreds of I/O latency errors.
Solution 1
Honestly, you may have solved your own problem!
- You've identified the effects of the issue... and a possible source.
- You've verified that it can work on a similar setup.
- You've observed bad behavior on a single machine.
- You did NOT replace the chassis or backplane. Your issues probably lie there.
- You bought Supermicro, which does not have the same level of polish or quality-control consistency as IBM's, HP's, or Dell's offerings.
This happens. Replace the server and move on.
Solution 2
Not really an answer, but...
It's possible that your RAID controller switched to write-through mode. One common cause is a faulty BBU (or its learn cycle being in progress). This can reduce write performance dramatically.
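One way to check, sketched below under the assumption of an LSI/MegaRAID controller with the MegaCli tool installed (other vendors have equivalents such as hpacucli, arcconf, or tw_cli). The sample output line is hard-coded here so the logic is self-contained; on a real host you would feed it from `MegaCli -LDInfo -Lall -aALL`, and check the battery with `MegaCli -AdpBbuCmd -GetBbuStatus -aALL`:

```shell
# Sample "Current Cache Policy" line as MegaCli -LDInfo prints it
# (hard-coded here for illustration; pipe real MegaCli output instead).
ldinfo='Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU'

# WriteBack is the healthy mode; WriteThrough usually means the
# controller disabled its write cache, e.g. because of a bad BBU.
case "$ldinfo" in
  (*WriteThrough*) echo 'write cache disabled -- check the BBU' ;;
  (*WriteBack*)    echo 'write-back cache active' ;;
esac
```

If the policy line ends in "No Write Cache if Bad BBU", the fallback to write-through is automatic whenever the battery fails or enters a learn cycle, which matches the intermittent latency pattern described in the question.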
diomonogatari
Updated on September 18, 2022
Comments
-
diomonogatari, almost 2 years ago
We have a standalone ESXi 5 server with the following hardware specs:
- Supermicro X8DTL
- Intel Xeon CPU E5506 @ 2.13 GHz
- 25 GB RAM
- 1 TB HD (mirrored RAID, local SATA)
We have around 17 VMs running, with ~512 MB RAM each, running web and database servers.
Around a month ago the server crashed; on investigation we found errors similar to these in /scratch/log/vobd.log:
2013-02-21T23:30:14.054Z: [scsiCorrelator] 1657239493834us: [vob.scsi.device.io.latency.improved] Device mpx.vmhba2:C0:T0:L0 performance has improved. I/O latency reduced from 1310595 microseconds to 260642 microseconds.
2013-02-21T23:30:17.888Z: [scsiCorrelator] 1657243328201us: [vob.scsi.device.io.latency.improved] Device mpx.vmhba2:C0:T0:L0 performance has improved. I/O latency reduced from 260642 microseconds to 85292 microseconds.
2013-02-21T23:30:39.275Z: [scsiCorrelator] 1657264714482us: [vob.scsi.device.io.latency.high] Device mpx.vmhba2:C0:T0:L0 performance has deteriorated. I/O latency increased from average value of 43610 microseconds to 1310310 microseconds.
2013-02-21T23:30:39.275Z: [scsiCorrelator] 1657263440772us: [esx.problem.scsi.device.io.latency.high] Device mpx.vmhba2:C0:T0:L0 performance has deteriorated. I/O latency increased from average value of 43610 microseconds to 1310310 microseconds.
2013-02-21T23:30:42.796Z: [scsiCorrelator] 1657268235408us: [vob.scsi.device.io.latency.improved] Device mpx.vmhba2:C0:T0:L0 performance has improved. I/O latency reduced from 1310310 microseconds to 257850 microseconds.
2013-02-21T23:30:44.392Z: [scsiCorrelator] 1657269831493us: [vob.scsi.device.io.latency.improved] Device mpx.vmhba2:C0:T0:L0 performance has improved. I/O latency reduced from 257850 microseconds to 86289 microseconds.
2013-02-21T23:32:29.119Z: [scsiCorrelator] 1657374559512us: [vob.scsi.device.io.latency.high] Device mpx.vmhba2:C0:T0:L0 performance has deteriorated. I/O latency increased from average value of 43610 microseconds to 1405607 microseconds.
2013-02-21T23:32:29.120Z: [scsiCorrelator] 1657373285533us: [esx.problem.scsi.device.io.latency.high] Device mpx.vmhba2:C0:T0:L0 performance has deteriorated. I/O latency increased from average value of 43613 microseconds to 1405607 microseconds.
2013-02-21T23:32:35.673Z: [scsiCorrelator] 1657381113191us: [vob.scsi.device.io.latency.improved] Device mp
On the day of the crash we had almost 5,000 of these errors; since then we have had as few as 2 per day and as many as 500 (though no further full server crashes). On the guest VMs we are experiencing slow disk reads and writes during normal use. Something as simple as a find command on / causes large spikes in the performance chart.
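Per-day counts like these can be pulled straight out of the log with a short pipeline. This is a sketch: two sample lines are inlined here so it runs anywhere; on the host, point grep at /scratch/log/vobd.log instead.

```shell
# Inline two abbreviated sample log lines (on a real host, skip this
# and grep /scratch/log/vobd.log directly).
cat <<'EOF' > vobd.sample
2013-02-20T11:02:01.000Z: [scsiCorrelator] 1653456789012us: [vob.scsi.device.io.latency.high] Device mpx.vmhba2:C0:T0:L0 performance has deteriorated.
2013-02-21T23:30:39.275Z: [scsiCorrelator] 1657264714482us: [vob.scsi.device.io.latency.high] Device mpx.vmhba2:C0:T0:L0 performance has deteriorated.
EOF

# Total latency-deterioration events in the log
grep -c 'vob.scsi.device.io.latency.high' vobd.sample

# Events per day: the date is the first 10 characters of each timestamp
grep 'vob.scsi.device.io.latency.high' vobd.sample \
  | cut -c1-10 | sort | uniq -c
```

Watching how the daily count trends (as in the table below) is often more informative than any single event, since one bad day can drag the rolling averages up for weeks.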
We have replaced both HDs and the RAID controller. A server with an identical setup and a similar number of VMs does not have these issues. Before the first crash (the one with 5k errors) performance was fine, though the logs still showed the same error ~30-40 times a day. A few days before this crash we thin-provisioned a large (160 GB) virtual disk for a guest VM.
The following table shows, per day: the date, the number of times that error message appeared, the average of the latencies logged before the error, and the average after (both in microseconds, matching the units in the log):
DATE        COUNT  AVG_BEFORE  AVG_AFTER
2012-10-24  16     976         138,666
2012-10-28  12     1,020       40,421
2012-11-05  16     1,167       273,223
2012-11-06  20     1,226       89,181
2012-11-07  40     1,314       224,957
2012-11-08  48     1,378       165,349
2012-11-09  42     1,441       174,061
2012-11-10  26     1,519       218,381
2012-11-11  8      1,567       112,229
2012-11-12  24     1,593       233,350
2012-11-13  54     1,641       193,695
2012-11-14  80     1,692       222,456
2012-11-15  32     1,738       243,640
2012-11-16  66     1,776       325,366
2012-11-17  30     1,816       176,468
2012-11-18  38     1,850       264,176
2012-11-20  12     1,846       117,589
2012-11-21  34     1,868       252,732
2012-11-22  44     1,895       166,636
2012-11-23  12     1,926       123,632
2012-11-26  4      1,892       98,791
2012-11-27  14     1,899       184,382
2012-11-28  20     1,916       178,908
2012-11-29  10     1,923       134,338
2012-11-30  6      1,923       69,203
2012-12-01  2      1,924       60,052
2012-12-02  4      1,919       122,631
2012-12-03  8      1,898       126,051
2012-12-04  54     1,909       199,758
2012-12-05  462    2,109       394,950
2012-12-06  36     2,228       191,166
2012-12-07  64     2,245       204,348
2012-12-08  32     2,271       294,890
2012-12-10  140    2,290       302,435
2012-12-11  314    2,386       311,973
2012-12-12  150    2,475       261,258
2012-12-13  160    2,532       236,761
2012-12-14  114    2,585       206,043
2012-12-15  84     2,618       211,221
2012-12-16  52     2,640       256,677
2012-12-17  18     2,637       180,975
2012-12-18  62     2,649       228,785
2012-12-19  92     2,669       199,357
2012-12-20  160    2,707       275,119
2012-12-21  124    2,749       245,460
2012-12-22  2      2,763       102,838
2012-12-26  144    2,736       302,383
2012-12-27  140    2,776       292,725
2012-12-28  64     2,813       274,609
2012-12-30  106    2,811       231,112
2012-12-31  148    2,853       295,416
2013-01-01  12     2,881       204,615
2013-01-04  4      2,860       90,300
2013-01-09  246    2,849       279,765
2013-01-10  278    2,909       301,014
2013-01-11  242    2,966       294,417
2013-01-12  92     3,006       308,232
2013-01-14  248    3,036       271,435
2013-01-15  426    3,172       233,094
2013-01-16  388    3,313       276,185
2013-01-17  342    3,423       282,632
2013-01-18  298    3,517       255,919
2013-01-19  232    3,579       287,905
2013-01-20  8      3,611       128,877
2013-01-21  2      3,614       121,942
2013-01-22  142    3,667       265,338
2013-01-23  402    3,738       281,091
2013-01-24  332    3,826       280,295
2013-01-25  178    3,892       270,747
2013-01-26  280    4,018       319,368
2013-01-27  106    4,075       293,760
2013-01-28  610    4,187       213,410
2013-01-29  784    4,700       222,077
2013-01-30  386    5,236       258,133
2013-01-31  4580   8,261       1,681,902
2013-02-01  2      11,211      339,135
2013-02-02  10     38,909      1,200,144
2013-02-04  18     88,573      2,692,687
2013-02-05  190    67,454      2,094,093
2013-02-06  460    58,534      1,858,435
2013-02-07  98     57,683      1,795,912
2013-02-08  62     54,012      1,671,730
2013-02-09  88     52,681      1,711,773
2013-02-10  66     51,016      1,549,408
2013-02-11  84     48,885      1,639,267
2013-02-12  206    48,364      1,829,969
2013-02-13  562    48,651      1,774,433
2013-02-14  170    48,957      1,655,395
2013-02-15  124    47,055      1,550,294
2013-02-16  140    46,099      1,588,326
2013-02-17  110    45,283      1,485,211
2013-02-18  34     43,836      1,356,562
2013-02-19  326    43,608      1,484,757
2013-02-20  224    43,894      1,581,129
2013-02-21  296    43,626      1,568,687
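A summary like the table above can be rebuilt directly from vobd.log with awk. This is a sketch tied to the exact message wording shown in the log excerpt ("increased from average value of X microseconds to Y microseconds"); two sample lines are inlined so it runs anywhere, and on the host you would grep /scratch/log/vobd.log instead.

```shell
# Inline two real-format sample lines (replace with the actual log on a host).
cat <<'EOF' > vobd.sample
2013-02-21T23:30:39.275Z: [scsiCorrelator] 1657264714482us: [vob.scsi.device.io.latency.high] Device mpx.vmhba2:C0:T0:L0 performance has deteriorated. I/O latency increased from average value of 43610 microseconds to 1310310 microseconds.
2013-02-21T23:32:29.119Z: [scsiCorrelator] 1657374559512us: [vob.scsi.device.io.latency.high] Device mpx.vmhba2:C0:T0:L0 performance has deteriorated. I/O latency increased from average value of 43613 microseconds to 1405607 microseconds.
EOF

# Per day: event count, mean "before" latency, mean "after" latency (microseconds).
grep 'vob.scsi.device.io.latency.high' vobd.sample | awk '
{
  d = substr($0, 1, 10)                     # date part of the timestamp
  for (i = 1; i < NF; i++) {
    if ($i == "of") before[d] += $(i + 1)   # "average value of X microseconds"
    if ($i == "to") after[d]  += $(i + 1)   # "to Y microseconds."
  }
  n[d]++
}
END {
  for (d in n)
    printf "%s %d %.1f %.1f\n", d, n[d], before[d] / n[d], after[d] / n[d]
}' | sort
# -> 2013-02-21 2 43611.5 1357958.5
```

Plotting the daily count alongside the "before" average makes the step change at the end of January (when the baseline jumps from ~5 ms to ~45 ms) easy to see.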
At this point we are pretty much at a loss. The best answer we have is that since we are using SATA drives (which is probably a terrible idea) we are hitting a big bottleneck. We are planning to move to a SAN with SAS drives, but we want to make sure the problem doesn't follow us.
Thanks
-
ewwhite, over 11 years ago: Supermicro... SATA... bummer.
-
Chopper3, over 11 years ago: @ewwhite seconded, like using a motorbike to transport tanks
-
funkaoshi, over 11 years ago: I think this question might be related to the problems you are having: serverfault.com/questions/231496/…. They are running multiple VMs on a single server with two SAS drives, and seeing problems.