Poor write performance of software RAID10 array of 8 SSD drives
The measured low performance is the result of several factors:
- after creation, the array is entirely synced, causing the allocation of most (if not all) flash data pages on half of the SSDs. This puts the SSDs in a low-performance state until a secure erase / TRIM "frees" all/most/some pages, which explains the increased performance after an fstrim;
- the (default) 512 KB chunk size is too large for maximum sequential/streaming performance (as benchmarked with dd). With an all-SSD array I would select a 64 KB chunk size and, probably (but this should be confirmed with real-world tests), the "far" layout. Note that decreasing the chunk size, while beneficial for streaming accesses, can penalize random reads/writes. This is mainly a concern with HDDs, but even SSDs can be somewhat affected;
- by default, the Linux kernel issues I/Os of at most 512 KB. This means that even when asking dd to use 1 GB blocks (as in your first command), they will be split into a myriad of 512 KB-sized requests. Coupled with your 512 KB chunk size, each write request will engage a single SSD, basically capping streaming write performance at the single-SSD level and denying any potential speed increase from RAID. While you can use the max_sectors_kb tunable (found in /sys/block/sdX/queue/max_sectors_kb), values bigger than 512 KB can (in some configurations/kernel versions) be ignored;
- finally, while interesting and an obligatory first stop, dd itself is a poor benchmark: it only tests streaming performance at low (1) queue depth. Even with your current array config, a more comprehensive test such as fio would show a significant performance increase relative to a single-disk scenario, at least for random I/O (see the example after this list).
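For reference, a minimal fio invocation of the kind meant above could look like this; the test-file path, 4 KB block size, queue depth of 32 and 60 s runtime are illustrative choices for this sketch, not values taken from your setup:

# check the current per-device maximum request size (in KB);
# values above 512 may be ignored on some kernels, as noted above
cat /sys/block/sda/queue/max_sectors_kb

# random write test at queue depth 32: unlike dd, this keeps many
# requests in flight, engaging several RAID10 members at once
fio --name=randwrite --filename=/tmp/fio-testfile --size=1G \
    --rw=randwrite --bs=4k --ioengine=libaio --iodepth=32 \
    --direct=1 --runtime=60 --time_based --group_reporting

At iodepth=32 the kernel can keep requests outstanding against several members at once, which is exactly what a single-threaded, depth-1 dd run cannot do.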
What can you do to correct the current situation? First of all, you must accept wiping the disks/array; obviously, you need to take a backup as the first step. Then (a combined sketch of these commands follows the list):
- stop and delete the array (mdadm -S /dev/md2);
- trim all data blocks on every disk (blkdiscard /dev/sdX3);
- recreate the array with 64 KB chunks and the clean flag (mdadm --create /dev/md2 --level=10 --raid-devices=8 --chunk=64 --assume-clean /dev/sdX3);
- re-benchmark with dd and fio;
- if all looks good, restore your backup.
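Put together, a rough sketch of the whole sequence, assuming the member partitions shown in your mdadm --detail output (sda3, sdb3 and sdd3 through sdi3 active, sdj3 as spare):

# WARNING: this destroys all data on the array; take backups first
mdadm -S /dev/md2                       # stop and delete the array

# trim all data blocks on every member partition, spare included
for d in a b d e f g h i j; do
    blkdiscard /dev/sd${d}3
done

# recreate with 64 KB chunks; the glob expands to the 8 active members
mdadm --create /dev/md2 --level=10 --raid-devices=8 \
      --chunk=64 --assume-clean /dev/sd[abdefghi]3

The spare can then be re-attached with mdadm --add /dev/md2 /dev/sdj3.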
A last note about your SATA setup: splitting the disks between two controllers in this manner should clearly be avoided for maximum performance. That said, your write speed is so low that I would not blame your SATA controller. I would really recreate the array per the instructions above before buying anything new.
Evgeny Terekhov asked (updated on September 18, 2022):
I have a server with a Supermicro X10DRW-i motherboard and a RAID10 array of 8 KINGSTON SKC400S SSDs; the OS is CentOS 6.
# cat /proc/mdstat
Personalities : [raid10] [raid1]
md2 : active raid10 sdj3[9](S) sde3[4] sdi3[8] sdd3[3] sdg3[6] sdf3[5] sdh3[7] sdb3[1] sda3[0]
      3978989568 blocks super 1.1 512K chunks 2 near-copies [8/8] [UUUUUUUU]
      bitmap: 9/30 pages [36KB], 65536KB chunk
# mdadm --detail /dev/md2
/dev/md2:
        Version : 1.1
  Creation Time : Wed Feb 8 18:35:14 2017
     Raid Level : raid10
     Array Size : 3978989568 (3794.66 GiB 4074.49 GB)
  Used Dev Size : 994747392 (948.67 GiB 1018.62 GB)
   Raid Devices : 8
  Total Devices : 9
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Fri Sep 14 15:19:51 2018
          State : active
 Active Devices : 8
Working Devices : 9
 Failed Devices : 0
  Spare Devices : 1

         Layout : near=2
     Chunk Size : 512K

           Name : ---------:2 (local to host -------)
           UUID : 8a945a7a:1d43dfb2:cdcf8665:ff607a1b
         Events : 601432

    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync set-A   /dev/sda3
       1       8       19        1      active sync set-B   /dev/sdb3
       8       8      131        2      active sync set-A   /dev/sdi3
       3       8       51        3      active sync set-B   /dev/sdd3
       4       8       67        4      active sync set-A   /dev/sde3
       5       8       83        5      active sync set-B   /dev/sdf3
       6       8       99        6      active sync set-A   /dev/sdg3
       7       8      115        7      active sync set-B   /dev/sdh3

       9       8      147        -      spare   /dev/sdj3
I've noticed that write speed is just terrible, not even close to SSD performance.
# dd if=/dev/zero of=/tmp/testfile bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 16.511 s, 65.0 MB/s
Read speed is fine, though:
# hdparm -tT /dev/md2

/dev/md2:
 Timing cached reads:   20240 MB in 1.99 seconds = 10154.24 MB/sec
 Timing buffered disk reads: 3478 MB in 3.00 seconds = 1158.61 MB/sec
After some troubleshooting, I found out that I had probably messed up the storage configuration initially: the X10DRW-i has an Intel C610, which provides two separate SATA controllers, a 6-port SATA and a 4-port sSATA. So the disks in the array are connected to different controllers, and I believe this is the root cause of the poor performance. I have only one idea for fixing this: installing a PCIe SAS controller (probably an AOC-S3008L-L8E) and connecting the SSD drives to it.
So I would like to confirm the following:
- Am I right about the root cause, or should I double-check something?
- Will my solution work?
- If I reconnect the drives to the new controller, will my RAID and data survive? My research suggests so, as the UUIDs of the partitions will remain the same, but I just want to be sure.
Thanks to everyone in advance.
UPD:

iostat -x 1 while performing the dd test: https://pastebin.com/aTfRYriU

# hdparm /dev/sda

/dev/sda:
 multcount     = 16 (on)
 IO_support    = 1 (32-bit)
 readonly      = 0 (off)
 readahead     = 256 (on)
 geometry      = 124519/255/63, sectors = 2000409264, start = 0
# cat /sys/block/md2/queue/scheduler
none
Though AFAIK the scheduler is set on the physical drives:
# cat /sys/block/sda/queue/scheduler
noop anticipatory [deadline] cfq
smartctl -a (on devices, not partitions): https://pastebin.com/HcBp7gUH

UPD2:
# dd if=/dev/zero of=/tmp/testfile bs=1M count=1024 oflag=direct
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 14.389 s, 74.6 MB/s
UPD3:
I have just run fstrim on the / partition and it had some effect, but the write speed is still too low: 227 MB/s, 162 MB/s, 112 MB/s, 341 MB/s and 202 MB/s in five consecutive tests.

Comments:
- Mircea Vutcovici over 5 years: Hi, could you please run iostat -x 1 while you run the dd benchmark test? We will be able to see where the bottleneck is.
- Arlion over 5 years: Can you please post the hdparm output of one of the hard drives? Include the output of cat /sys/block/md2/queue/scheduler. Also add: for x in {a..h}; do smartctl -a /dev/sd${x}3 ; done
- Evgeny Terekhov over 5 years: @mircea-vutcovici thanks for your reply! I've updated my post.
- Evgeny Terekhov over 5 years: @arlion thanks for your reply! I've updated my post.
- Michael Hampton over 5 years: bs=1G seems wrong, try a more reasonable block size as @shodanshok suggested.
- TomTom over 5 years: OK, that IS low. Is the write cache enabled on your RAID controller?
- shodanshok over 5 years: @TomTom it is software RAID (mdraid), so the only cache is the disk's own private DRAM cache.
- Evgeny Terekhov over 5 years: @shodanshok thanks, it's about 75 MB/s, still too low; I added it to the initial post.
- Evgeny Terekhov over 5 years: @shodanshok the FS is ext4; losing data is not an option.
- wazoox over 5 years: Is the SATA controller configured properly, as AHCI, in the BIOS Setup?