RAID 10 Stripe Size for XenServer

Solution 1

I am going to try to sum up my comments into an answer. The bottom line is:

You should not tinker with the strip size unless you have good evidence that it will benefit your workload.

Reasoning:

  • For striping, you have to choose some strip size, and 64 KB is the default the manufacturer has chosen. As the manufacturer (LSI in this case, rebranded by Dell) has a shitload of experience running a vast number of setups with different RAID levels and workloads, you might just trust them to have chosen wisely.
  • 64 KB is likely to roughly match the average size of your requests in a virtualized environment (at least much more so than 256 KB or 1 MB) and thus be a good trade-off between latency and seek time optimizations¹.
  • Accurate model-driven predictions about application performance with varying strip sizes are close to impossible, due to the highly variable nature of workloads and the complexity of models that would have to take into account the different read-ahead and caching algorithms at different layers.

If you are determined to get this evidence, you can do so by running your typical load and some atypical load scenarios with different strip size configurations, gathering the data (I/O subsystem performance at the XenServer layer, backend server performance and response times at the application layer) and running it through a statistical evaluation, as sketched below. This, however, will be extremely time-consuming and is not likely to produce any groundbreaking results apart from "I might just as well have left it at the default values", so I would consider it a waste of resources.
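
For anyone who does attempt it, the evaluation step itself is simple once the numbers are collected. A minimal sketch of the kind of statistical comparison meant above (Python; the latency samples are made-up placeholders standing in for real measurements):

    # Hypothetical sketch: compare I/O latency samples gathered under the same
    # workload with two different strip-size configurations.
    # The sample values below are placeholders, not real measurements.
    from math import sqrt
    from statistics import mean, stdev

    latency_64k  = [6.1, 5.8, 6.4, 7.0, 5.9, 6.3]   # ms per interval (made up)
    latency_256k = [6.0, 6.2, 6.5, 6.8, 6.1, 6.4]   # ms per interval (made up)

    def summarize(name, samples):
        m, s = mean(samples), stdev(samples)
        ci = 1.96 * s / sqrt(len(samples))           # rough 95% confidence interval
        print(f"{name}: mean={m:.2f} ms, stdev={s:.2f} ms, 95% CI ~ +/-{ci:.2f} ms")

    summarize("64K strip", latency_64k)
    summarize("256K strip", latency_256k)
    # Heavily overlapping confidence intervals mean the strip size made no
    # measurable difference for this workload - the likely outcome.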


¹ If you assume a transfer rate of 100 MB/s for a single disk, it is rather easy to see that a kilobyte takes around 0.01 ms to read, thus 64 KB will have a reading latency of about 0.64 ms. Considering that the average "service time" of a random I/O request will typically be in the range of 5-10 ms, the reading latency is only a small fraction of the total wait time. On the other hand, reading 512 KB will take around 5 ms - which will matter for the "random small read" type of workload, considerably reducing the number of IOPS your array will be able to deliver in this specific case, by a factor of 1.5-2. A scenario with concurrent random large read operations would benefit, as larger block reads induce fewer time-consuming seeks, but you are very unlikely to see this scenario in a virtualized environment.
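
The arithmetic in this footnote is easy to reproduce. A quick sanity check (Python; the 100 MB/s transfer rate comes from the footnote, while the 7 ms average seek/rotation service time is an assumed figure used only for illustration):

    # Rough per-request timing model from the footnote (illustrative assumptions only).
    TRANSFER_MB_S = 100   # assumed sequential transfer rate of one disk
    SERVICE_MS = 7        # assumed average seek + rotational latency

    def request_time_ms(kb):
        """Total time for one random request reading `kb` kilobytes."""
        read_ms = kb / TRANSFER_MB_S   # 100 MB/s == 100 KB/ms, i.e. ~0.01 ms per KB
        return SERVICE_MS + read_ms

    for size in (16, 64, 512):
        t = request_time_ms(size)
        print(f"{size:>3} KB request: ~{t:.2f} ms (~{1000 / t:.0f} IOPS per disk)")

    # 64 KB adds well under 1 ms on top of the ~7 ms service time, while 512 KB
    # adds ~5 ms, cutting random IOPS by roughly a factor of 1.5-2.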

Solution 2

The general rule of thumb with RAID 10 is that smaller chunk sizes give you fast sequential transfers in a wider range of cases, while larger chunks provide higher IOPS and higher sequential speed in selected scenarios.

Your expected workload (virtual machines) is all about issuing small-to-medium (< 256 KB), pseudo-random requests. In other words, you need a RAID stripe configuration that minimizes response time and maximizes pseudo-random IOPS.

While 64K is a safe default value, I feel it is a little small for your expected workload. For example, consider the case where a VM wants to read/write a 128 KB data chunk. Depending on how your controller handles read requests, a 128 KB read will engage 2 or 4 disks, while a write request of that size will always engage all 4 of your disks. At the same time, due to the small size of the read/written chunk (128 KB), I/O performance will be dominated by seek time rather than sequential transfer, so your real transfer speed will be only a little faster than what a single disk can provide. This means that the other VMs have very little chance to use the disks, yet your array is providing only single-disk-like performance for the one VM that is actively using it.

For VM use, I would configure the array with 256K or 512K chunk sizes: this guarantees that small (< 256 KB) read requests will be served by a single disk (2 disks for writes), leaving the others available for other VMs. At the same time, large sequential transfers (> 256/512K) will still be very fast, as they will engage multiple disks.
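
To make the disk-engagement argument concrete, here is a small back-of-the-envelope helper (Python). The layout assumptions - a 4-disk RAID 10 treated as two striped mirror pairs, with reads served by one member of each touched pair and writes hitting both - follow the reasoning above and are a simplification of what a real controller does:

    # Rough estimate of how many disks a single aligned request touches in a
    # 4-disk RAID 10 (two mirror pairs striped together). Simplified model:
    # reads hit one member of each touched pair, writes hit both members.
    from math import ceil

    MIRROR_PAIRS = 2   # 4 disks in RAID 10 -> 2 striped mirror pairs

    def disks_engaged(request_kb, chunk_kb, write=False):
        chunks = ceil(request_kb / chunk_kb)   # stripe units touched
        pairs = min(chunks, MIRROR_PAIRS)      # distinct mirror pairs hit
        return pairs * (2 if write else 1)

    for chunk_kb in (64, 256, 512):
        r = disks_engaged(128, chunk_kb)
        w = disks_engaged(128, chunk_kb, write=True)
        print(f"{chunk_kb:>3}K chunks: 128 KB read -> {r} disk(s), write -> {w} disk(s)")

    # With 64K chunks a 128 KB request spans both mirror pairs; with 256K or
    # 512K chunks it stays on one pair, leaving the other free for other VMs.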

Comments

  • Reado
    Reado almost 2 years

    Below is our current server configuration. In a few weeks I will be simulating a disaster recovery by installing 5 new disks (1 hot spare) and restoring all VMs from the backups.

    Will I gain anything by changing the RAID stripe size to something other than 64KB? The RAID controller has options for 8KB, 16KB, 32KB, 64KB, 128KB, 256KB, 512KB, 1MB.

    Any recommendations based on the specification below would be greatly appreciated - thanks.

    Hardware:
    
    Dell PowerEdge 2900 III
    Dell PERC 6/i
    Intel Xeon 2.5GHz (x2)
    32GB RAM
    Seagate ST32000645SS ES.2 2TB Near-Line SAS 7.2K (x4)
    
    Software:
    
    Citrix XenServer 6.2 SP1
    VM - Windows SBS 2008 x64 - Exchange & multiple SQL express instances
    VM - Windows Server 2003 R2 x86 - single SQL express instance
    VM - CentOS 6.6 x64 (x2) - cPanel & video transcoding and streaming
    VM - CentOS 6.3 x86 - Trixbox (VoIP)
    VM - PHD Virtual Backup 6.5.3 (running Ubuntu 12.04.1 LTS)
    
    Configuration:
    
    RAID 10, 64k Stripe Size
    
  • the-wabbit
    the-wabbit over 9 years
    Special cases with sequential workloads aside, only a very insignificant amount of general I/O in a virtualization environment exceeds 32 KB per request, with typical request sizes being 4-16 KB. That being said, reading a 256 KB strip off a disk is only marginally slower than reading 16 KB, so I/O latency would not suffer much. By blindly reading more than was initially requested, the controller is implicitly implementing a dumb read-ahead algorithm, which might or might not prove advantageous in a particular scenario. I would not expect great differences for most workloads.
  • shodanshok
    shodanshok over 9 years
    While single application requests are often in the 4K range, the OS generally uses bigger sizes for reading, say, a program library or mmap-ed files. Moreover, being pseudo-random (rather than truly random) in nature, some requests can be coalesced and hence satisfied as a single, bigger request. Finally, for small requests it is the seek time that dominates I/O latency, not reading from the platter. After a seek, it is so cheap to read 256-512 KB of data that you should do it anyway, irrespective of chunk size (this is the read-ahead policy that disk firmware, controllers and OSes all use heavily).
  • the-wabbit
    the-wabbit over 9 years
    I am not talking about the application requests but about what shared storage sees in terms of requests from a number of virtual machines. It should be easy enough to verify - running iostat -xdm 60 or looking at sar -d statistics on the asker's XenServer host will most likely show an average request size (after merging) of ~16K most of the time. One of my backup machines with an extremely sequential workload has avgrq-sz values in the range of 300-500, but this is nothing you will ever see on shared storage in a virtualized environment.
  • Reado
    Reado over 9 years
    @the-wabbit, iostat reports 21.15 (sda), 17.57 (sda1), 0.00 (sda2), 21.27 (sda3). sar -d results in "Requested activities not available in file". So are you saying I should be decreasing the value to maybe 32K instead? Or should I be running iostat when the server is under load?
  • shodanshok
    shodanshok over 9 years
    @the-wabbit: sure, a big share of read/write requests falls in the 4-16 KB range. For this reason, I consider 64 KB a "safe bet" for chunk size. However, I've seen workloads (e.g. a Linux mailserver) where the mean r/w request size increases to >64 KB but <256 KB. In a similar scenario, a 64 KB chunk size, while reasonable, is suboptimal compared to a 256/512 KB chunk size (for the reasons explained above).
  • shodanshok
    shodanshok over 9 years
    @Reado: avgrq-sz (as reported by iostat) is in 512-byte units (see the short conversion sketch after these comments). This means that your workload falls squarely in the 4-16 KB range suggested by the-wabbit. In other words, a 64 KB or 256/512 KB chunk size will make almost no difference here. However, as stated in the comment above, I have not-so-rarely seen workloads where a bigger chunk size gives an advantage in system/VM responsiveness and IOPS.
  • the-wabbit
    the-wabbit over 9 years
    @Reado you absolutely should watch the numbers under load. But what I really am saying is that the impact of the strip size will likely be too minor to bother with in your setup.
  • Reado
    Reado over 9 years
    avgrq-sz is now hovering at 211 as the backups are running. The backups always run slowly. Could increasing the stripe size improve this?
  • Reado
    Reado over 9 years
    Correct me if I'm wrong, but am I right in thinking that 211 * 512 bytes / 1024 ≈ 105 KB? Therefore increasing to at least 128 KB should see an improvement?
  • the-wabbit
    the-wabbit over 9 years
    @Reado I'll try to rephrase: while it is possible that changing the strip size would be advantageous for some specific workloads, it would be disadvantageous for others. As "virtualization" is pretty much a mixed workload by definition, you are not going to gain much by deviating from the average. Outside effects like the controller's and the OS's read-ahead algorithms make exact prediction of results for any given setup rather difficult. There is no general rule of thumb for this - if you really need to know, run the same workload with different strip sizes and watch the numbers.
  • shodanshok
    shodanshok over 9 years
    While I agree that a real-case test is the only accurate method to select the best chunk size, the RAID10 penalty does not work as you described. 1) Even with a large chunk size, the OS/controller does not need to read the entire data chunk to retrieve, say, a 16 KB block. This means that random read IOPS are basically unaffected by chunk size unless the chunk size is too small. 2) Read-ahead is a different thing, not directly correlated to chunk size. Anyway, OSes are very aggressive at read-ahead (often reading megabytes of additional data). Even disk firmware often uses a 64-256K read-ahead window.
  • shodanshok
    shodanshok over 9 years
    To reiterate point 1: here I did an in-depth analysis of RAID layouts. As you can see from the first graph, with a single data stream (that, being 4K in nature, hammered one disk at a time) a single 7200 RPM disk in a RAID10 array with a 512 KB chunk size gives about 130 IOPS. Repeating the same test with, say, a 64 KB chunk size is not going to show any improvement (nor deficit, for that matter).
  • Reado
    Reado over 9 years
    +1 Some good knowledge has been shared here! However I think I'll stick to 64KB as I now no longer see any reason why it should be different. Thanks for taking the time to answer my query.
  • the-wabbit
    the-wabbit over 9 years
    @shodanshok you are right, technically it would not be necessary to read the entire stripe unless data validation involving a checksum over the entire strip were to be performed in the same step. I could not find any claims that LSI's controllers do this; I know for sure that ZFS does.
  • shodanshok
    shodanshok over 9 years
    ZFS is an entirely different matter. As total IOPS in a ZFS array remain at the single-disk level, a smaller chunk size can be beneficial.
  • Reado
    Reado over 9 years
    Found this on the Dell website - "Higher stripe sizes create less physical I/O operations and improve PCI-Express bus utilization. 512K was selected as it is commonly used in Linux® and was determined to have the best combination of benefits for large I/O and the least detriment for small I/O on this controller across multiple operating systems."
  • the-wabbit
    the-wabbit over 9 years
    @Reado good find. I am not sure, however, if everything in this article can be taken at face value. Further down, in a section about RAID10, it states "for reads we will utilize only half the disks in the array since the other half are just mirrored copies", which is incorrect - the MegaRAIDs of course would distribute read requests evenly among all members of a mirror. The article does not cite any sources, either.
  • the-wabbit
    the-wabbit over 9 years
    @shodanshok I believe you are referring to the ZFS RAID-Z penalty for random small reads. But the checksum (Fletcher or SHA) verification is indeed always performed with ZFS, even in a single-disk pool, so an entire block will be read either way. In a spanned pool of mirrored vdevs this would mean at least block-sized reads for every request. As I wrote, I could not find any references to data integrity features in mirror sets for the MegaRAID.
  • shodanshok
    shodanshok over 9 years
    @the-wabbit 512 KB is also the current default of Linux mdadm. In the past, Linux software RAID used a 64 KB chunk size that, while reasonable and "safe", was deemed too small by Red Hat, which changed it (inside anaconda) to use, you know, a 512 KB chunk ;) About the MegaRAID controller: a single read request will probably engage a single disk from the mirror pair, but multiple read requests should be served by multiple disks. So, in a 4-disk RAID10 setup, a single (even if big) read will engage only 2 of the 4 disks (the striped ones), but the other 2 remain available for other read requests.
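
For reference, the avgrq-sz conversion discussed in the comments above (iostat reports the value in 512-byte sectors) takes only a couple of lines; a small sketch using the figures quoted in this thread:

    # Convert iostat's avgrq-sz (reported in 512-byte sectors) to kilobytes.
    SECTOR_BYTES = 512

    def avgrq_sz_to_kb(sectors):
        return sectors * SECTOR_BYTES / 1024

    for label, sectors in [("sda, normal load", 21.15), ("during backups", 211)]:
        print(f"{label}: avgrq-sz {sectors} -> ~{avgrq_sz_to_kb(sectors):.1f} KB per request")

    # 21.15 sectors is roughly 10.6 KB (inside the 4-16 KB range mentioned above),
    # while 211 sectors is roughly 105.5 KB during the sequential backup workload.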