Linux's md driver is under active development, and documentation and performance numbers are lagging behind it. I recently (mid 2006) built a 4-drive software RAID 5 array under Linux. Here are some performance numbers I obtained, along with the methodology used to get them, plus some RAID 6 performance numbers and the beginnings of an investigation into performance versus the number of drives in an array.
First, the test system setup:
- MSI K8N Neo4 Platinum motherboard
- AMD Athlon64 3000+ CPU
- Crucial 1GB DDR1-400 DIMM
- 4 x Seagate 7200.9 300GB SATA HDDs
- Fedora Core 5 x86-64 Linux distribution
- uname: 2.6.15-1.2054_FC5 #1 SMP Tue Mar 14 15:48:20 EST 2006 x86_64 x86_64 x86_64 GNU/Linux
- md driver / mdadm (no raidtools package)
Tests were performed with the drives attached to the Silicon Image controller (which is bottlenecked by sitting on the PCI bus), attached to the nForce chipset, and split between the two. Where a benchmark doesn't say which setup was used, the drives were attached to the chipset's SATA ports.
Array created with:
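Assuming the drives appeared as /dev/sda through /dev/sdd and the default chunk size was kept, the command looks something like:
mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sda /dev/sdb /dev/sdc /dev/sdd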
Formatted with an ext3 filesystem with the default number of inodes, ext3 with minimum inodes, and xfs:
mkfs.ext3 -N 100 -m 0 /dev/md0
mkfs.xfs /dev/md0
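For the default-inode ext3 case, nothing beyond a plain invocation is needed:
mkfs.ext3 /dev/md0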
Write tests were performed by writing a 20GB file, large enough to minimise caching effects with 1GB of RAM. Read tests were performed on the just-written 20GB file.
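Something along the lines of the following, with the /mnt/md0 mount point and the 1MB block size as stand-ins, matches the method:
time dd if=/dev/zero of=/mnt/md0/bigfile bs=1M count=20000
time dd if=/mnt/md0/bigfile of=/dev/null bs=1M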
Sometimes array rebuilds and resyncs are very slow, for no good reason. This seems to be due to md trying to minimise the effect of its rebuilds upon actual user interaction with the array. Rebuild speed is controlled by two parameters: /proc/sys/dev/raid/speed_limit_min and speed_limit_max. I would expect the rebuild to proceed at max speed unless other I/Os are being slowed down, in which case it would slow as far as the min speed in an effort to serve user requests quickly. In practice, it appears to rebuild at the minimum speed in all cases except during initial array creation. To fix this, increase the minimum (target) rebuild speed:
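The limit is in KB/sec per drive; the exact value is a matter of taste, so the 100000 below is just an example floor, high enough that the rebuild is effectively unthrottled on this hardware:
echo 100000 > /proc/sys/dev/raid/speed_limit_min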
Normal array usage will involve small numbers of large sequential reads or writes, so I tuned the array for such access patterns by increasing the readahead value to 2MB, which gave me large sequential reads and writes at greater than gigabit speed. No other tuning was performed.
.../root]# blockdev --getra /dev/md0
768
.../root]# blockdev --setra 2048 /dev/md0
First, benchmarks of filesystem creation speed on an undegraded, unloaded array:
ext3 with the minimum number of inodes:

| | Trial 0 | Trial 1 | Trial 2 | Trial 3 | Trial 4 | Mean |
|---|---|---|---|---|---|---|
| real | 0m50.416s | 0m49.943s | 0m50.195s | 0m49.916s | 0m50.171s | 0m50.128s |
| user | 0m0.072s | 0m0.084s | 0m0.084s | 0m0.096s | 0m0.108s | 0m0.089s |
| sys | 0m0.664s | 0m0.756s | 0m0.680s | 0m0.740s | 0m0.696s | 0m0.707s |

ext3 with the default number of inodes:

| | Trial 0 | Trial 1 | Trial 2 | Trial 3 | Trial 4 | Mean |
|---|---|---|---|---|---|---|
| real | 5m11.777s | 5m12.400s | 5m11.898s | 5m11.567s | 5m11.021s | 5m11.733s |
| user | 0m0.440s | 0m0.400s | 0m0.440s | 0m0.420s | 0m0.384s | 0m0.416s |
| sys | 0m18.665s | 0m18.637s | 0m18.741s | 0m18.569s | 0m18.793s | 0m18.681s |

xfs:

| | Trial 0 | Trial 1 | Trial 2 | Trial 3 | Trial 4 | Mean |
|---|---|---|---|---|---|---|
| real | 0m3.721s | 0m3.740s | 0m3.705s | 0m3.783s | 0m4.140s | 0m3.818s |
| user | 0m0.000s | 0m0.004s | 0m0.000s | 0m0.004s | 0m0.000s | 0m0.002s |
| sys | 0m0.828s | 0m0.900s | 0m0.956s | 0m1.044s | 0m0.892s | 0m0.924s |
The ext3 filesystem with minimum inodes (256/GiB, or 214656 total) and the xfs filesystem (which dynamically allocates inodes) both completed quickly, though xfs was the clear winner. ext3 with the default number of inodes (131328/GiB, 110102112 total) wrote 13GB of inodes and took several minutes to complete.
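A quick back-of-the-envelope check, assuming ext3's usual 128-byte inodes, shows where the time goes:
- default inodes: 110102112 inodes x 128 bytes ≈ 13GiB of inode tables written at mkfs time
- minimum inodes: 214656 inodes x 128 bytes ≈ 26MiB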
Untuned filesystem performance was fairly low. I only tested ext3 completely, but the limited xfs results I have were in line with these numbers:
| Configuration | Large reads | Large writes |
|---|---|---|
| 4 drives on SiI3114 controller | 70MB/sec | 35MB/sec |
| 2 drives on SiI; 2 on nVidia controller | 70MB/sec | 60MB/sec |
| 4 drives on nVidia controller | 75MB/sec | 60MB/sec |
Readahead tuning shows 2MB to be a good size for my workloads. Yours may vary somewhat, but the default 768k seems too small. (Chart: sustained transfer rates vs. readahead value.)

Though I do not have specific benchmark numbers, one interesting side effect of the Silicon Image controller being bottlenecked (by the PCI bus) at about 100MB/sec of throughput across all 4 drives is that writes to a degraded 3-drive array on that controller are faster than writes to the full array. When writing to the degraded array, only 3/4 of the data-plus-parity traffic needs to be pushed through the PCI bus; the chunk destined for the missing drive is simply discarded, so writes are about 1/3 faster. Reads occur at the same speed, since the CPU can reconstruct the missing data chunk from the two surviving data chunks and the parity chunk as quickly as that chunk could have been read were the drive present in the array.
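A rough worked example, assuming the ~100MB/sec PCI ceiling is the only limit:
- full 4-drive array: every 3 chunks of user data generate 4 chunks of bus traffic, so user-visible writes top out around 100 x 3/4 = 75MB/sec
- degraded 3-drive array: the chunk destined for the missing drive never crosses the bus, so user-visible writes can approach the full 100MB/sec, i.e. about 4/3 as fast (roughly 1/3 faster)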
Update: Just a few quick numbers from a similar setup (different board, but same chipset and PCI SATA controller, using 320GB Seagate ST3320620AS drives). 4 drives on SiI3114, 2 drives on nForce4 southbridge: 75MB/sec writes, 140MB/sec reads. Reads would be faster with 4 drives on the chipset's ports and only 2 stuck behind the PCI bus. I may get those numbers tomorrow while I wait for longer SATA cables to connect the last two drives.
Update: On a 2.0GHz Athlon 64, my best RAID 5 checksumming speed is with the pIII_sse code: 6121.000 MB/sec. On the same system, RAID 6 checksumming using mmxx2 reaches 3005 MB/sec, though md insists on using the slower sse2x2 at 2127 MB/sec.
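These numbers come from the benchmarking pass the raid modules run as they load; they can be pulled back out of the kernel log with something like:
dmesg | grep -i raid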