The Monkey House

Linux software RAID 5 performance


Linux's md driver is under active development, and documentation and performance numbers are lagging behind. I recently (mid 2006) built a 4-drive software RAID 5 array under Linux. Here are some performance numbers I obtained, and some methodology to go with them, along with some RAID 6 performance numbers and the beginnings of an investigation of performance versus number of drives in an array.

First, the test system setup:

Tests were performed in three configurations: all drives attached to the Silicon Image controller (which is bottlenecked by being a PCI device), all drives attached to the nForce chipset, and drives split between the two. Where a benchmark does not say which setup was used, the drives were attached to the chipset's SATA ports.

Array created with:

mdadm --create --verbose /dev/md0 --level=5 --raid-devices=4 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
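
As a rough sanity check on the size of the resulting array, usable space is the member size times the number of members minus the parity overhead (one member for RAID 5, two for RAID 6). The sketch below only does that arithmetic; the function name and the 300GB member size are illustrative assumptions, not figures from the setup above.

```shell
# Usable capacity of an md parity array, in whatever unit the member size uses.
# $1: RAID level (5 or 6), $2: number of member devices, $3: size of one member.
md_usable() {
    level=$1; devices=$2; member_size=$3
    case "$level" in
        5) parity=1 ;;   # one member's worth of space holds parity
        6) parity=2 ;;   # two members' worth of space hold parity
        *) echo "unsupported level: $level" >&2; return 1 ;;
    esac
    if [ "$devices" -le "$parity" ]; then
        echo "level $level needs more than $parity devices" >&2
        return 1
    fi
    echo $(( (devices - parity) * member_size ))
}

# Hypothetical example: four 300GB members.
md_usable 5 4 300   # prints 900
md_usable 6 4 300   # prints 600
```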

Formatted with an ext3 filesystem with the default number of inodes, ext3 with minimum inodes, and xfs:

mkfs.ext3 -m 0 /dev/md0
mkfs.ext3 -N 100 -m 0 /dev/md0
mkfs.xfs /dev/md0

Write tests performed by writing a 20GB file of zeros (large enough to minimise caching effects with 1GB of RAM):

dd if=/dev/zero of=/mnt/reallybig bs=1M count=20480

Read tests performed on the just-written 20GB file:

dd if=/mnt/reallybig of=/dev/null oflag=append bs=1M
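
Turning a timed dd run into a throughput figure is simple division. A minimal sketch, assuming the elapsed time comes from wrapping the dd commands above in time; the helper name and the 273-second figure are made up for illustration.

```shell
# Convert a transfer size in MiB and an elapsed time in seconds to MiB/sec.
# $1: MiB transferred (20480 for the 20GB test file), $2: elapsed seconds.
throughput_mib_s() {
    awk -v mib="$1" -v secs="$2" 'BEGIN {
        if (secs <= 0) { print "elapsed time must be positive" > "/dev/stderr"; exit 1 }
        printf "%.1f\n", mib / secs
    }'
}

# Hypothetical example: the 20GB file transferred in 273 seconds.
throughput_mib_s 20480 273   # prints 75.0
```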

Sometimes array rebuilds and resyncs are very slow, for no good reason. This seems to be due to md trying to minimise the effect of its rebuilds upon actual user interaction with the array. Rebuild speed is controlled by two parameters: /proc/sys/dev/raid/speed_limit_min and speed_limit_max. I would expect the rebuild to proceed at max speed unless other I/Os are being slowed down, in which case it would slow as far as the min speed in an effort to serve user requests quickly. In practice, it appears to rebuild at the minimum speed in all cases except during initial array creation. To fix this, increase the minimum (target) rebuild speed:

echo 50000 > /proc/sys/dev/raid/speed_limit_min
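
Since this setting is system-wide, it can be handy to note the old value before changing it so it can be put back after the rebuild. A small sketch under that assumption; the function name is hypothetical, and the second argument exists only so the helper can be tried against an ordinary file instead of the real /proc entry.

```shell
# Raise md's minimum rebuild speed and print the previous value.
# $1: new minimum in KB/sec
# $2: control file (defaults to the real /proc path; writing there needs root)
set_rebuild_min() {
    new_min=$1
    ctl=${2:-/proc/sys/dev/raid/speed_limit_min}
    old_min=$(cat "$ctl") || return 1
    echo "$new_min" > "$ctl" || return 1
    echo "$old_min"
}

# Typical use, as root:
#   old=$(set_rebuild_min 50000)
#   ... wait for the rebuild to finish (watch /proc/mdstat) ...
#   set_rebuild_min "$old" > /dev/null
```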

Normal array usage will involve small numbers of large sequential reads or writes, so I tuned the array for such access patterns by increasing the readahead value to 2MB, which gave me large sequential reads and writes at greater than gigabit speed. No other tuning was performed.

.../root]# blockdev --getra /dev/md0
768
.../root]# blockdev --setra 2048 /dev/md0
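
The blockdev man page gives the --getra and --setra values in 512-byte sectors, so it can help to convert them before comparing against sizes quoted in KB or MB. A minimal sketch of that conversion; the function name is an assumption.

```shell
# Convert a readahead value in 512-byte sectors (blockdev's unit) to KiB.
ra_sectors_to_kib() {
    echo $(( $1 * 512 / 1024 ))
}

ra_sectors_to_kib 768    # prints 384
ra_sectors_to_kib 2048   # prints 1024
```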

First, benchmarks of filesystem creation speed on an undegraded, unloaded array:

time mkfs.ext3 -N 100 -m 0 /dev/md0
       Trial 0     Trial 1     Trial 2     Trial 3     Trial 4     Mean
real   0m50.416s   0m49.943s   0m50.195s   0m49.916s   0m50.171s   0m50.128s
user   0m0.072s    0m0.084s    0m0.084s    0m0.096s    0m0.108s    0m0.089s
sys    0m0.664s    0m0.756s    0m0.680s    0m0.740s    0m0.696s    0m0.707s

time mkfs.ext3 -m 0 /dev/md0
       Trial 0     Trial 1     Trial 2     Trial 3     Trial 4     Mean
real   5m11.777s   5m12.400s   5m11.898s   5m11.567s   5m11.021s   5m11.733s
user   0m0.440s    0m0.400s    0m0.440s    0m0.420s    0m0.384s    0m0.416s
sys    0m18.665s   0m18.637s   0m18.741s   0m18.569s   0m18.793s   0m18.681s

time mkfs.xfs /dev/md0
       Trial 0     Trial 1     Trial 2     Trial 3     Trial 4     Mean
real   0m3.721s    0m3.740s    0m3.705s    0m3.783s    0m4.140s    0m3.818s
user   0m0.000s    0m0.004s    0m0.000s    0m0.004s    0m0.000s    0m0.002s
sys    0m0.828s    0m0.900s    0m0.956s    0m1.044s    0m0.892s    0m0.924s
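
The Mean column in tables like these is easy to recompute from the per-trial times. A minimal sketch using awk, taking the seconds part of each trial as an argument; the helper name is an assumption.

```shell
# Arithmetic mean of the arguments, printed to three decimal places.
mean() {
    printf '%s\n' "$@" | awk '{ sum += $1 } END { if (NR) printf "%.3f\n", sum / NR }'
}

# "real" times, in seconds, from the minimum-inode ext3 trials:
mean 50.416 49.943 50.195 49.916 50.171   # prints 50.128
```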

The ext3 filesystem with minimum inodes (256/GiB, or 214656 total) and the xfs filesystem (which dynamically allocates inodes) both completed quickly, though xfs was the clear winner. ext3 with the default number of inodes (131328/GiB, 110102112 total) wrote 13GB of inodes and took several minutes to complete.
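
The gap between the two ext3 runs follows from how much inode table each has to write out. A rough sketch of the arithmetic, assuming 128-byte inodes (an assumption on my part about the ext3 default of the time; pass a second argument to try another size):

```shell
# Approximate size of the on-disk inode tables, in whole MiB.
# $1: total inode count, $2: bytes per inode (assumed 128 if omitted).
inode_table_mib() {
    inodes=$1
    inode_size=${2:-128}
    echo $(( inodes * inode_size / 1048576 ))
}

inode_table_mib 110102112   # default inode count: prints 13440 (about 13GiB)
inode_table_mib 214656      # minimum inode count: prints 26
```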

Untuned filesystem performance was fairly low. I tested completely only with ext3, but limited xfs results were in line with these:

Drive placement                            Large reads   Large writes
4 drives on SiI3114 controller             70MB/sec      35MB/sec
2 drives on SiI; 2 on nVidia controller    70MB/sec      60MB/sec
4 drives on nVidia controller              75MB/sec      60MB/sec

Readahead tuning shows 2MB to be a good size for my workloads. Yours may vary somewhat, but the default 768k seems too small. Sustained transfer rates v. readahead value:

[Figure: RAID 5 sustained read speed v. array readahead value]
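
A sweep like this can be scripted by setting each readahead value in turn and re-running the read test. The sketch below is one way to do it, not the exact procedure used here: the function name, the SWEEP_EXECUTE switch, and the list of values are all illustrative. By default it only prints the commands it would run, since actually running them needs root and a real array.

```shell
# Print (or, with SWEEP_EXECUTE=1, run) a readahead sweep over a block device.
# $1: device, $2: large test file on that device, remaining args: readahead values.
readahead_sweep() {
    dev=$1; file=$2; shift 2
    for ra in "$@"; do
        if [ "${SWEEP_EXECUTE:-0}" = "1" ]; then
            blockdev --setra "$ra" "$dev" || return 1
            echo "readahead=$ra"
            # dd reports its transfer summary on stderr; keep the last line
            dd if="$file" of=/dev/null bs=1M 2>&1 | tail -n 1
        else
            echo "blockdev --setra $ra $dev && dd if=$file of=/dev/null bs=1M"
        fi
    done
}

# Dry run over a hypothetical set of values:
readahead_sweep /dev/md0 /mnt/reallybig 256 512 1024 2048 4096
```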

Though I do not have specific benchmark numbers, one interesting side effect of the Silicon Image controller being bottlenecked (due to the PCI bus) at about 100MB/sec throughput across all 4 drives is that writes to a degraded 3-drive array on the controller are faster than writes to the full array. When writing to the degraded array, only 3/4 of the data needs to be pushed through the PCI bus and the last quarter is discarded, so writes are about 1/3 faster. Reads occur at the same speed since the CPU is fast enough to reconstruct the third data stripe from the first two data stripes and the parity stripe at the same speed as the third data stripe could be read, were it present in the array.
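
Both halves of that argument can be sketched with a little shell arithmetic. This is an idealised model, assuming the shared bus is the only bottleneck and that writes are full-stripe; the function name and the byte values are illustrative, and the 100MB/sec figure is the approximate PCI limit quoted above.

```shell
# User-data write rate when every chunk written must cross one shared bus.
# $1: bus throughput in MB/sec, $2: devices in the array, $3: devices present.
# Each stripe carries (devices - 1) chunks of user data, but only the chunks
# destined for devices that are present are actually sent over the bus.
raid5_write_rate() {
    bus=$1; total=$2; present=$3
    echo $(( bus * (total - 1) / present ))
}

raid5_write_rate 100 4 4   # full array: prints 75
raid5_write_rate 100 4 3   # degraded array: prints 100, i.e. 1/3 faster

# Reads from the degraded array: parity is the XOR of the data chunks, so a
# missing chunk is the XOR of the surviving chunks and the parity.
d1=165; d2=60; d3=15                  # one byte from each data chunk
parity=$(( d1 ^ d2 ^ d3 ))
recovered=$(( d1 ^ d2 ^ parity ))     # rebuild d3 without reading it
echo "$recovered"                     # prints 15
```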

Update: Just a few quick numbers from a similar setup (different board, but same chipset and PCI SATA controller, using 320GB Seagate ST3320620AS drives). 4 drives on SiI3114, 2 drives on nForce4 southbridge: 75MB/sec writes, 140MB/sec reads. Reads would be faster with 4 drives on the chipset's ports and only 2 stuck behind the PCI bus. I may get those numbers tomorrow while I wait for longer SATA cables to connect the last two drives.

Update: On a 2.0GHz Athlon 64, my best RAID 5 checksumming speed is with the pIII_sse code: 6121.000 MB/sec. On the same system, RAID 6 checksumming using mmxx2 reaches 3005 MB/sec, though md insists on using the slower sse2x2 at 2127 MB/sec.

Version 0.7    |    Content date: 2006-06-30    |    Page last generated: 2013-11-03 15:26 CST