The discussion on using SAN vs. DASD based storage is nearly a religious war (as can be seen in a lot of discussions on pgsql-performance) and in many ways similar to the infamous emacs vs. vi debate.
From personal experience I have found the IBM DS4300 and IBM DS4300 Turbo (basically the same as the DS4300 but with more memory/cache and a hefty markup in price) to be quite a reliable and basically maintenance-free solution.
However, for some workloads those types of SAN are not really appropriate. A DS4300 (which is now withdrawn from marketing) can do only a bit above 100MB/s of sequential IO per controller (nearly independent of the number of disks!) and about 135MB/s with both controllers used together, which is really not much when one considers how fast modern hard drives are.
I recently got a SAN array to play with that looks quite interesting: while expensive, it still seems reasonably priced compared to what companies like IBM or others want for similar gear.
The array I got for testing is basically a non-branded LSI/Engenio 3994 with 16 2Gbit 10k 146GB FC drives and 2GB of battery backed cache per controller.
It is directly connected via two QLogic QLE2460 PCI-Express adapters to an HP DL380 G5 running CentOS 5 for testing.
The first impression of the array is a solid one - it looks very familiar for people that are used to the IBM DS4000 storage line and the Management GUI is basically the same (with an Engenio logo in place of the IBM one).
The controller chassis can hold 16 disks (up from 14 in the older designs) in 3U. The available expansion enclosures have the same capacity and dimensions; up to 6 are supported, and they can be added online (untested!) without disruption to ongoing IO.
Because the disks are only capable of 2Gbit/s, the speed of the two drive channel loops is also limited to 2Gbit/s (with 4Gbit FC drives the drive channels can be configured to run at 4Gbit/s).
The following is not meant as a thorough benchmark of either the array or PostgreSQL, but rather as some ad-hoc testing and playing around to get an impression of the overall performance characteristics of the device. All tests are done using ext3 in the default journaling mode (I'm fully aware that other file systems, especially XFS, might provide noticeably better streaming performance, but I have a much higher level of trust in ext3 and it is my choice in production environments).
In the following test (test case 1) we use two volume groups, each a RAID10 over 8 disks, striped together with a software RAID0 in the OS, and with write cache mirroring between the controllers enabled (this keeps both controller caches in sync, so if one controller fails the other one can take over). To utilize both controllers, the HBAs are set up so that controller A is using one HBA and controller B the other.
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
convm004        16G 51121  98 188347  76 97426  28 58961  98 378240  38 732.5   2
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                512 19599  94 257598 100  8272  27 19270  91 331205  99  4879  17
convm004,16G,51121,98,188347,76,97426,28,58961,98,378240,38,732.5,2,512,19599,94,257598,100,8272,27,19270,91,331205,99,4879,17
And the same with write cache mirroring disabled for both logical volumes (test case 2):
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
convm004        16G 51372  99 235020  96 122183  35 58880  98 369848  37 723.0   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                512 19888  95 256732  99 10037  32 19704  93 332286  99  5541  19
convm004,16G,51372,99,235020,96,122183,35,58880,98,369848,37,723.0,1,512,19888,95,256732,99,10037,32,19704,93,332286,99,5541,19
So write cache mirroring seems to carry a penalty of about 20% for sequential block writes and rewrites (188347 vs. 235020 K/sec and 97426 vs. 122183 K/sec) but has little impact on the other results, so it might be worth keeping it turned on for the additional data integrity guarantees it provides.
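The penalty can be checked directly from the two CSV lines bonnie++ printed for test cases 1 and 2. The following is only a minimal sketch: the column labels in `FIELDS` and the helper names are my own and not part of bonnie++, and the only thing assumed about the format is that a 1.03 CSV line carries the 27 fields seen above.

```python
# Column labels are hypothetical names for the 27 comma-separated fields
# of a bonnie++ 1.03 CSV line, in the order they appear in the output above.
FIELDS = [
    "machine", "size",
    "out_chr", "out_chr_cpu", "out_block", "out_block_cpu",
    "rewrite", "rewrite_cpu",
    "in_chr", "in_chr_cpu", "in_block", "in_block_cpu",
    "seeks", "seeks_cpu", "files",
    "seq_create", "seq_create_cpu", "seq_read", "seq_read_cpu",
    "seq_delete", "seq_delete_cpu",
    "rnd_create", "rnd_create_cpu", "rnd_read", "rnd_read_cpu",
    "rnd_delete", "rnd_delete_cpu",
]

# CSV lines copied from test case 1 (mirroring on) and test case 2 (mirroring off).
MIRRORING_ON = ("convm004,16G,51121,98,188347,76,97426,28,58961,98,378240,38,"
                "732.5,2,512,19599,94,257598,100,8272,27,19270,91,331205,99,4879,17")
MIRRORING_OFF = ("convm004,16G,51372,99,235020,96,122183,35,58880,98,369848,37,"
                 "723.0,1,512,19888,95,256732,99,10037,32,19704,93,332286,99,5541,19")


def parse_bonnie_csv(line):
    """Turn one bonnie++ CSV line into a dict; numeric fields become floats."""
    values = line.strip().split(",")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    row = dict(zip(FIELDS, values))
    for name in FIELDS[2:]:
        row[name] = float(row[name])
    return row


def penalty_percent(with_mirroring, without_mirroring):
    """Percent by which the mirrored run is slower; negative means faster."""
    return 100.0 * (without_mirroring - with_mirroring) / without_mirroring


def mirroring_penalties(metrics=("out_block", "rewrite", "in_block", "seeks")):
    """Compare the chosen metrics between the two runs."""
    on = parse_bonnie_csv(MIRRORING_ON)
    off = parse_bonnie_csv(MIRRORING_OFF)
    return {m: penalty_percent(on[m], off[m]) for m in metrics}


if __name__ == "__main__":
    for metric, pct in mirroring_penalties().items():
        print(f"{metric:10s} {pct:+6.1f}%")
```

Block writes and rewrites both come out at roughly 20%, while block reads and seeks differ by only a few percent in either direction, which is well within what I would expect as noise for ad-hoc runs like these.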
It further seems that the device is bottlenecked by the speed of the drive channels, due to the 2Gbit disks (there are two loops in the array head, with half the drives on one and the other half on the other).
But it also shows that the device seems to scale fairly well, at least for RAID10, until it hits the bandwidth limit.
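To put a rough number on the drive channel limit, here is a back-of-the-envelope sketch. Two things in it are assumptions of mine and not measured on this array: that a 2Gbit/s Fibre Channel loop carries about 200MB/s of payload, and that bonnie++ "K/sec" means KiB per second. It also only looks at reads, since the back-end traffic for RAID10 writes is harder to model.

```python
# Assumption: nominal payload rate of one 2 Gbit/s Fibre Channel loop.
FC_2G_PAYLOAD_MB_S = 200.0
# From the text: two drive channel loops in the array head.
DRIVE_LOOPS = 2


def kib_to_mb(kib_per_sec):
    """Convert bonnie++ K/sec (assumed KiB/s) to decimal MB/s."""
    return kib_per_sec * 1024 / 1_000_000


def loop_utilization(kib_per_sec, loops=DRIVE_LOOPS, per_loop=FC_2G_PAYLOAD_MB_S):
    """Fraction of the combined nominal loop bandwidth a read rate uses."""
    return kib_to_mb(kib_per_sec) / (loops * per_loop)


if __name__ == "__main__":
    # Sequential block read rates from test case 1 and test case 2.
    for label, rate in (("test case 1", 378240), ("test case 2", 369848)):
        print(f"{label}: {kib_to_mb(rate):.0f} MB/s, "
              f"{loop_utilization(rate):.0%} of the nominal loop bandwidth")
```

Under those assumptions the sequential reads sit within a few percent of what two 2Gbit loops can nominally deliver, which fits the impression that the drive channels, not the disks, are the limit here.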
And now for comparison a test using only one volume group and a single controller (test case 3):
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
convm004        16G 51346  98 134822  56 69414  17 58651  97 251779  23 758.8   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                512 20048  91 259694  99  5846  18 18707  84 338671  99  2935   9
convm004,16G,51346,98,134822,56,69414,17,58651,97,251779,23,758.8,1,512,20048,91,259694,99,5846,18,18707,84,338671,99,2935,9
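For a quick comparison of test case 3 against test case 1, the sketch below computes how much the sequential numbers improve when going from one volume group on one controller to two volume groups on two controllers. The dictionary keys and function names are made up for illustration; the figures are the K/sec values from the bonnie++ output above. Note that test case 1 ran with write cache mirroring enabled, so the write-side factors probably understate what the second controller adds.

```python
# K/sec figures copied from the bonnie++ output above.
SINGLE_CONTROLLER = {"block_write": 134822, "rewrite": 69414, "block_read": 251779}  # test case 3
DUAL_CONTROLLER = {"block_write": 188347, "rewrite": 97426, "block_read": 378240}    # test case 1


def speedup(single, dual):
    """Ratio dual/single per metric; 2.0 would be perfect scaling."""
    return {metric: dual[metric] / single[metric] for metric in single}


def scaling_efficiency(single, dual, units=2):
    """Speedup divided by the number of controllers/volume groups used."""
    return {metric: ratio / units for metric, ratio in speedup(single, dual).items()}


if __name__ == "__main__":
    factors = speedup(SINGLE_CONTROLLER, DUAL_CONTROLLER)
    for metric, factor in factors.items():
        print(f"{metric:12s} x{factor:.2f}")
```

Reads improve by about a factor of 1.5 and writes by about 1.4, so the second controller clearly helps but the result stays well short of doubling.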
So let's see what PostgreSQL is able to do in terms of sequential IO on such a device:
bench=# select version();
                                                   version
--------------------------------------------------------------------------------------------------------------
 PostgreSQL 8.3devel on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.1 20070105 (Red Hat 4.1.1-52)
(1 row)

bench=#
A simple sequential scan on a large table (pgbench schema generated with a scale of 10000) using only a single controller (same setup as in test case 3):
bench=# select count(1) from accounts;
   count
------------
 1000000000
(1 row)

Time: 619865.939 ms
bench=# select pg_relation_size('accounts')/619::float;
     ?column?
------------------
 216998258.558966
(1 row)
So we are getting about 215MB/s out of the roughly 250MB/s that bonnie++ measured for block reads in test case 3, which looks ok.
So what happens with software RAID0 over two 8-disk RAID10 volume groups on different controllers (same setup as test case 1)?
bench=# select count(1) from accounts;
   count
------------
 1000000000
(1 row)

Time: 478785.617 ms
bench=# select pg_relation_size('accounts')/478::float;
     ?column?
------------------
 281008205.121339
(1 row)

Time: 265.791 ms
So that is more interesting: it seems that PostgreSQL is getting CPU bottlenecked here (the array/file system can do >370MB/s), and those ~280MB/s are pretty much in line with what Luke usually quotes (PostgreSQL getting CPU bottlenecked at around 300MB/s even on very fast AMD Opteron based boxes).
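The throughput arithmetic behind the two sequential scans can be redone with the full timings instead of the truncated seconds used in the queries. This is a sketch under stated assumptions: the relation size is not printed anywhere above, so it is reconstructed from the first query result (216998258.558966 bytes/s times 619 s), and the bonnie++ K/sec figures are taken to be KiB/s. All names in the code are my own.

```python
# Reconstructed from the first psql session: 216998258.558966 * 619, rounded.
# It is an exact multiple of 8192, PostgreSQL's default block size.
RELATION_BYTES = 134_321_922_048

# Elapsed times of the two count(1) scans, in milliseconds, as reported by psql.
SCAN_MS = {"single controller": 619865.939, "two controllers": 478785.617}

# bonnie++ sequential block read rates (K/sec, assumed KiB/s) for the same setups.
BONNIE_READ_KIB = {"single controller": 251779, "two controllers": 378240}


def scan_mb_per_s(total_bytes, elapsed_ms):
    """Sequential scan rate in decimal MB/s."""
    return total_bytes / (elapsed_ms / 1000.0) / 1_000_000


def bonnie_mb_per_s(kib_per_sec):
    """bonnie++ K/sec converted to decimal MB/s."""
    return kib_per_sec * 1024 / 1_000_000


def fraction_of_raw(setup):
    """How much of the bonnie++ read rate the PostgreSQL scan reaches."""
    scan = scan_mb_per_s(RELATION_BYTES, SCAN_MS[setup])
    return scan / bonnie_mb_per_s(BONNIE_READ_KIB[setup])


if __name__ == "__main__":
    for setup in SCAN_MS:
        scan = scan_mb_per_s(RELATION_BYTES, SCAN_MS[setup])
        raw = bonnie_mb_per_s(BONNIE_READ_KIB[setup])
        print(f"{setup}: scan {scan:.0f} MB/s, bonnie++ {raw:.0f} MB/s, "
              f"{fraction_of_raw(setup):.0%}")
```

With a single controller the scan reaches most of what bonnie++ measured for the same setup; with both controllers the gap widens noticeably, which is what one would expect if the CPU rather than the array is the limit.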
For those curious, here are some other random tests (uncommented, so judge for yourself):
A single RAID5 array with 4 logical volumes (each 500GB) and software RAID0 in the OS, two volumes per channel:
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
convm004        16G 50978  99 208098  82 89274  25 59058  98 236993  24 488.5   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                512 20860  94 258368  99  7770  25 21087  95 335918 100  4291  14
convm004,16G,50978,99,208098,82,89274,25,59058,98,236993,24,488.5,1,512,20860,94,258368,99,7770,25,21087,95,335918,100,4291,14
A single RAID5 array over all 16 disks and two identically sized logical volumes each around 1TB in size.
bonnie++:
on one LUN:
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
convm004        16G 51245  98 121190  49 69406  17 56902  94 256111  22 840.9   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                512 20781  94 235507  91  7233  23 18685  84 338017  99  4125  14
using both LUNs and software RAID0:
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
convm004        16G 51423  99 204881  84 83740  23 59040  98 232573  23 554.7   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                512 20303  93 259481  99  7230  23 20357  93 337312 100  3793  13
convm004,16G,51423,99,204881,84,83740,23,59040,98,232573,23,554.7,1,512,20303,93,259481,99,7230,23,20357,93,337312,100,3793,13
with disabled write cache mirroring:
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
convm004        16G 51751  99 242637  97 105392  30 58859  98 235541  23 563.2   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                512 21034  95 255377  99  6485  21 19269  88 337095  99  4119  14
convm004,16G,51751,99,242637,97,105392,30,58859,98,235541,23,563.2,1,512,21034,95,255377,99,6485,21,19269,88,337095,99,4119,14