Friday, June 12. 2009
Computing platforms are constantly changing, evolving and improving. This holds true both for the hardware and for the software running on it.
Benchmarking is always a difficult and complex matter, and whatever you are doing you are likely to make mistakes or bias the results (deliberately or not) in one direction or the other. That makes it all the more important to have a clear goal in mind when benchmarking.
I'm planning to run a number of different testing scenarios on that hardware; the first one, covered in this post, is read-only benchmarking using pgbench.
Unless otherwise mentioned, the postgresql.conf used for this testing had the following non-default settings, and the database was created with UTF8 encoding. The benchmark client ran on the same box as the database itself and connected over a Unix-domain socket.
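The settings list itself did not survive in this copy of the post. The only value recoverable from the comment discussion below is shared_buffers, so the fragment here is a placeholder rather than the original configuration:

```
# Placeholder, not the original settings list.
# shared_buffers = 2GB is confirmed in the comments below;
# the remaining non-default settings are unknown.
shared_buffers = 2GB
```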
Pgbench is the typical off-the-shelf benchmarking tool used with PostgreSQL. It has some scalability issues of its own (more on that later), but in general its behaviour is well understood, which still makes it a valuable testing tool.
The first chart shows pgbench in "SELECT only" mode, run for 240s each time with an increasing number of concurrent connections and increasing table sizes. A scale factor of 10 corresponds to 1,000,000 rows in the accounts table, and the largest scale factor (1000) results in a table with 100M rows. All of these fit easily into the OS buffer cache, so there is no (noticeable) I/O going on here at all.
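Runs of this kind can be driven with something along these lines. This is a sketch rather than the original script: the database name ("bench") and the exact set of client counts are assumptions.

```shell
# Hypothetical reconstruction of the SELECT-only runs described above.
SCALE=100                        # pgbench creates 100,000 accounts rows per scale unit
ROWS=$((SCALE * 100000))         # scale factor 100 = 10M rows

if command -v pgbench >/dev/null 2>&1; then
    pgbench -i -s "$SCALE" bench            # initialize the accounts table
    for c in 1 2 4 8 12 16; do
        pgbench -S -c "$c" -T 240 bench     # SELECT-only, 240s per run
    done
else
    echo "pgbench not found; invocation shown for illustration (rows=$ROWS)"
fi
```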
While slightly degrading performance with an increasing table size is expected, the drop in transaction rate once the number of connections exceeds the number of available cores is pretty severe and likely needs some more investigation. The server itself still had plenty of idle time (around 30% at its lowest point, which also corresponds to the highest transaction rates at 12/16 connections).
The following graph shows a comparison between the different protocol choices available: "simple" is the default mode for pgbench, "extended" uses the extended query protocol, and "prepared" is the same as "extended" but uses prepared statements. This test used a scale factor of 100 (i.e. 10M rows).
Using prepared statements is clearly of great help for this kind of workload, but it is worth mentioning that even in the fastest case (115,000 tps with 8 clients) the server still showed ~50% idle time. I'm not yet sure why performance does not increase with 12 and 16 connections, but it seems clear that we are looking at some scaling issue within pgbench as well.
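pgbench selects between these query protocols with its -M switch; a hedged sketch of how such a comparison might be driven follows (the database name "bench" is again an assumption, not from the post):

```shell
# Sketch of the protocol-mode comparison: simple vs. extended vs. prepared.
MODES="simple extended prepared"

if command -v pgbench >/dev/null 2>&1; then
    for m in $MODES; do
        pgbench -S -M "$m" -c 8 -T 240 bench   # SELECT-only with the given protocol
    done
else
    echo "pgbench not found; modes compared: $MODES"
fi
```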
To validate this theory I also did some testing using sysbench. The following sysbench numbers were generated using --oltp-read-only=on and --db-ps-mode=auto. A table size of 10M rows and a maximum runtime of 240 seconds were used for both pgbench and sysbench.
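A run along those lines might look like this. Only --oltp-read-only and --db-ps-mode are taken from the post; the remaining flags follow the sysbench 0.4-style OLTP test and are assumptions about the setup:

```shell
# Hedged sketch of the sysbench read-only OLTP run described above.
TABLE_SIZE=10000000   # 10M rows
RUNTIME=240           # seconds

if command -v sysbench >/dev/null 2>&1; then
    sysbench --test=oltp --db-driver=pgsql \
             --oltp-table-size=$TABLE_SIZE prepare
    sysbench --test=oltp --db-driver=pgsql \
             --oltp-read-only=on --db-ps-mode=auto \
             --oltp-table-size=$TABLE_SIZE \
             --max-time=$RUNTIME --max-requests=0 \
             --num-threads=8 run
else
    echo "sysbench not found; table size $TABLE_SIZE, runtime ${RUNTIME}s"
fi
```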
It is important to stress that the schema and the queries sysbench uses are more complex than what pgbench does, so this is a bit of an apples-to-oranges comparison. The default schema for sysbench uses two char() columns, which have a noticeable overhead, so I added a run with those columns changed to varchar().
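The char()-to-varchar() change could be applied like this; the column names and lengths follow sysbench's usual sbtest schema and are assumptions on my part, not taken from the post:

```shell
# Hypothetical: convert sysbench's char() columns to varchar() via psql.
SQL="ALTER TABLE sbtest ALTER COLUMN c   TYPE varchar(120);
ALTER TABLE sbtest ALTER COLUMN pad TYPE varchar(60);"

if command -v psql >/dev/null 2>&1; then
    psql -d sbtest -c "$SQL"
else
    echo "psql not found; SQL shown for illustration only"
fi
```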
The sysbench results paint a rather different picture from the pgbench results: the system uses all of the available CPU and hits no noticeable scaling issues even when the number of connections exceeds the number of cores/threads available.
Hi Stefan, could you retest with pgbench on another machine? As you said, the scaling problem in pgbench is a known issue. IIRC you posted this same analysis for 8.3; is it the same machine? Could you show us the comparison? Thanks for testing.
At these transaction rates (i.e. up to 140,000 queries/s) the additional latency involved in going over TCP/IP from another box is significant and seems to hide some of the pgbench issues. However, I will see if I can get some numbers as well.
Given that the server has 34GB of RAM, would you consider allocating more memory to shared buffers? I would try 5-7GB. This setting most likely wouldn't change the test results, and I know that it's very often advised to keep shared buffers small, but given the amount of RAM available nowadays and the efficiency of the Nehalem architecture (i.e. the built-in memory controller and fast QPI bus), I would be interested to see if dedicating more memory to shared buffers is appropriate. Thanks.
The largest database test set used here (scale=100, about 1.6GB) is smaller than the 2GB shared_buffers already allocated. Giving it more RAM will just slow things down unless the working set of data is also increased.
The first chart actually contains numbers at scale=1000, which I ran precisely to see whether shared_buffers smaller than the size of the database would have a noticeable effect in this scenario.
To overcome the pgbench scalability limit, I'd advise using Tsung. Bug me when you see me on IRC if you want me to help with setting up the client etc. With a modern Erlang version (stable and all), you can have 50,000 concurrent TCP-connected clients without a problem; you typically run out of TCP ports before Erlang's scalability limit is reached, I've been told by the Tsung author.