Comparison of Open Source Databases with TPC-D Queries

Last time we talked about the database engine and transactions. Now we have come to the realm of query processing, continuing our revisit of the DBMS side of Virtuoso.

Now, the well established, respectable standard benchmark for the basics of query processing is TPC-D, with its derivatives TPC-H and TPC-R. So, to test how different SQL optimizers manage the 22 queries, we ran a mini version of the D queries against a 1% scale database, some 30MB of data, all in memory. This basically catches whether a SQL implementation misses some of the expected tricks, and how efficient its in-memory loop joins, hash joins, and aggregation are.
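
To give a flavor of the workload, here is Q1, the pricing summary report, essentially as it appears in the benchmark specification; we show the date cutoff precomputed, since interval arithmetic is spelled differently in each SQL dialect. The query scans nearly all of lineitem and exercises grouping, decimal arithmetic, and aggregation in one go:

    -- Q1, pricing summary report; cutoff precomputed as
    -- 1998-12-01 minus the default delta of 90 days
    SELECT l_returnflag, l_linestatus,
           SUM (l_quantity) AS sum_qty,
           SUM (l_extendedprice) AS sum_base_price,
           SUM (l_extendedprice * (1 - l_discount)) AS sum_disc_price,
           SUM (l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
           AVG (l_quantity) AS avg_qty,
           AVG (l_extendedprice) AS avg_price,
           AVG (l_discount) AS avg_disc,
           COUNT (*) AS count_order
      FROM lineitem
     WHERE l_shipdate <= '1998-09-02'
     GROUP BY l_returnflag, l_linestatus
     ORDER BY l_returnflag, l_linestatus;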

When we get to our next stop, high volume I/O, we will run the same with D databases in the 10GB ballpark.

The databases were tested on the same machine, with warm cache, taking the best of 3 runs. All had full statistics and ran with read committed isolation, where applicable. The data was generated using the procedures from the Virtuoso test suite. The Virtuoso version tested was 5.0, to be released shortly; the MySQL version was 5.0.27; the PostgreSQL version was 8.1.6.
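
For PostgreSQL and MySQL, giving the optimizer full statistics comes down to the stock commands sketched below; the exact invocations in our test harness may differ slightly:

    -- PostgreSQL: collect planner statistics for all tables
    ANALYZE;

    -- MySQL: rebuild key distribution statistics for the benchmark tables
    ANALYZE TABLE lineitem, orders, customer, part,
                  partsupp, supplier, nation, region;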

Query times in milliseconds:

Query         Virtuoso  PostgreSQL  MySQL  MySQL with InnoDB
Q1                 206         763    312                198
Q2                   4           6      3                  3
Q3                  13          51    254                 64
Q4                   4          16     24                 60
Q5                  15          22     64                 68
Q6                   9          70    189                 65
Q7                  52         143    211                 84
Q8                  29          31     13                 11
Q9                  36         114     97                 61
Q10                 32          51    117                 57
Q11                 16           9     12                 10
Q12                  8          21     18                130
Q13                 18          74      -                  -
Q14                  7          21    418               1425
Q15                 14          43    389                122
Q16                 16          22     18                 25
Q17                  1          54     26                 10
Q18                 82         120      -                  -
Q19                 19           8      2                 17
Q20                  7          15     66                 52
Q21                 34          86    524                278
Q22                  4         323   3311                805
Total (msec)       626        2063   6068               3545

We lead by a fair margin, but MySQL is hampered by getting some execution plans obviously wrong and by not completing Q13 and Q18 at all, at least not in under several tens of seconds. These two results are therefore shown as "-" and left out of the MySQL totals in the interest of having comparable numbers.
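
For reference, Q13, the customer distribution query, looks essentially like this, with the specification's default substitution words "special" and "requests", and an explicit column alias in place of the spec's derived-table column list for portability. Counting orders per customer over a left outer join and then re-aggregating the counts is exactly the kind of doubly nested aggregation an optimizer can get badly wrong:

    -- Q13, customer distribution
    SELECT c_count, COUNT (*) AS custdist
      FROM (SELECT c_custkey, COUNT (o_orderkey) AS c_count
              FROM customer LEFT OUTER JOIN orders
                     ON c_custkey = o_custkey
                        AND o_comment NOT LIKE '%special%requests%'
             GROUP BY c_custkey) AS c_orders
     GROUP BY c_count
     ORDER BY custdist DESC, c_count DESC;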

As usual, we also ran the workload on Oracle 10g R2. Since Oracle does not like its numbers being published without explicit approval, we will just say that we are even with them within the parameters described above. Oracle has a more efficient decimal type, so it wins where that is central, as on Q1. It also seems to notice that the GROUP BY of Q18 produces its groups in the order of the grouping columns, so it needs no intermediate table for storing the aggregates. If we addressed these two matters, we'd lead by some 15%, whereas now we are even. A faster decimal arithmetic implementation may come in the release after next.
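
Q18, the large volume customer query, is shown below essentially as in the specification, with the quantity threshold at its validation value of 300. The grouping key is in effect o_orderkey, so if the plan delivers rows in o_orderkey order, the groups form consecutively and the aggregates can be streamed, which is the trick we credit Oracle with above:

    -- Q18, large volume customer; answer set limited to the first 100 rows
    SELECT c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice,
           SUM (l_quantity)
      FROM customer, orders, lineitem
     WHERE o_orderkey IN (SELECT l_orderkey
                            FROM lineitem
                           GROUP BY l_orderkey
                          HAVING SUM (l_quantity) > 300)
       AND c_custkey = o_custkey
       AND o_orderkey = l_orderkey
     GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice
     ORDER BY o_totalprice DESC, o_orderdate;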

In the next posts, we will look at I/O and disk allocation, and also return to RDF and LUBM.