Around 2009, we set out to find out whether one can do analytics without a schema. The rationale is of course that data that cannot be queried is not useful, and that large datasets often vary in schema, whether due to multiple sources, multiple versions of the schema at different points in history, changes in applications and business models, or any number of other factors.

In order to answer this question meaningfully, one has to have a top-notch analytics engine. Comparing "schema" and "no schema" on anything but a state-of-the-art analytics database is not interesting. So we made Virtuoso 7 and its Column Store Module, and implemented the features common in dedicated analytics databases, plus suitable SPARQL adaptations of the same.

The present beachhead is the Star Schema Benchmark (SSB) (PDF), which represents the core of most data warehouse workloads (i.e., big scans, selective hash joins, and aggregation). These same patterns are also found at the core of TPC-H and the new TPC-DS.
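
To make the workload concrete, the following is a sketch of the SSB pattern in SQL. Table and column names follow the SSB schema, but this is an abbreviated illustration, not the exact benchmark text:

    -- Big scan of the fact table, selective hash joins to filtered dimensions, aggregation.
    SELECT   d_year, SUM(lo_revenue) AS revenue
    FROM     lineorder, dwdate, part
    WHERE    lo_orderdate = d_datekey      -- join to the date dimension
      AND    lo_partkey   = p_partkey      -- join to the part dimension
      AND    p_category   = 'MFGR#12'      -- filter that keeps the hash build side small
    GROUP BY d_year
    ORDER BY d_year;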

At present, the cost of having no schema is a 2.7x increase in run time, as evidenced by SSB. Virtuoso SQL does the run in 8.4s, MonetDB SQL in 17s, and Virtuoso SPARQL in 22.5s. Virtuoso outperforms column store pioneer MonetDB by a fair margin. MonetDB is probably the fastest open source column store, although faster closed source ones exist.

SPARQL in Virtuoso comes in close behind SQL in MonetDB, trailing by only a factor of 1.3. MySQL with InnoDB, which is not an analytical database, does the run in 2391s.

Initially, we were aiming at a slowdown factor of 2 when comparing SPARQL to SQL on Virtuoso.  Of course, this is only worth something if the SQL performance is generally on a level with relational column stores.  These goals have now been substantially attained.

This is a major beachhead for RDF as a branch of graph database technology, and for Virtuoso as a product.  For the first time one can run a real database workload based on SPARQL at a speed that is comparable with SQL, and without any compromises on the schema flexibility that is the principal reason to use an RDF-based graph model in the first place.  No SPARQL-to-SQL mapping, property tables, or such.

Execution and Query Optimization

The technical accomplishment is divided into two parts: query execution and query optimization. For query execution, there is a compressed column store with vectored execution and a good implementation of query parallelization, hash join, and aggregation. With a schema, there is a multicolumn table; with RDF, there is a quads table. Both are compressed based on the characteristics of the actual data. The SQL execution consists of a scan of the fact table, taking a few of its columns and applying one or more hash joins to the column values, where the hash joins are usually selective. SPARQL execution is the same, except that instead of simply reading another column of the table there is a self-join for each additional attribute. The self-join has a fairly dense access pattern and runs in the same order as the previous one, so it is not very expensive. This constitutes the whole difference.
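
A minimal sketch of that difference follows. The ssb: predicate names are hypothetical stand-ins for whatever vocabulary the RDF translation of SSB uses; the point is only that each extra attribute in SPARQL is one more triple pattern on the same subject:

    -- SQL: additional attributes are just more columns of the fact-table row being scanned
    SELECT lo_extendedprice, lo_discount
    FROM   lineorder
    WHERE  lo_orderdate = 19970101;

    # SPARQL: each attribute is a further self-join on the quads table,
    # in the same subject order as the preceding scan
    PREFIX ssb: <http://example.org/ssb#>
    SELECT ?price ?discount
    WHERE {
      ?lo  ssb:orderdate      19970101   .
      ?lo  ssb:extendedprice  ?price     .
      ?lo  ssb:discount       ?discount  .
    }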

In SQL the plan is obvious, especially since there are no indices on the fact table; getting the right plan in SPARQL, however, is quite difficult. There are up to 12 triple patterns in a query, leading to 12! (twelve factorial, i.e., 12*11*...*2*1 = 479,001,600) possible join orders, multiplied by index- and hash-based variants of each join, where the hash-based variants further multiply the space by considering different combinations of patterns on the build side. Experiments confirm that the same plan is best for both SQL and SPARQL, but getting this as the outcome of query optimization is far from self-evident. It requires a very precise cost model that correctly takes into account the ordering of intermediate results and the density of hits in index lookups, as well as the variability in hash join performance when the hash table size varies, i.e., CPU cache effects. All this modeling is of course also valid for SQL, but it is not really required there, because a much coarser model will also deliver the right plan, as there is much less choice. Further, finding the right plan must be fast. Now, the longest query optimization time is 240ms for an execution time of 3.7s (Q12 at 30G in SPARQL). This is pretty good.
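
To see where a dozen triple patterns come from, here is a sketch of an SSB Q4-style query in SPARQL (predicate names again hypothetical). Every foreign key, measure, and dimension attribute is its own pattern, and the optimizer must pick a join order and a join method for all of them:

    PREFIX ssb: <http://example.org/ssb#>
    SELECT ?year ?nation (SUM(?rev - ?cost) AS ?profit)
    WHERE {
      ?lo  ssb:custkey     ?c    .    # join to customer
      ?lo  ssb:suppkey     ?s    .    # join to supplier
      ?lo  ssb:partkey     ?p    .    # join to part
      ?lo  ssb:orderdate   ?d    .    # join to date
      ?lo  ssb:revenue     ?rev  .    # measure
      ?lo  ssb:supplycost  ?cost .    # measure
      ?c   ssb:region      "AMERICA" .
      ?c   ssb:nation      ?nation   .
      ?s   ssb:region      "AMERICA" .
      ?p   ssb:mfgr        ?mfgr     .
      ?d   ssb:year        ?year     .
      FILTER ( ?mfgr = "MFGR#1" || ?mfgr = "MFGR#2" )
    }
    GROUP BY ?year ?nation
    ORDER BY ?year ?nation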

Using hash join in SPARQL has in any case been problematic, because knowing when to use one requires high confidence in the cost model. If one ends up building a large hash table that is used only a few times, there is a steep penalty. Index-based plans, especially since RDF data tends to be indexed in both S-to-O and O-to-S directions, do not have bad worst cases, but a good hash-based plan is easily 5x faster than an index-based one.

So, for SPARQL the results are game-changing: finally, an up-to-date database. For SQL, Virtuoso 7 performs like an analytics column store is supposed to, but then that is what it is. For the SQL space, it is interesting that this is also open source, so Virtuoso may well be the best-performing open source SQL analytics engine out there. We will later see how the comparison with other SQL column stores goes.

In summary, the central core of the Virtuoso 7 agenda has been accomplished. Incremental progress will continue around addressing more complex benchmarks like TPC-H and TPC-DS, both in SQL and in SPARQL translation, plus full-scale OpenStreetMap in both SQL and SPARQL, and of course the benchmarks being developed in LDBC.

Experiments

The scale of the dataset is 30GB, with 187M lineorder rows; as RDF, this comes to 3.2Gt (3.2 billion triples). The runs are from warm cache on all systems. The test system for all runs is a dual Xeon E5-2630 with 192GB RAM.

The scripts for duplicating the experiment on the Virtuoso 7 Open Source cut will be published later on this blog, when the open source cut incorporates the query optimization improvements discussed herein.

In the table below, the times are elapsed real times in milliseconds, running one query at a time, all from warm cache (i.e., the second run of the query set is reported, and we make sure that all databases are running from memory). MySQL is configured with InnoDB and a 40GB buffer pool, which should be enough for the 30GB dataset. The SQL versions do not declare any explicit indices, but do declare primary keys. The SPARQL version uses the default Virtuoso index scheme, but the query plans end up scanning each predicate in order of S, except for the build phases of hash joins, where an index from O-to-S is occasionally used.

Comparative Results, 30GB dataset

Query   Virtuoso SQL   Virtuoso SPARQL   MonetDB SQL   MySQL SQL
Q1               413              1101          1659       82477
Q2               282               416           500       74436
Q3               253               295           494       75411
Q4               828              2484           958      226604
Q5               837              1915           648      222782
Q6               419              1813           541      219656
Q7              1062              2330          5658      237730
Q8               617              2182           521      194918
Q9               547              1290           381      186112
Q10              499               639           370      186123
Q11             1132              2142          2760      241045
Q12              863              3770          2127      241439
Q13              653              1612          1005      202817
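
For reference, "do not declare any explicit indices, but do declare primary keys" corresponds roughly to DDL of the following shape for the fact table. This is a sketch with the column list abbreviated, not the actual scripts (those will be published with the open source cut), and the 40GB InnoDB buffer pool is the usual my.cnf setting:

    -- Sketch: a primary key is declared, but no secondary indices.
    CREATE TABLE lineorder (
      lo_orderkey       INTEGER NOT NULL,
      lo_linenumber     INTEGER NOT NULL,
      lo_custkey        INTEGER NOT NULL,
      lo_partkey        INTEGER NOT NULL,
      lo_suppkey        INTEGER NOT NULL,
      lo_orderdate      INTEGER NOT NULL,
      lo_extendedprice  INTEGER NOT NULL,
      lo_revenue        INTEGER NOT NULL,
      -- ... remaining SSB measure and date columns ...
      PRIMARY KEY (lo_orderkey, lo_linenumber)
    );

    -- my.cnf: give InnoDB the 40GB buffer pool mentioned above
    -- innodb_buffer_pool_size = 40G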

One could argue that comparing against MySQL is unjustified, as MySQL is certainly not optimized for this workload; e.g., it does not do hash joins and does not parallelize queries. On the other hand, the previous work on SSB for SPARQL, No Size Fits All by Benedikt Kaempgen and Andreas Harth, published at ESWC 2013, did make the comparison between MySQL and Virtuoso 6; thus we think it informative to include MySQL. To summarize No Size Fits All: SPARQL in Virtuoso 6 lost by a factor of 12 against MySQL, but in the present case, Virtuoso 7 SPARQL wins by a factor of 106 against MySQL.

The ESWC paper used a scale of 1G, while the present test uses a scale of 30G. One should remember that the MySQL times are single-threaded, while all the other times are multi-threaded. The test system has 12 cores and 24 threads; since the second hardware thread of a core adds much less than a full core's worth of throughput, running at full platform utilization is at best about 16x faster than single-threaded. Running SPARQL single-threaded instead of with 24 threads per query gives a total time of 175s, still over 10x better than MySQL. Compared to the multi-threaded time of 22.5s, the parallelism yields an average acceleration of 7.7x. Running SPARQL with full threading but no hash join gives a time of 80s, 3.5x worse than with hash join. We note that SSB is a very hash-join-intensive workload.

MonetDB is the more relevant comparison, as it does use the full CPU, with load peaks up to the theoretical 2400% (12 dual-threaded cores), and it is the platform on which a lot of the science of the hash join was refined. MonetDB does relatively best on queries that can start with a very selective join (e.g., Q9), where it outperforms Virtuoso. This is probably due to more even splitting of the work among threads and to not having to deal with data compression. Virtuoso wins the most on queries that select a large fraction of the fact table, where MonetDB is penalized by its policy of full materialization of intermediate results. Joins without any selection (e.g., adding up lo_extendedprice and grouping by the d_year of lo_orderdate) show MonetDB at its worst. Such queries do not occur in SSB, though, so SSB is a relatively MonetDB-friendly benchmark.
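
For concreteness, the unselective join-plus-aggregation just mentioned would look roughly like the following in SQL; with no filter on either table, an engine that fully materializes intermediate results has to materialize a join the size of the fact table:

    -- No selection anywhere: every lineorder row feeds the join and the aggregate.
    SELECT   d_year, SUM(lo_extendedprice) AS total
    FROM     lineorder, dwdate
    WHERE    lo_orderdate = d_datekey
    GROUP BY d_year;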

Conclusions

The back of the problem is broken. Big queries without a schema can be done. One could of course see this coming by experimenting with explicit plans, but users in the field cannot be expected to hand-optimize query plans. Thus the final step consisted of having a good-enough cost model and a smart-enough search order to get the right plan fast. This will be in the next update of Virtuoso Open Source. At that time, we will publish the full queries and configuration files.

This is the breakthrough for RDF analytics. Incremental progress will follow, with more tricks being incorporated, like the ones known to be needed by TPC-H.