I was invited to give a keynote at SEMANTiCS 2014 in Leipzig, Germany, last Thursday. I will here recap some of the main points and comment on some of the ensuing controversy. The talk was initially titled Virtuoso, the Prometheus of RDF. Well, the mythical Prometheus did perform a service but ended up paying for it. Still, the mythical reference is sometimes used when talking of major breakthroughs and big-gain ambitions. In the first slide, I changed the title to Linked Data at Dawn, which is less product-specific and more a reflection on the state of the linked data enterprise at large.

The first part of the talk was under the heading of the promise and the practice. The promise we know well and find no fault with: Schema-last-ness, persistent unique identifiers, self-describing data, some but not too much inference. The applications usually involve some form of integration and often have a mix of strictly structured content with semi-structured or textual content.

These values are by now uncontroversial and embraced by many; however, most of this embrace does not occur in the context of RDF as such. For example, the big online systems on the web all have some schema-last (key-value) functionality. Applications involving long-term data retention have diverse means of providing persistent IDs and self-description, from UUIDs to having the table name in a column so that one can tell where a CSV dump came from.

The practice involves competing with diverse alternative technologies: SQL, key-value, information retrieval (often Lucene-derived). In some instances, graph databases occur as alternatives: Young semanticist, do or die.

In this race, linked data is often the prettiest and most flexible, but takes a hit on various aspects of performance and scalability. This is a database gig, and database is a performance game; make no mistake.

After these preliminaries we come to the "RDF tax," or the more or less intrinsic overheads of describing everything as triples. The word "triple" is used by habit; in fact, we nearly always talk about quads, i.e., subject-predicate-object-graph (SPOG). The next slide is provocatively titled the Bane of the Triple, and is about why having everything as triples is, on the surface, much like relational, except that it makes life hard where tables make it at least manageable, if still not altogether trivial.

The very first statement on the tax slide reads "90% of bad performance comes from non-optimal query plans." If one does triples in the customary way (i.e., a table of quads plus dictionary tables to map URIs and literal strings to internal IDs), one incurs certain fixed costs.
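For concreteness, here is a minimal sketch of that customary layout: a quad table plus a dictionary mapping URIs and literals to integer IDs. The class names and structures are purely illustrative, not any store's actual format.

class Dictionary:
    def __init__(self):
        self.to_id = {}
        self.to_term = []

    def intern(self, term):
        # every distinct URI or literal costs a lookup/insert here
        if term not in self.to_id:
            self.to_id[term] = len(self.to_term)
            self.to_term.append(term)
        return self.to_id[term]

class QuadStore:
    def __init__(self):
        self.dictionary = Dictionary()
        self.quads = set()                      # (s, p, o, g) as integer IDs

    def add(self, s, p, o, g):
        self.quads.add(tuple(self.dictionary.intern(t) for t in (s, p, o, g)))
        # a real store would also maintain several indexes over these quads

store = QuadStore()
store.add("ex:doc1", "dc:title", "A widget report", "ex:g")
print(store.quads)                              # {(0, 1, 2, 3)}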

These costs are deemed acceptable by users who deploy linked data. If these costs were not acceptable, the proof of concept would have already disqualified linked data.

The support cases that come my way are nearly always about things taking too much time. Much less frequently are they about something unambiguously not working. A database has well-defined semantics, so whether something works or not is clear cut.

So, support cases are overwhelmingly about query optimization. The problems fall in two categories:

  • The plan is good in the end, but it takes much longer to make the plan than to execute it.
  • The plan either does the wrong things or does things in the wrong order, but produces a correct result.

Getting no plan at all or getting a clearly wrong result is much less frequent.

If the RDF overheads incurred with a good query plan were show stoppers, the show would have already stopped.

So, let's look at this in more detail; then we will talk about the fixed overheads.

The join selectivity of triple patterns is correlated. Some properties occur together all the time; some rarely; some not at all. Some property values can be correlated, e.g., order number and order date. Capturing these correlations by sampling a multicolumn table is easy; capturing them in triples would require doing the join inside the cost model, which is not done since it would further extend compilation times. When everything is a join, selectivity estimation errors build up fast. When everything is a join, the space of possible query plans also explodes compared to tables: the full plan space can be covered with 7 tables, but not with 18 triple patterns. And this is not merely factorial (the number of permutations); with the different join types (index/hash) and the different compositions of the hash build sides, it is much worse, off in some nameless outer-space fringe of non-polynomiality.
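To make the explosion concrete, here is a back-of-the-envelope count, under the simplifying assumption that each join is just an independent binary choice between index and hash; as noted above, the real space is worse still.

from math import factorial

def plan_space(n):
    orders = factorial(n)                      # join orders alone
    with_join_types = orders * 2 ** (n - 1)    # index or hash at each join
    return orders, with_join_types

print(plan_space(7))    # 7 tables: 5,040 orders; ~322,560 with join types
print(plan_space(18))   # 18 patterns: ~6.4e15 orders; ~8.4e20 with join types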

TPC-H can be run with success because the cost model hits the right plan every time. The primary reason is that the schema and queries unambiguously suggest the structure, even without foreign key declarations. The other reason is that with a handful of tables, all plans can be reviewed, and the cost model reliably tells how many rows will result from each sequence of operations.

Try this with triples; you will know what I mean.

Now, some people have suggested purely rule-based models of SPARQL query compilation. These are arguably faster to run and more predictable. But the thing that must be done, yet will not be done with these, is making the right trade-off between index and hash join. This is the crux of the matter, and without it one can forget about anything but lookups. The choice depends on reliable estimation of cardinality (number of rows, number of distinct keys) on either side of the join. Quantity, not pattern matching.
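A toy sketch of the point, with made-up cost formulas and constants standing in for a real cost model: the decision flips purely on the estimated quantities, which no rule can supply.

from math import log2

def index_join_cost(probe_rows, index_size):
    return probe_rows * log2(max(index_size, 2))    # one index probe per row

def hash_join_cost(probe_rows, build_rows):
    return 1.5 * build_rows + 1.0 * probe_rows      # build once, probe cheaply

def choose_join(probe_rows, build_rows, index_size):
    if index_join_cost(probe_rows, index_size) < hash_join_cost(probe_rows, build_rows):
        return "index"
    return "hash"

print(choose_join(probe_rows=100, build_rows=10_000_000, index_size=10_000_000))      # index
print(choose_join(probe_rows=5_000_000, build_rows=200_000, index_size=10_000_000))   # hash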

Well, many linked data applications are lookups. The graph database API world is sometimes attractive because it gives manual control. MapReduce in the analytical space is sometimes attractive for the same reason.

On the other hand, query languages also give manual control, but then this depends on system-specific hints and cheats. People are often black and white: either all declarative or all imperative. We stand for declarative, but still allow physical control of the plan, like most DBMSs.

To round off, I will give a concrete example:

{  ?thing  rdfs:label    ?lbl         . 
   ?thing  dc:title      ?title       . 
   ?lbl    bif:contains  "gizmo"      . 
   ?title  bif:contains  "widget"     . 
   ?thing  a             xx:Document  . 
   ?thing  dc:date       ?dt          . 
   FILTER  ( ?dt  > "2014-01-01"^^xsd:date ) 
}

There are two full-text conditions, one date, and one class, all on the same subject. How do you do this? Most selective text first, then get the date and check it, then check the second full-text condition given the literal and the condition, then check the class? Wrong. If widgets and gizmos are both frequent and most documents are new, this is very bad, because using a text index to check whether a specific ID has a specific string is not easily vectorable. So, the right plan is: take the more selective text expression, then check the date and class for the results, and put the ?things in a hash table. Then do the less selective text condition, and drop the hits that are not in the hash table. Easily 10x better. Simple? In the end yes, but you do not know this unless you know the quantities.
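Here is a sketch of that plan over toy in-memory data. The helper functions stand in for a text index and the quad indexes, and the assumption that "gizmo" is the more selective condition is illustrative only.

def text_search(index, word):
    # stand-in for a text index: all (?thing, literal) pairs containing the word
    return [(thing, lit) for thing, lit in index if word in lit]

def run_plan(label_index, title_index, doc_class, doc_date, min_date):
    # 1. more selective text condition first (assumed here to be "gizmo" on labels)
    gizmo_hits = text_search(label_index, "gizmo")
    # 2. check date and class on those results; put surviving ?things in a hash table
    survivors = {thing for thing, _ in gizmo_hits
                 if doc_class.get(thing) == "xx:Document"
                 and doc_date.get(thing, "") > min_date}
    # 3. run the less selective text condition and drop hits not in the hash table,
    #    instead of probing the text index once per candidate ID
    return [thing for thing, _ in text_search(title_index, "widget") if thing in survivors]

label_index = [("d1", "gizmo deluxe"), ("d2", "gizmo mini")]
title_index = [("d1", "widget survey"), ("d3", "widget pricing")]
doc_class   = {"d1": "xx:Document", "d2": "xx:Document", "d3": "xx:Document"}
doc_date    = {"d1": "2014-03-01", "d2": "2013-05-01", "d3": "2014-02-01"}
print(run_plan(label_index, title_index, doc_class, doc_date, "2014-01-01"))   # ['d1']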

This gives the general flavor of the problem. Doing this with TPC-H in RDF is way harder, but you catch my drift.

Each individual instance is doable. Closer and closer alignment between reality and prediction will improve the situation indefinitely, but since the space is as good as infinite, there cannot be a guarantee of optimality except for toy cases.

The Gordian Knot shall not be defeated with pincers but by the sword.

We will come to this in a bit.

Now, let us talk of the fixed overheads. The embarrassments are in the query optimization domain; the daily grind of relative cost and provisioning is in this one.

The overheads come from:

  • Indexing everything
  • Having literals and URI strings via dictionary
  • Having a join for every attribute

These all fall under the category of having little to no physical design room.

In the indexing-everything department, we load 100 GB of TPC-H in 15 minutes in SQL, with ordering only on primary keys and almost no other indexing. The equivalent with triples is around 12 hours. This data can be found on this blog (the TPC-H series and Meeting the Challenges of Linked Data in the Enterprise). This is on the order of confusing a screwdriver with a hammer. If the nail is not too big, the wood not too hard, and you hit it just right, the nail might still go in. The RDF bulk load is close to the fastest possible given the general constraints of what it does; the same logic is used for the record-breaking 15-minute TPC-H bulk load, so the code is good. But indexing everything is just silly.
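As an illustration of what indexing everything means at load time, here is a sketch that keeps the quad set in several sorted orderings; the particular orderings shown are assumptions, not any store's actual scheme.

quads = [
    ("s1", "rdfs:label", "gizmo",      "g1"),
    ("s1", "dc:date",    "2014-03-01", "g1"),
    ("s2", "rdfs:label", "widget",     "g1"),
]

def ordering(perm):
    # one sorted copy per column order; each must be built and then maintained
    return sorted(tuple(q[i] for i in perm) for q in quads)

indexes = {name: ordering(perm)
           for name, perm in {"PSOG": (1, 0, 2, 3),
                              "POSG": (1, 2, 0, 3),
                              "OPSG": (2, 1, 0, 3)}.items()}
print(len(indexes), "orderings kept up to date; a PK-only SQL table sorts once")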

The second, namely the dictionary of URIs and literals, cuts both ways. I talked to Bryan Thompson of SYSTAP (the Bigdata RDF store) at ICDE in Washington, D.C. He said that they store short strings inline and long ones via the dictionary. I said we used to do the same but stopped in the interest of better compression. Which is best depends on the workload and the working-set-to-memory ratio. But if you must make the choice once and for all, or at least as a database-wide global setting, you are between a rock and a hard place. Physical vs. logical design, again.
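A small sketch of the trade-off, with an assumed length threshold: short literals are stored in place, long ones go through a dictionary.

INLINE_LIMIT = 12                    # assumed threshold
dictionary, reverse = {}, []

def encode(value):
    if len(value) <= INLINE_LIMIT:
        return ("inline", value)     # no dictionary round-trip when reading back
    if value not in dictionary:      # long values pay a lookup per use,
        dictionary[value] = len(reverse)   # but deduplicate and compress better
        reverse.append(value)
    return ("id", dictionary[value])

print(encode("gizmo"))                          # ('inline', 'gizmo')
print(encode("a rather long literal value"))    # ('id', 0)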

The other aspect of this is the applications that do regexps on URI strings or literals. Doing this is like driving a Formula 1 race in reverse gear. Use a text index. Always. This is why most implementations have one, even though SPARQL itself makes no provision for it. If you really need regexps, and on supposedly opaque URIs at that, tokenize them and put them in a text index as a text literal. Or, if an inverted-file word index is really not what you need, use a trigram one. So far, nobody has wanted one badly enough for us to offer it, even though it would be easy enough. But special indices for special data types (e.g., chemical structure) are sometimes wanted, and we have a generic solution for all this, to be introduced shortly on this blog. Again, physical design.
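For illustration, a sketch of tokenizing URI strings into an inverted word index, so that what would have been a regexp scan becomes a word lookup; the tokenizer is an assumption, and a trigram index would be built the same way from 3-grams.

import re
from collections import defaultdict

inverted = defaultdict(set)          # word -> set of subjects

def index_uri(subject, uri):
    for token in re.split(r"[^A-Za-z0-9]+", uri):
        if token:
            inverted[token.lower()].add(subject)

index_uri("d1", "http://example.com/catalog/gizmo-2014")
index_uri("d2", "http://example.com/catalog/widget-2013")
print(inverted["gizmo"])             # {'d1'}: a word lookup instead of a regexp scan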

I deliberately name the self-join-per-attribute point last, even though it is often the first and only intrinsic overhead that gets named. True, if the physical model is triples, each attribute is a join against the triple table. Vectored execution and the right use of hash join help, though. The Star Schema Benchmark SQL-to-SPARQL gap is only 2.5x, as documented last year on this blog. This makes SPARQL win by 100+x against MySQL and lose by only 0.8x against column store pioneer MonetDB. Let it be said that this is so far the best case and that the gap is wider in pretty much all other cases. This gap is well and truly due to the self-join matter, even after the self-joins are done vectored, local, ordered; in one word, right. The literal and URI translation matter plays no role here. The needless indexing hurts at load but has no effect at query time, since none of the bloat participates in the running. Again, physical design.
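A minimal sketch of the self-join-per-attribute point: with quads, fetching k attributes of a subject means k lookups against the same table, where a multicolumn row yields them in one access. The data structures are toy stand-ins.

sp_index = {                                    # quad table seen through an SP index
    ("d1", "rdfs:label"): "gizmo",
    ("d1", "dc:title"):   "widget survey",
    ("d1", "dc:date"):    "2014-03-01",
}
wide_table = {"d1": ("gizmo", "widget survey", "2014-03-01")}   # multicolumn row

def fetch_from_quads(s, props):
    return tuple(sp_index[(s, p)] for p in props)   # one join per attribute

def fetch_from_table(s):
    return wide_table[s]                            # one row access, all attributes

assert fetch_from_quads("d1", ("rdfs:label", "dc:title", "dc:date")) == fetch_from_table("d1")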

Triples are done right, so?

In the summer of 2013, after the Star Schema results, it became clear that further gains could perhaps be had and query optimization made smoother and more predictable, but that these would be paths of certain progress with diminishing returns per effort. No, not the pincers; give me the sword. So, between fall 2013 and spring 2014, aside from doing diverse maintenance, I did the TPC-H series. This is the proficiency run for big-league databases; the America's Cup, not a regatta on the semantic lake.

Even if the audience is principally a linked data one, the baseline must be that of the senior science of SQL.

It stands to reason, and has been demonstrated by extensive experimentation at CWI, that RDF data by and large has structure. This structure will carry linked data through the last mile to being a real runner against the alternative technologies (SQL, IR, key-value) mentioned earlier.

The operative principles have been mentioned earlier and are set forth on the slides. In forthcoming articles I will present some results.

One important proposal for structure awareness came from Thomas Neumann in an RDF-3X paper introducing characteristic sets. There, the application was the creation of more predictable cost estimates. Neumann correctly saw this as possibly the greatest barrier to predictable RDF performance. Peter Boncz and I discussed the use of this for physical optimization once when driving back to Amsterdam from a LOD2 review in Luxembourg. Pham Minh Duc of CWI did much of the schema discovery research, documented in the now published LOD2 book (Linked Open Data – Creating Knowledge Out of Interlinked Data). The initial Virtuoso implementation had to wait for TPC-H and the general squeezing of the quads model to be near complete. It will likely turn out that the greatest gain of all from structure awareness will be bringing optimization predictability to SQL levels. This will open the whole bag of tricks known to data warehousing to safe deployment for linked data. Of course, much of this has to do with exploiting physical layout; hence it also needs the physical model to be adapted. Many of these techniques have high negative impact if used in the wrong place; hence the cost model must guess right. But they work in SQL, and, as per Thomas Neumann's initial vision, there is no reason why they would not do so in a schema-less model if adapted in a smart enough manner.
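As a sketch of the characteristic-set idea, assuming nothing about Virtuoso's or RDF-3X's actual implementation: group subjects by the set of properties they carry and count how often each set occurs; the counts can then feed cardinality estimation and structure discovery.

from collections import Counter, defaultdict

triples = [
    ("d1", "rdfs:label", "gizmo"),  ("d1", "dc:date", "2014-03-01"),
    ("d2", "rdfs:label", "widget"), ("d2", "dc:date", "2014-02-01"),
    ("x1", "foaf:name", "Alice"),
]

props_by_subject = defaultdict(set)
for s, p, _ in triples:
    props_by_subject[s].add(p)

characteristic_sets = Counter(frozenset(ps) for ps in props_by_subject.values())
for cs, count in characteristic_sets.items():
    print(sorted(cs), count)
# {dc:date, rdfs:label} occurs twice: a candidate regular shape, and a better
# basis for cardinality estimates than per-predicate counts alone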

All this gives rise to some sociological or psychological observations. Jens Lehmann asked me why now, why not earlier; after all, over the years many people have suggested property tables and other structured representations. It is now because there are no further breakthroughs to be had within an undifferentiated physical model.

For completeness, we must here mention other approaches to alternative, if still undifferentiated, physical models. A number of research papers describe memory-only, pointer-based (i.e., no index, no hash join) implementations of triples or quads. Some of these are on graph processing frameworks, some stand-alone. YarcData is a commercial implementation that falls in this category. These may have higher top speeds than column stores, even after all vectoring and related optimizations. However, the space utilization is perforce larger than with optimal column compression, and this, plus the requirement of having everything in memory, makes them more expensive to scale. The linked data proposition is usually about integration, and this implies initially large data, even if not all of it ends up being used.

The graph-analytics, pointer-based approach will be especially good for per-application extractions, as suggested by Oracle in their paper at GRADES 2013. No doubt this will come under discussion at LDBC, where Oracle Labs is now a participant.

But back to the physical model. What we have in mind is a relational column store (multicolumn-ordered, column-wise compressed tables, a bit like Vertica and Virtuoso in SQL mode) for the regular parts, and quads for the rest. What is big is regular, since a big thing perforce comes from something that happens a lot, like click streams, commercial transactions, or instrument readings. For the 8-lane motorway of regular data, you get the F1 racer with the hardcore best in column store tech. When the autobahn ends and turns into a mountain trail, the engine morphs into a dirt bike.
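Here is a sketch of that split under assumed thresholds and layouts: subjects whose characteristic set is frequent go to a regular, table-like representation, and the rest stays as quads.

from collections import Counter, defaultdict

triples = [
    ("d1", "rdfs:label", "gizmo"),  ("d1", "dc:date", "2014-03-01"),
    ("d2", "rdfs:label", "widget"), ("d2", "dc:date", "2014-02-01"),
    ("x1", "foaf:name", "Alice"),                 # irregular one-off
]

by_subject = defaultdict(dict)
for s, p, o in triples:
    by_subject[s][p] = o

shape_counts = Counter(frozenset(attrs) for attrs in by_subject.values())
FREQUENT = 2                                      # toy threshold

tables, leftover_quads = defaultdict(list), []
for s, attrs in by_subject.items():
    shape = frozenset(attrs)
    if shape_counts[shape] >= FREQUENT:
        tables[shape].append((s, attrs))          # regular: column-store-friendly rows
    else:
        leftover_quads.extend((s, p, o) for p, o in attrs.items())

print(len(tables[frozenset({"rdfs:label", "dc:date"})]), len(leftover_quads))
# 2 rows in the regular "table", 1 leftover triple kept as quads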

This is complex enough, and until all the easy gains have been extracted from quads, there is little incentive. Besides, it has the prerequisite of quads done right, plus the need for top-of-the-line relational capability, so as not to fall on your face once the speedway begins.

Steve Buxton of MarkLogic gave a talk right before mine. Coming from a document-centric world, MarkLogic understandably offers a whole continuum of different mixes between SPARQL and document-oriented queries. Steve correctly observed that some users found this great; others found it near blasphemy, an unholy heterodoxy that confuses distinct principles.

This is our experience as well: using XML fragments in SPARQL, with XPath and the like, is possible in Virtuoso but very seldom practiced. This is not the same as MarkLogic, though; MarkLogic is about triples-in-documents, while the Virtuoso take is more like documents-in-triples. Not to mention that the use of SQL and stored procedures in Virtuoso is rare among SPARQL users.

The whole matter of the absence of physical design in RDF is a related but broader instance of such purism.

In my talk, I had a slide titled The Cycle of Adventure, generally philosophizing on the dynamics of innovation. All progress begins with an irritation with the status quo; to mention a few examples: the NoSQL rebellion; the rejection of parallel SQL databases in favor of key-value and MapReduce; the admission that central schema authority at web scale is impossible; the anti-ACID stance when there are wide-area geographies to deal with. The stage of radicalism tends to throw out the baby with the bathwater. But when the purists have their own enclave, free of the noxious corruption of the rejected world, they find that life is hard and defects of human character persist, even when all subscribe to the same religion. Of course, here we may have further splinter groups. After this, the dogma adapts to reality: the truly valuable insights of the original rebellion gain in appreciation, and the extremism becomes more moderate. Finally there is integration with the mainstream, which becomes enriched by the new content.

By the time the term Linked Data came into broad use, the RDF enterprise had its breakaway colonies that had started to shed some of the initial zeal. By now, we have the last phase, reconciliation, in its early stages.

This process is in principle complete when linked data is no longer a radical bet, but a technology to be routinely applied to data when the nature of the data fits the profile. The structure awareness and other technology discussed here will mostly eliminate the differential in deployment cost.

The spreading perception of an expertise gap in this domain will even out the cost in terms of personnel. The flexibility gains that were the initial driver of the movement will be more widely enjoyed as these factors fuel broader adoption.

To help this along, we have LDBC, the Linked Data Benchmark Council, with the agenda of creating industry consensus on measuring progress across the linked data and graph DB frontiers. I duly invited MarkLogic to join.

There were many other interesting conversations at the conference; I will comment on these later.

To be continued...

SEMANTiCS 2014 Series