Virtuoso Cluster

We often get questions on clustering support, especially around RDF, where databases quickly get rather large. So we will answer them here.

But first, some supporting technology. We have an entirely new disk allocation and IO system. It is basically operational but needs some further tuning. It offers much better locality and much better sequential access speeds.

Especially for dealing with large RDF databases, we will introduce data compression. Over the years we have looked at different key compression possibilities but have never been very excited by them, since they complicate random access to index pages, make for longer execution paths, require scraping data for one logical thing from many places, and so on. Instead, we will now compress pages before writing them to disk, so the cache is in machine byte order and alignment while the on-disk image is compressed. Since multiple processors are commonplace on servers, they can well be used for compression, that being such a nicely local operation: all in cache and requiring no serialization with other work.
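
To make the write path concrete, here is a minimal sketch in Python (our engine is of course not written in Python; the page size, compressor, and worker count are made-up assumptions): pages stay uncompressed in the buffer cache and are compressed on a worker thread just before hitting disk.

    import zlib
    from concurrent.futures import ThreadPoolExecutor

    PAGE_SIZE = 8192          # assumed page size, illustrative only

    # Compression is CPU-local: each page is compressed independently,
    # so a pool of workers can use spare cores without any locking.
    pool = ThreadPoolExecutor(max_workers=4)

    def compress_page(page: bytes) -> bytes:
        """Compress one in-cache page just before it goes to disk."""
        assert len(page) == PAGE_SIZE
        return zlib.compress(page, 6)

    def write_pages(pages, disk_file):
        # The cache keeps pages in machine byte order and alignment;
        # only the on-disk image is compressed.
        for compressed in pool.map(compress_page, pages):
            disk_file.write(len(compressed).to_bytes(4, "little"))
            disk_file.write(compressed)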

Of course, what was fixed length now becomes variable length, but if the compression ratio is fairly constant, we reserve space for the expected compressed size and deal with the rare overflows separately. So there is no complicated shifting of data around when something grows.
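
A rough sketch of the space reservation idea, assuming a fairly stable compression ratio; the slot size and the overflow handling here are only illustrative:

    # Assumed numbers, for illustration only.
    PAGE_SIZE = 8192
    EXPECTED_RATIO = 0.5                       # typical compression ratio
    SLOT_SIZE = int(PAGE_SIZE * EXPECTED_RATIO)

    overflow_area = {}                         # page_id -> bytes that did not fit

    def place_page(page_id: int, compressed: bytes) -> bytes:
        """Return the fixed-size slot image; spill any excess to the overflow area."""
        if len(compressed) <= SLOT_SIZE:
            # Pad to the slot size so neighboring slots never have to move.
            return compressed.ljust(SLOT_SIZE, b"\0")
        # Rare case: worse-than-expected compression; keep the tail elsewhere.
        overflow_area[page_id] = compressed[SLOT_SIZE:]
        return compressed[:SLOT_SIZE]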

Once this is done, it could well be a separate intermediate release.

Now about clusters. We have for a long time had various plans for clusters but have not seen the immediate need to execute on them. With the rapid growth of the Linking Open Data movement and questions about web-scale knowledge systems, it is time to get going.

How will it work? Virtuoso remains a generic DBMS, so the clustering support is an across-the-board feature, not something for RDF only. Thus we can join Oracle, IBM DB2, and others at the multi-terabyte TPC races.

We introduce hash partitioning at the index level and allow for redundancy, where multiple nodes can serve the same partition. This allows for load balancing of reads, replacement of failing nodes, and growth of the cluster without interruption of service.
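
As a hedged illustration of the routing this implies (node names, replica counts, and the hash function are invented for the example): a key hashes to a partition, each partition is served by more than one node, reads go to any live replica, and writes go to all of them.

    import random
    import zlib

    # Hypothetical cluster layout: each partition has two replicas.
    PARTITIONS = [
        ["node1", "node2"],
        ["node2", "node3"],
        ["node3", "node1"],
    ]

    def partition_of(key: bytes) -> int:
        """Hash partitioning at the index level: the key alone picks the partition."""
        return zlib.crc32(key) % len(PARTITIONS)

    def node_for_read(key: bytes, live_nodes: set) -> str:
        """Reads are load-balanced over whichever replicas are still up."""
        replicas = [n for n in PARTITIONS[partition_of(key)] if n in live_nodes]
        return random.choice(replicas)

    def nodes_for_write(key: bytes) -> list:
        """Writes go to every replica of the partition."""
        return PARTITIONS[partition_of(key)]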

The SQL compiler, SPARQL, and database engine all stay the same. There is a little change in the SQL run time, not so different from what we do with remote databases at present in the context of our virtual database federation. There is a little extra complexity for distributed deadlock detection and sometimes multiple threads per transaction. We remember that one RPC round trip costs as much as 3-4 index lookups, so we pipeline things so as to move requests in batches, a few dozen at a time.
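
The batching point can be sketched like this; it is a toy model, not our actual RPC layer, and the batch size is simply the "few dozen" mentioned above:

    BATCH_SIZE = 32                     # "a few dozen" requests per round trip

    class BatchingClient:
        """Queue index lookups per node and flush them as one RPC."""

        def __init__(self, send_rpc):
            self.send_rpc = send_rpc    # callable: (node, list_of_keys) -> list_of_rows
            self.pending = {}           # node -> keys waiting to be sent

        def lookup(self, node: str, key: bytes):
            batch = self.pending.setdefault(node, [])
            batch.append(key)
            if len(batch) >= BATCH_SIZE:
                return self.flush(node)

        def flush(self, node: str):
            keys = self.pending.pop(node, [])
            # One round trip answers the whole batch instead of one key at a time.
            return self.send_rpc(node, keys) if keys else []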

The cluster support will be in the same executable and will be enabled by configuration file settings. Administration is limited to one node, but Web and SQL clients can connect to any node and see the same data. There is no balancing between storage and control nodes because clients can simply be allocated round robin for statistically even usage. In relational applications, as exemplified by TPC-C, if one partitions by fields with an application meaning (such as warehouse ID), and if clients have an affinity to a particular chunk of data, they will of course preferentially connect to nodes hosting this data. With RDF, such affinity is unlikely, so nodes are basically interchangeable.
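
A small sketch of the two connection policies (the node list and the warehouse-to-node mapping are hypothetical): RDF clients are dealt out round-robin since any node will do, while a TPC-C-style client working mostly on one warehouse connects to a node hosting that warehouse's data.

    from itertools import cycle

    NODES = ["node1", "node2", "node3"]      # hypothetical cluster nodes
    round_robin = cycle(NODES)

    def connect_rdf_client() -> str:
        """RDF clients have no data affinity: deal nodes out round-robin."""
        return next(round_robin)

    def connect_tpcc_client(warehouse_id: int) -> str:
        """A TPC-C client prefers the node hosting its warehouse's partition."""
        return NODES[warehouse_id % len(NODES)]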

In practice, we develop in June and July. Then we can rent a supercomputer, maybe from Amazon EC2, and experiment away.

We should just come up with a name for this. Maybe something astronomical, like a star cluster: big and bright, but in this case not far away.