Virtuoso Data Space Bot
Burlington, United States

Retrospective and Outlook for 2008

At the close of the year, I'll give a short recap of the past year of Virtuoso development and a look at where we are headed in 2008.

A year ago, I was in the middle of redoing the Virtuoso database engine for better SMP performance.  We redid the way traversal of index structures and cache buffers was serialized for SMP and generally compared the Virtuoso and Oracle engines function by function.  We had just returned from ISWC 2006 in Athens, Georgia, and the Virtuoso database was becoming a usable triple store.

Soon thereafter, we confirmed that all this worked when we put out the first cut of Dbpedia with Chris Bizer et al.  At the same time we were working with Alan Ruttenberg on what would become the Banff health care and life sciences demo.

The WWW 2007 conference in Banff, Canada, was a sort of kick-off for the Linking Open Data movement, which started as a community project under SWEO, the W3C interest group for Semantic Web Education and Outreach, and has gained a life of its own since.

Right after WWW 2007 the Virtuoso development effort split into two tracks, one for enhancing the then new 5.0 release and one for building a new generation of Virtuoso, notably featuring clustering and double storage density for RDF.

The first track produced constant improvements to the relational to RDF mapping functionality, SPARQL enhancements, and Redland, Jena and Sesame compatible client libraries for using Virtuoso as a triple store.  These things have been out with testers for a while and are all generally available as of this writing.

The second track started with adding key compression to the storage engine, specifically with regard to RDF, even though there are some gains in relational applications as well.  With RDF, the space consumption drops to about half, all without recourse to any compression that is incompatible with random access, like gzip.  At the start of August, we turned to clustering and are now code complete, with pretty much all the tricks one would expect: full-function SQL, taking advantage of colocated joins, and doing aggregation and generally all possible processing where the data is.  I have covered details of this along the way in previous posts.  The key point is that now the thing is written and works with test cases.
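To make the remark about random-access-compatible compression concrete, here is a minimal sketch of one generic technique, per-page prefix compression of sorted keys. It is an illustration only and not Virtuoso's actual page layout; the key strings and the function names `compress_page` and `read_key` are invented for the example.

```python
import os

def compress_page(keys):
    """Store the longest common prefix once, then only each key's suffix."""
    prefix = os.path.commonprefix(keys)
    return prefix, [k[len(prefix):] for k in keys]

def read_key(page, i):
    """Random access: rebuild row i without decoding any other row."""
    prefix, suffixes = page
    return prefix + suffixes[i]

# Hypothetical sorted index keys; real keys would be binary, not text.
keys = [
    "graph1/subject1000/type/Person",
    "graph1/subject1001/name/Mary",
    "graph1/subject1002/name/John",
    "graph1/subject1003/type/Person",
]

page = compress_page(keys)
raw_size = sum(len(k) for k in keys)
packed_size = len(page[0]) + sum(len(s) for s in page[1])
print(read_key(page, 2), raw_size, packed_size)
```

Each row stays individually addressable, which is the property a stream compressor like gzip gives up.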

 

In late October, we were at the W3C workshop on mapping relational data to RDF.  For us, this confirmed the importance of mapping and scalability in general.  Ivan Herman proposed forming a W3C incubator group on benchmarking.  A W3C incubator group on relational to RDF mapping is also being formed.

Now, scalability has two sides.  One is dealing with volume and the other is dealing with complexity.  Volume alone will not help if interesting queries cannot be formulated.  Hence we recently extended SPARQL with subqueries so that we can now express at least any SQL workload, which was previously not the case.  It is sort of a contradiction in terms to say that SPARQL is the universal language for information integration while not being able to express, for example, the TPC H queries.  Well, we fixed this.  A separate post will highlight how.  The W3C process will eventually follow, as the necessity of these things is undeniable, on the unimpeachable authority of the whole SQL world.  Anyway, for now, SPARQL as it is ought to become a recommendation and extensions can be addressed later.

For now, the only RDF benchmark that seems to be out there is the loading part of the LUBM.  We did a couple of enhancements of our own for that just recently but much bigger things are on the way.  Also, the billion triples challenge is an interesting initiative in the area.  We all recognize that loading any number of triples is a finite problem with known solutions.  The challenge is running interesting queries on large volumes. 

Our present emphasis is demonstrating both RDF data warehousing and RDF mapping with complex queries and large data.  We start with the TPC H benchmark and doing the queries both through mapping to SQL against any RDBMS, Oracle, DB2, Virtuoso or other, and by querying the physical RDF rendition of the data in Virtuoso.  From there, we move to querying a collection of RDBMS's hosting similar data.

Doing this with performance at the level of direct SQL in the case of mapping, and not very much slower with physical triples, is an important milestone on the way to a real-world enterprise data web.  Real life has harder and more unexpected issues than a benchmark but at any rate doing the benchmark without breaking a sweat is a step on the way.  We sent a paper to ESWC 2008 about that but it was rather incomplete.  By the time of the VLDB submissions deadline in March we'll have more meat.

Another tack soon to start is a rearchitecting of Zitgist around clustered Virtuoso.  Aside from matters of scale, we will make a number of qualitatively new things possible.  Again, more will be released in the first quarter of 08.

Beyond these short and mid-term goals we have the introduction of entirely dynamic and demand driven partitioning, a la Google Bigtable or Amazon Dynamo.  Now, regular partitioning will do for a while yet but this is the future when we move to the vision of linked data everywhere.

In conclusion, this year we have built the basis and the next year is about deployment.  The bulk of really new development is behind us and now we start applying.  Also, the community will find adoption easier due to our recent support of the common RDF APIs.

 
12/18/2007 07:22 GMT-0500
More on RDF and Vertical Storage
We actually did the experiment I mentioned a couple of posts back, about storing RDF triples column-wise.

The test loads 4.8 million triples of LUBM data and reads the whole set on one index and then checks if it finds the same row on another index.

Reading GSPO and checking OGPS takes 27 seconds.  Doing the same with column wise bitmap indices on S, G, P and O takes 86 seconds.   The latter checks the existence of the row by AND'ing 4 bitmap indices and the former checks its existence by a single lookup in a multi-part index whose last part is a bitmap.  The result is approximately what one would expect.  The bitmap AND could be optimized a bit, dropping the time to maybe 70 seconds. 
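The two existence checks compared above can be sketched as follows. This is a toy model under stated assumptions: the quads are invented, a Python `set` stands in for the multi-part index, and a Python integer stands in for each bitmap. It shows why the column-wise check touches four structures where the row-wise check touches one; it says nothing about the measured timings.

```python
from collections import defaultdict

# Hypothetical toy quads (G, S, P, O); row ids are list positions.
quads = [
    ("g1", "s1", "type", "Person"),
    ("g1", "s1", "name", "Mary"),
    ("g1", "s2", "type", "Person"),
    ("g1", "s2", "name", "John"),
]

# Row layout: one multi-part index, modelled here as a set of whole keys.
row_index = set(quads)

# Column layout: one bitmap per distinct value in each of the 4 columns.
# Bit i of a bitmap is set when row i holds that value in that column.
columns = [defaultdict(int) for _ in range(4)]
for row_id, quad in enumerate(quads):
    for col, value in zip(columns, quad):
        col[value] |= 1 << row_id

def exists_row_index(quad):
    """Single lookup in the multi-part index."""
    return quad in row_index

def exists_bitmaps(quad):
    """AND the 4 per-column bitmaps; any surviving bit is a matching row."""
    hits = -1  # all bits set
    for col, value in zip(columns, quad):
        hits &= col.get(value, 0)
    return hits != 0

present = ("g1", "s1", "name", "Mary")
absent = ("g1", "s1", "name", "John")
print(exists_row_index(present), exists_bitmaps(present))
print(exists_row_index(absent), exists_bitmaps(absent))
```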

Now speaking of compression, it is true that column storage will work better.  For example the G and P columns will compress to pretty much nothing.  On a row layout they compress too but not to nothing since even if a value is not unique you have to store the place where the value is if you want to read rows in constant time per row.

What is nice with the 4 bitmaps is that no combination of search conditions is penalized.  But the trick of using bitmaps for self-join is lost:  You can't evaluate {?s a Person . ?s name "Mary"} by and'ing the S bitmaps for persons and for subjects named "Mary".
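A small sketch of the self-join point, with invented data and integer subject ids. In the row-style index the bitmap at the end of a (P, O) key ranges over subjects, so two such bitmaps can be AND'ed directly. In the column-wise layout the bits denote row positions, and the two patterns always match different rows, so the same AND comes out empty. For brevity, the column-wise bitmaps for P and O are pre-combined into a single per-(P, O) row bitmap here.

```python
from collections import defaultdict

# Hypothetical triples (S, P, O) with integer subject ids.
triples = [
    (1, "type", "Person"),
    (1, "name", "Mary"),
    (2, "type", "Person"),
    (2, "name", "John"),
    (3, "name", "Mary"),
]

subjects_by_po = defaultdict(int)  # row-style: bitmap over subject ids
rows_by_po = defaultdict(int)      # column-style: bitmap over row positions
for row_id, (s, p, o) in enumerate(triples):
    subjects_by_po[(p, o)] |= 1 << s
    rows_by_po[(p, o)] |= 1 << row_id

def bits(bitmap):
    """List the positions of the set bits."""
    return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]

# {?s a Person . ?s name "Mary"} as a bitmap AND over subjects.
persons_named_mary = bits(
    subjects_by_po[("type", "Person")] & subjects_by_po[("name", "Mary")]
)
# The same AND over row-position bitmaps finds nothing.
row_overlap = bits(rows_by_po[("type", "Person")] & rows_by_po[("name", "Mary")])
print(persons_named_mary, row_overlap)
```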

The 4 bitmap indices are remarkably compact, though: 8840 pages all together.  We could probably get the G, S, P, O columns in 3000 pages or so, using very little compression.  The OGPS index is 5169 pages and the GSPO index is 21243 pages.

None of the figures have any compression, except what a bitmap naturally produces.

Now we have figured out a modified row layout which will about double the working set that fits in the same memory while keeping things in rows.  We will try that.  The GSPO index will be about 10000 pages and OGPS will be about 4500.  We do not expect much impact on search or insert times.
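Putting the page counts quoted in this post side by side, as plain arithmetic (the variable names are just labels; the projected figures are the approximate ones given above):

```python
# Page counts quoted in the post for the 4.8 million triple LUBM load.
four_bitmaps = 8840
ogps_now, gspo_now = 5169, 21243
ogps_new, gspo_new = 4500, 10000  # projected, approximate

rows_now = ogps_now + gspo_now  # current row layout, both indices
rows_new = ogps_new + gspo_new  # modified row layout, both indices

print(rows_now, rows_new)
print(round(rows_now / four_bitmaps, 2))  # row layout vs. 4 bitmaps today
print(round(rows_new / rows_now, 2))      # new row layout vs. current one
```

So the two row indices currently take about three times the pages of the four bitmaps, and the modified layout would bring them to a little over half of their current size.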

We looked at using gzip for database pages.  They shrink to between 1/4 and 1/3 of a page.  But this does not improve working set, and having variable length pages generates all kinds of special cases you don't want.  So we will improve working set first and deal with somewhat compressed data in the execution engine.  After that, maybe gzip will cut the size to 1/2 or so, but that will be good for disk only.  And it does not so much matter how much you transfer but how many seeks you do.
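The variable-length effect is easy to reproduce with Python's `zlib`, which implements the same DEFLATE algorithm that gzip uses. The page size and the synthetic row contents below are assumptions for illustration, so the resulting ratio is not comparable to the figures above.

```python
import struct
import zlib

PAGE_SIZE = 8192  # assumed page size, for illustration only
ROW_SIZE = 16     # four 32-bit ids per row: G, S, P, O

def make_page():
    """Fill one page with synthetic, roughly sorted quads of integer ids."""
    rows = PAGE_SIZE // ROW_SIZE
    return b"".join(
        struct.pack("<4I", 1, 1000 + i // 8, i % 8, 500000 + i * 7)
        for i in range(rows)
    )

page = make_page()
packed = zlib.compress(page)

# The compressed length depends on the content, so pages stop being
# fixed-size, and reading any one row means inflating the whole page.
print(len(page), len(packed))
restored = zlib.decompress(packed)
```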

Still, column-wise storage will likely win for size.  So if the working set is much larger than memory this may have an edge.  To keep all bases covered we will eventually add this as an option.
 



06/11/2007 04:35 GMT-0500 Modified: 06/11/2007 04:36 GMT-0500
Announcing Virtuoso Open-Source Edition v5.0.0
All, OpenLink Software are pleased to announce a new release of Virtuoso, Open-Source Edition, version 5.0.0. This version includes:
  • Significant rewrite of database engine resulting in 50%-100% improvement on single CPU and in some cases up to 300% on multiprocessor CPUs by decreasing resource-contention between threads and other optimizations.
  • Radical expansion of RDF support including:
      • In-built middleware (called the Sponger) for transforming non-RDF into RDF "on the fly" (e.g. producing Triples from Microformats, REST-style Web Services, and (X)HTML etc.)
      • Full Text Indexing of Literal Objects in Triple Patterns (via Filter or magic bif:contains predicate applied to Literal Objects)
      • Basic Inferencing (Subclass and Subproperty Support)
      • SPARQL Aggregate Functions
      • SPARQL Update Language Support (Updates, Inserts, Deletions in SPARQL)
      • Improved Support of XML Schema Type System (including the use of XML Schema Complex Types as Objects of bif:xcontains predicate)
      • Enhancements to the in-built SPARQL to SQL Compiler's Cost Optimizer
      • Performance Optimizations to RDF VIEWs (SQL to RDF Mapping)
  • Various bug-fixes
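As an illustration of what subclass support means for a query, here is a minimal sketch of the idea. It is not how Virtuoso implements inference; the class names, the data and the helper names `superclasses` and `instances_of` are all hypothetical.

```python
def superclasses(cls, subclass_of):
    """Return cls plus every class reachable through subclass links."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(subclass_of.get(current, ()))
    return seen

def instances_of(target, types, subclass_of):
    """Subjects whose declared class is the target or one of its subclasses."""
    return sorted(s for s, c in types if target in superclasses(c, subclass_of))

# Hypothetical schema and instance data.
subclass_of = {"GraduateStudent": ["Student"], "Student": ["Person"]}
types = [
    ("alice", "GraduateStudent"),
    ("bob", "Person"),
    ("acme", "Organization"),
]

print(instances_of("Person", types, subclass_of))
```

A query for persons then also returns subjects declared only as graduate students, without those extra type triples being stored. Subproperty support follows the same pattern with property links in place of class links.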
NOTE: Databases created with earlier versions of Virtuoso will be automatically upgraded to Virtuoso 5.0 but after upgrade will not be readable with older Virtuoso versions.

For more information please see:

Virtuoso Open Source Edition:
  • Home Page: http://virtuoso.openlinksw.com/wiki/main/
  • Download Page: http://virtuoso.openlinksw.com/wiki/main/Main/VOSDownload

OpenLink Data Spaces:
  • Home Page: http://virtuoso.openlinksw.com/wiki/main/Main/OdsIndex
  • SPARQL Usage Examples (re. SIOC, FOAF, AtomOWL, SKOS): http://virtuoso.openlinksw.com/wiki/main/Main/ODSSIOCRef

Interactive SPARQL Demo: http://demo.openlinksw.com/isparql/

OpenLink AJAX Toolkit (OAT):
  • Project Page: http://sourceforge.net/projects/oat
  • Live Demonstration: http://demo.openlinksw.com/DAV/JS/oat/index.html


04/12/2007 13:48 GMT-0500 Modified: 04/12/2007 09:50 GMT-0500
Virtuoso Open Source 5.0 Release Imminent

We are a couple of days from releasing the Virtuoso Open Source 5.0 cut. This will make the technology that we are showing with Dbpedia and the various OpenLink web sites available to the public.

The updates involve:

  • Significant database engine improvements, as discussed in previous posts.
  • Tons of RDF related bug fixes.
  • Text index extension to SPARQL
  • New SQL data type capturing the whole XML Schema scalar type system used in RDF.

Soon to follow are:

  • Basic inference for RDF, including type and property subsumption.
  • Whole new disk IO system with much better disk locality.

Existing databases will be automatically upgraded when started with the new Virtuoso 5.0 server. Note that after upgrade, the RDF data is not backward compatible.

We will be rolling out more Virtuoso hosted semantic web content in the Linking Open Data project, part of our participation in the Semantic Web Education and Outreach activity at W3C.


03/16/2007 05:55 GMT-0500
Recent Virtuoso Developments

We have been working extensively on virtual database refinements.  There are many SQL cost model adjustments to better model distributed queries, and we now support direct access to Oracle and Informix statistics system tables.  Thus, when you attach a table from one or the other, you automatically get up to date statistics.  This helps Virtuoso optimize distributed queries.  The documentation is also updated in this regard, with a new section on distributed query optimization.

On the applications side, we have been keeping up with the SIOC RDF ontology developments.  All ODS applications now make their data available as SIOC graphs for download and SPARQL query access.

What is most exciting however is our advance in mapping relational data into RDF.  We now have a mapping language that makes arbitrary legacy data in Virtuoso or elsewhere in the relational world RDF queriable.  We will put out a white paper on this in a few days.
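To show the kind of correspondence such a mapping expresses, here is a minimal sketch that turns one relational row into triples. It is not the Virtuoso mapping language, and it materializes triples only to make the correspondence visible. The `row_to_triples` helper, the `http://example.org/` base IRI, the IRI patterns and the sample row are all assumptions made for the example.

```python
def row_to_triples(table, key_column, row, base="http://example.org/"):
    """Map one row to triples: the key forms the subject IRI, and each
    remaining non-NULL column becomes a predicate with its value as object."""
    subject = f"{base}{table}/{row[key_column]}"
    triples = [(subject, "rdf:type", f"{base}schema/{table}")]
    for column, value in row.items():
        if column != key_column and value is not None:
            triples.append((subject, f"{base}schema/{table}#{column}", value))
    return triples

# Hypothetical row from a hypothetical "customer" table; None models NULL.
row = {"id": 7, "name": "Mary", "email": None}
for triple in row_to_triples("customer", "id", row):
    print(triple)
```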

Also we have some innovations in mind for optimizing the physical storage of RDF triples.  We keep experimenting, now with our sights set on the high end of triple storage, towards billion triple data sets.  We are experimenting with a new, more space efficient index structure for better working set behavior.  Next week will yield the first results.

09/19/2006 07:45 GMT-0500
         
Powered by OpenLink Virtuoso Universal Server
Running on Linux platform
OpenLink Software 1998-2006