<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
<channel>

<title>OpenLink Virtuoso (Product Blog)</title><link>http://virtuoso.openlinksw.com/blog/vdb/blog/</link><description>A great place to track Virtuoso&#39;s rapid evolution.</description><managingEditor>kidehen@openlinksw.com</managingEditor><pubDate>Tue, 21 May 2013 00:49:02 GMT</pubDate><generator>Virtuoso Universal Server 06.04.3135</generator><webMaster>kidehen@openlinksw.com</webMaster><image><title>OpenLink Virtuoso (Product Blog)</title><url>http://virtuoso.openlinksw.com/weblog/public/images/vbloglogo.gif</url><link>http://virtuoso.openlinksw.com/blog/vdb/blog/</link><description>A great place to track Virtuoso&#39;s rapid evolution.</description><width>88</width><height>31</height></image>
<item><title>Benchmarks, Redux (part 15): BSBM Test Driver Enhancements</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-03-22#1688</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1688#comments</comments><pubDate>Tue, 22 Mar 2011 22:32:28 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2011-03-22T17:04:43-04:00</n0:modified><description>&lt;p&gt;This article covers the changes we have made to the &lt;a class=&quot;auto-href&quot; href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id0x2361bf18&quot;&gt;BSBM&lt;/a&gt; test driver during our series of experiments.&lt;/p&gt;

&lt;ul&gt;
 &lt;li&gt;
  &lt;p&gt;
    &lt;b&gt;Drill-down mode&lt;/b&gt; - For queries that take a product type as a parameter, the test driver invokes the query multiple times, each time with a random subtype of the product type used in the previous invocation. The starting point of the drill-down is a random type from a settable level in the hierarchy.  The rationale for the drill-down mode is that, depending on the parameter choice, there can be 1000x differences in query run time.  Thus run times of consecutive query mixes will be incomparable unless we guarantee that each mix has a predictable number of queries with a product type from each level in the hierarchy.&lt;/p&gt;
 &lt;/li&gt;

&lt;li&gt;
  &lt;b&gt;Permutation of query mix&lt;/b&gt; - In the BI workload, the queries are run in a random order on each thread in multiuser mode.  Doing exactly the same thing on many threads is not realistic for large queries. The &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x2834cec8&quot;&gt;data&lt;/a&gt; access patterns must be spread out in order to evaluate how bulk IO is organized with differing concurrent demands. The permutations are deterministic on consecutive runs and do not depend on the non-deterministic timing of concurrent activities.  For queries with a drill-down, the individual executions that make up the drill-down are still consecutive.&lt;/li&gt;

&lt;li&gt;
  &lt;p&gt;
    &lt;b&gt;New metrics&lt;/b&gt; - The BI Power is the geometric mean of query run times scaled to queries per hour and multiplied by the scale factor, where 100 Mt is considered the unit scale. The BI Throughput is the arithmetic mean of the run times scaled to QPH and adjusted to scale as with the Power metric. These are analogous to the &lt;a class=&quot;auto-href&quot; href=&quot;http://www.tpc.org/&quot; id=&quot;link-id0x236c5158&quot;&gt;TPC&lt;/a&gt;-&lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id0x28814950&quot;&gt;H&lt;/a&gt; Power and Throughput metrics. &lt;/p&gt;
&lt;p&gt;The &lt;i&gt;Power&lt;/i&gt; is defined as&lt;/p&gt; 
&lt;blockquote&gt;(scale_factor / 284826) *  3600 / ((t1 * t2 * ... * tn) ^ (1 / n)) &lt;/blockquote&gt;
&lt;p&gt;The &lt;i&gt;Throughput&lt;/i&gt; is defined as&lt;/p&gt; 
&lt;blockquote&gt;(scale_factor / 284826) *  3600 / ((t1 + t2 + ... + tn) / n)&lt;/blockquote&gt;
&lt;p&gt;The magic number 284826 is the scale that generates approximately 100 million triples (100 Mt).  We consider this &amp;quot;scale one.&amp;quot;  The reason for the multiplication is that scores at different scales should get similar numbers, otherwise 10x larger scale would result roughly in 10x lower throughput with the BI queries.&lt;/p&gt;

&lt;p&gt;We also show the percentage each query represents from the total time the test driver waits for responses. &lt;/p&gt;
&lt;/li&gt;

&lt;li&gt;
  &lt;p&gt;
    &lt;b&gt;Deadlock retry&lt;/b&gt; - When running update mixes, a transaction may be aborted by a deadlock.   We have added retry logic for this case.&lt;/p&gt;
&lt;/li&gt;

&lt;li&gt;
  &lt;p&gt;
    &lt;b&gt;Cluster mode&lt;/b&gt; - Cluster databases may have multiple interchangeable &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Hypertext_Transfer_Protocol&quot; id=&quot;link-id0x240f9008&quot;&gt;HTTP&lt;/a&gt; listeners.  With this mode, one can specify multiple end-points so a multi-user workload can divide itself evenly over these.&lt;/p&gt;
&lt;/li&gt;

&lt;li&gt;
  &lt;p&gt;
    &lt;b&gt;Identifying matter&lt;/b&gt; - A version number was added to test driver output.  Use of the new switches is also indicated in the test driver output.&lt;/p&gt;
&lt;/li&gt;

&lt;li&gt;
  &lt;p&gt;
    &lt;b&gt;SUT &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Central_processing_unit&quot; id=&quot;link-id0x249b7208&quot;&gt;CPU&lt;/a&gt;&lt;/b&gt; - In comparing results, it is crucial to differentiate between in-memory runs and IO-bound runs.  To make this easier, we have added an option to report server CPU times over the timed portion (excluding warm-ups).  A pluggable self-script determines the CPU times for the system; thus clusters can be handled, too.  The time is given as the sum of the CPU time the server processes have accumulated during the run and as a percentage of the wall-clock time.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
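&lt;p&gt;The Power and Throughput definitions above can be sketched in a few lines of Python. This is only an illustration, not the test driver&amp;#39;s own code: the function names and the sample run times are made up for the example, and run times are assumed to be in seconds.&lt;/p&gt;

&lt;pre&gt;
```python
import math

# Scale that generates roughly 100 million triples (scale one in the post).
UNIT_SCALE = 284826


def bi_power(scale_factor, run_times_s):
    """Geometric mean of run times (seconds), scaled to queries per hour."""
    # Averaging logarithms avoids overflow or underflow in the raw product.
    log_mean = sum(math.log(t) for t in run_times_s) / len(run_times_s)
    return (scale_factor / UNIT_SCALE) * 3600 / math.exp(log_mean)


def bi_throughput(scale_factor, run_times_s):
    """Arithmetic mean of run times (seconds), scaled to queries per hour."""
    mean = sum(run_times_s) / len(run_times_s)
    return (scale_factor / UNIT_SCALE) * 3600 / mean


# Hypothetical run times, purely for illustration.
times = [1.0, 4.0, 16.0]
print(bi_power(UNIT_SCALE, times))       # geometric mean of the times is 4 s
print(bi_throughput(UNIT_SCALE, times))  # arithmetic mean of the times is 7 s
```
&lt;/pre&gt;

&lt;p&gt;Since the geometric mean never exceeds the arithmetic mean, Power comes out at least as high as Throughput for the same set of run times, and the gap widens as the run times become more uneven.&lt;/p&gt;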

&lt;p&gt;These changes will soon be available &lt;a href=&quot;http://blogs.usnet.private:8893/RPC2&quot; id=&quot;link-id0x1f9a57c0&quot;&gt;as a diff&lt;/a&gt; and &lt;a href=&quot;http://blogs.usnet.private:8893/RPC2&quot; id=&quot;link-id0x1f2fea08&quot;&gt;as a source tree&lt;/a&gt;. This version is labeled &lt;b&gt;&lt;code&gt;BSBM Test Driver 1.1-opl&lt;/code&gt;&lt;/b&gt;; the &lt;b&gt;&lt;code&gt;-opl&lt;/code&gt;&lt;/b&gt; signifies OpenLink additions.  &lt;/p&gt;

&lt;p&gt;We invite FU Berlin to include these enhancements in their SourceForge repository of the BSBM test driver.  The options are documented more precisely in the README file in the above distribution.&lt;/p&gt;

&lt;p&gt;The next planned upgrade of the test driver concerns adding support for &amp;quot;&lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x2865ac68&quot;&gt;RDF&lt;/a&gt;-H&amp;quot;, the RDF adaptation of the industry standard TPC-H decision support benchmark for &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id0x23597bb0&quot;&gt;RDBMS&lt;/a&gt;.&lt;/p&gt;



&lt;h3&gt;
&lt;i&gt;Benchmarks, Redux&lt;/i&gt; Series&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1658&quot; id=&quot;link-id0x1db2be00&quot;&gt;Benchmarks, Redux (part 1): On RDF Benchmarks&lt;/a&gt;
&lt;/li&gt;

&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1660&quot; id=&quot;link-id0x1dfcc038&quot;&gt;Benchmarks, Redux (part 2): A Benchmarking Story&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1663&quot; id=&quot;link-id0x197c26d0&quot;&gt;Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1665&quot; id=&quot;link-id0x1d149cf0&quot;&gt;Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1667&quot; id=&quot;link-id0x1ab69450&quot;&gt;Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs &lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1669&quot; id=&quot;link-id0x1e67d688&quot;&gt;Benchmarks, Redux (part 6): BSBM and I/O, continued&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1671&quot; id=&quot;link-id0x1dad87c8&quot;&gt;Benchmarks, Redux (part 7): What Does BSBM Explore Measure?&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
 &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1673&quot; id=&quot;link-id0x1cc73830&quot;&gt;Benchmarks, Redux (part 8): BSBM Explore and Update &lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1675&quot; id=&quot;link-id0x1d6879a8&quot;&gt;Benchmarks, Redux (part 9): BSBM With Cluster&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1677&quot; id=&quot;link-id0x1dfae510&quot;&gt;Benchmarks, Redux (part 10): LOD2 and the Benchmark Process&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1678&quot; id=&quot;link-id0x1ef052a0&quot;&gt;Benchmarks, Redux (part 11): The Substance of Benchmarks&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=&quot; id=&quot;link-id0x1dadddb0&quot;&gt;Benchmarks, Redux (part 12): Our Own BSBM Results Report&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=&quot; id=&quot;link-id0x1e662ef0&quot;&gt;Benchmarks, Redux (part 13): BSBM BI Modifications &lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=&quot; id=&quot;link-id0x1df6fa70&quot;&gt;Benchmarks, Redux (part 14): BSBM BI Mix &lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
Benchmarks, Redux (part 15): BSBM Test Driver Enhancements &lt;i&gt;(this post)&lt;/i&gt;
&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Benchmarks, Redux (part 10): LOD2 and the Benchmark Process</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-03-10#1679</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1679#comments</comments><pubDate>Thu, 10 Mar 2011 23:29:41 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2011-03-14T19:37:14.000001-04:00</n0:modified><description>&lt;p&gt;I have in the previous posts generally argued for and demonstrated the usefulness of benchmarks.&lt;/p&gt;

&lt;p&gt;Here I will talk about how this could be organized in a way that is tractable and takes vendor and end user interests into account. These are my views on the subject and do not represent a consensus of &lt;a class=&quot;auto-href&quot; href=&quot;http://lod2.eu/&quot; id=&quot;link-id0x2acb0760&quot;&gt;LOD2&lt;/a&gt; members, but they have been discussed in the consortium. &lt;/p&gt;

&lt;p&gt;My colleague Ivan Mikhailov once proposed that the only way to get benchmarks run right is to package them as a single script that does everything, like instant noodles -- just add water!  But even instant noodles can be abused: Cook too long, add too much water, maybe forget to light the stove, and complain that the result is unsatisfyingly hard and brittle, lacking the suppleness one has grown to expect from this delicacy. No, the answer lies at the other end of the culinary spectrum, in gourmet cooking.  Let the best cooks show what they can do, and let them work at it; let those who in fact have capacity and motivation for creating &lt;i&gt;le chef d&amp;#39;oeuvre culinaire&lt;/i&gt; (&amp;quot;the culinary masterpiece&amp;quot;) create it.  Even so, there are many value points along the dimensions of preparation time, cost, and esthetic layout, not to forget taste and nutritional values.  Indeed, an intimate &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Knowledge&quot; id=&quot;link-id0x2aca6a30&quot;&gt;knowledge&lt;/a&gt; &lt;i&gt;de la vie secrete du canard&lt;/i&gt; (&amp;quot;the secret life of duck&amp;quot;) is required in order to liberate the aroma that it might take flight and soar.  In the previous posts, I have shed some light on how we prepare &lt;i&gt;le canard&lt;/i&gt;, and if &lt;i&gt;le canard&lt;/i&gt; be such then &lt;i&gt;la dinde&lt;/i&gt; (turkey) might in some ways be analogous; who is to say?&lt;/p&gt;

&lt;p&gt;In other words, as a vendor, we want to have complete control over the benchmarking process, and have it take place in our environment at a time of our choice.  In exchange for this, we are ready to document and observe possibly complicated rules, document how the runs are made, and let others monitor and repeat them on the equipment on which the results are obtained.  This is the &lt;a class=&quot;auto-href&quot; href=&quot;http://www.tpc.org/&quot; id=&quot;link-id0x2b847818&quot;&gt;TPC&lt;/a&gt; (Transaction Processing Performance Council) model.&lt;/p&gt;

&lt;p&gt;Another culture of doing benchmarks is the periodic challenge model used in TREC, the &lt;a class=&quot;auto-href&quot; href=&quot;http://challenge.semanticweb.org/&quot; id=&quot;link-id0x2ac3a6f8&quot;&gt;Billion Triples Challenge&lt;/a&gt;, the Semantic Search
Challenge and others. In this model, vendors prepare the benchmark submission and agree to joint publication.&lt;/p&gt;

&lt;p&gt;A third party performing benchmarks by itself is uncommon in databases.  Licenses often explicitly prohibit this, for understandable reasons.&lt;/p&gt;

&lt;p&gt;The LOD2 project has an outreach activity called Publink where we offer to help owners of &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x2aea5930&quot;&gt;data&lt;/a&gt; publish it as &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id0x2a790128&quot;&gt;Linked Data&lt;/a&gt;. Similarly, since FP7 projects are supposed to offer a visible service to their communities, I proposed that LOD2 offer to serve a role in disseminating and auditing &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x29babb00&quot;&gt;RDF&lt;/a&gt; store benchmarks.&lt;/p&gt;

&lt;p&gt;One representative of an RDF store vendor I talked to, in relation to setting up a benchmark configuration of their product, told me that we could do this and that they would give some advice but that such an exercise was by its nature fundamentally flawed and could not possibly produce worthwhile results.  The reason for this was that OpenLink engineers could not possibly learn enough about the other products nor unlearn enough of their own to make this a meaningful comparison.&lt;/p&gt;

&lt;p&gt;Isn&amp;#39;t this the very truth?   Let the chefs  mix their own spices.&lt;/p&gt;

&lt;p&gt;This does not mean that there would not be comparability of results. If the benchmarks and processes are well defined, documented, and checked by a third party, these can be considered legitimate and not just one-off best-case results without further import.&lt;/p&gt;

&lt;p&gt;In order to stretch the envelope, which is very much a LOD2 goal, this benchmarking should be done on a variety of equipment -- whatever works best at the scale in question.  Increasing the scale remains a stated objective.  LOD2 even promised to run things with a trillion triples in another 3 years.  &lt;/p&gt;

&lt;p&gt;Imagine that the unimpeachably impartial Berliners made house calls. Would this debase Justice to be a servant of mere show-off?  Or would this on the contrary combine strict Justice with edifying Charity?  Who indeed is in greater need of the light of objective evaluation than the vendor whose very nature makes it a being of bias and prejudice?&lt;/p&gt;

&lt;p&gt;Even better, &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/National_Research_Institute_for_Mathematics_and_Computer_Science&quot; id=&quot;link-id0x2a21d108&quot;&gt;CWI&lt;/a&gt;, with its &lt;a href=&quot;http://monetdb.cwi.nl/Development/Research/Articles/&quot; id=&quot;link-id0x1d6479d0&quot;&gt;stellar database pedigree&lt;/a&gt;, agreed in principle to audit RDF benchmarks in LOD2. &lt;/p&gt;

&lt;p&gt;In this way one could get a stamp of approval for one&amp;#39;s results regardless of when they were produced, and be free of the arbitrary schedule of third party benchmarking runs.  On the relational side this is a process of some cost and complexity, but since the RDF side is still young and more on mutually friendly terms, the process can be somewhat lighter here.  I did promise to draft some extra descriptions of process and result disclosure so that we could see how this goes.&lt;/p&gt;

&lt;p&gt;We could even do this unilaterally -- just publish &lt;a class=&quot;auto-href&quot; href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x2a0d73d8&quot;&gt;Virtuoso&lt;/a&gt; results according to a predefined reporting and verification format.  If others wished to publish by the same rules, LOD2 could use some of the benchmarking funds for auditing the proceedings.  This could all take place over the net, so we are not talking about any huge cost or prohibitive amount of trouble.  It would be in the FP7 spirit that LOD2 provide this service for free, naturally within reason.&lt;/p&gt;

&lt;p&gt;Then there is the matter of the &lt;a class=&quot;auto-href&quot; href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id0x2a1722a8&quot;&gt;BSBM&lt;/a&gt; Business Intelligence (BI) mix.  At present, it seems everybody has chosen to defer the matter to another round of BSBM runs in the summer.  This seems to fit the pattern of a public challenge with a few months given for contenders to prepare their submissions.  Here we certainly should look at bigger scales and more diverse hardware than in the Berlin runs published this time around.  The BI workload is in fact fairly cluster friendly, with big joins and aggregations that parallelize well.  There it would definitely make sense to reserve an actual cluster, and have all contenders set up their gear on it.  If all have access to the run environment and to monitoring tools, we can be reasonably sure that things will be done in a transparent manner.  &lt;/p&gt;

&lt;p&gt;(I will talk about the BI mix in more detail in &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=&quot; id=&quot;link-id0x1dfcc038&quot;&gt;part 13&lt;/a&gt; and &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=&quot; id=&quot;link-id0x1edaa388&quot;&gt;part 14&lt;/a&gt; of this series.)&lt;/p&gt;

&lt;p&gt;Once the BI mix has settled and there are a few interoperable implementations, likely in the summer, we could pass from the challenge model to a situation where vendors may publish results as they become available, with LOD2 offering its services for audit. &lt;/p&gt;

&lt;p&gt;Of course, this could be done even before then, but the content of the mix might not be settled.  We likely need to check it on a few implementations first.&lt;/p&gt;

&lt;p&gt;For equipment, people can use their own, or LOD2 partners might on a case-by-case basis make some equipment available for running on the same hardware on which, say, the Virtuoso results were obtained.  For example, FU Berlin could give people a login to get their recently published results fixed.  Now this might or might not happen, so I will not hold my breath waiting for this but instead close with a proposal.&lt;/p&gt;

&lt;p&gt;As a unilateral diplomatic overture I put forth the following: If other vendors are interested in 1:1 comparison of their results with our publications, we can offer them a login to the same equipment.  They can set up and tune their systems, and perform the runs.  We will just watch.  As an extra quid pro quo, they can try Virtuoso as configured for the results we have published, with the same data.  This way, both parties get to see the other&amp;#39;s technology with proper tuning and installation.  What, if anything, is reported about this activity is up to the owner of the technology being tested.  We will publish a set of benchmark rules that can serve as a guideline for mutually comparable reporting, but we cannot force anybody to use these.  This all will function as a catalyst for technological advance, all to the ultimate benefit of the end user.  If you wish to take advantage of this offer, you may contact &lt;a href=&quot;mailto:hwilliams@openlinksw.com?subject=Collaborative RDF Benchmark&quot; id=&quot;link-id0x1c071100&quot;&gt;Hugh Williams&lt;/a&gt; at OpenLink Software, and we will see how this can be arranged in practice.
&lt;/p&gt;

&lt;p&gt;The next post will talk about the &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1678&quot; id=&quot;link-id0x19933fd8&quot;&gt;actual content of benchmarks&lt;/a&gt;.  The milestone after this will be when we publish the measurement and reporting protocols.&lt;/p&gt;


&lt;h3&gt;
&lt;i&gt;Benchmarks, Redux&lt;/i&gt; Series&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1658&quot; id=&quot;link-id0x1c554800&quot;&gt;Benchmarks, Redux (part 1): On RDF Benchmarks&lt;/a&gt;
&lt;/li&gt;

&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1660&quot; id=&quot;link-id0x1ec159e8&quot;&gt;Benchmarks, Redux (part 2): A Benchmarking Story&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1663&quot; id=&quot;link-id0x1dd5eb10&quot;&gt;Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1665&quot; id=&quot;link-id0x18f05940&quot;&gt;Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1667&quot; id=&quot;link-id0x1ed5ef10&quot;&gt;Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs &lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1669&quot; id=&quot;link-id0x1e9cb130&quot;&gt;Benchmarks, Redux (part 6): BSBM and I/O, continued&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1671&quot; id=&quot;link-id0x1dfa79d8&quot;&gt;Benchmarks, Redux (part 7): What Does BSBM Explore Measure?&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
 &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1673&quot; id=&quot;link-id0x1eb6f478&quot;&gt;Benchmarks, Redux (part 8): BSBM Explore and Update &lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1675&quot; id=&quot;link-id0x1de5a918&quot;&gt;Benchmarks, Redux (part 9): BSBM With Cluster&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
Benchmarks, Redux (part 10): LOD2 and the Benchmark Process &lt;i&gt;(this post)&lt;/i&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1678&quot; id=&quot;link-id0x1dae9060&quot;&gt;Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=&quot; id=&quot;link-id0x1f45fa10&quot;&gt;Benchmarks, Redux (part 12): Our Own BSBM Results Report&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=&quot; id=&quot;link-id0x1f49d2b8&quot;&gt;Benchmarks, Redux (part 13): BSBM BI Modifications &lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=&quot; id=&quot;link-id0x1e68e4c8&quot;&gt;Benchmarks, Redux (part 14): BSBM BI Mix &lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=&quot; id=&quot;link-id0x1e353858&quot;&gt;Benchmarks, Redux (part 15): BSBM Test Driver Enhancements &lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Benchmarks, Redux (part 1): On RDF Benchmarks</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-02-28#1659</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1659#comments</comments><pubDate>Mon, 28 Feb 2011 20:20:22 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2011-03-14T17:16:34.000002-04:00</n0:modified><description>&lt;p&gt;This post introduces a series on &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x1e724ae0&quot;&gt;RDF&lt;/a&gt; benchmarking. In these posts I will cover the following:&lt;/p&gt;

&lt;ul&gt;
 &lt;li&gt;
  &lt;p&gt;Correct misleading &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id0x1e325480&quot;&gt;information&lt;/a&gt; about us in the &lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/V6/index.html&quot; id=&quot;link-id0x1ded41d0&quot;&gt;recent Berlin report&lt;/a&gt;: The load rate is off the wall and the update mix is missing. We supply the right numbers and explain how to load things so that one gets decent performance.&lt;/p&gt;
 &lt;/li&gt;

 &lt;li&gt;
  &lt;p&gt;Discuss configuration options for &lt;a class=&quot;auto-href&quot; href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x1e0a2548&quot;&gt;Virtuoso&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;

 &lt;li&gt;
  &lt;p&gt;Tell a story about multithreading and its perils and how vectoring and scale-out can save us.&lt;/p&gt;
&lt;/li&gt;

 &lt;li&gt;
  &lt;p&gt;Analyze the run time behavior of Virtuoso 6 Single, 6 Cluster, and 7 Single.&lt;/p&gt;
&lt;/li&gt;

 &lt;li&gt;
  &lt;p&gt;Look at the benefits of SSDs (solid-state storage devices) over HDDs (hard disk devices; spinning platters), and I/O matters in general.&lt;/p&gt;
&lt;/li&gt;

 &lt;li&gt;
  &lt;p&gt;Talk in general about modalities of benchmark running, and how to reconcile vendors doing what they know best with the air of legitimacy of a third party. Should things be done a la &lt;a class=&quot;auto-href&quot; href=&quot;http://www.tpc.org/&quot; id=&quot;link-id0x1e0ef4f0&quot;&gt;TPC&lt;/a&gt; or a la TREC? We will hopefully try a bit of both; at least, so I have proposed to our partners in &lt;a class=&quot;auto-href&quot; href=&quot;http://lod2.eu/&quot; id=&quot;link-id0x1e54d3d8&quot;&gt;LOD2&lt;/a&gt;, the EU FP7 project that also funded the recent Berlin report.&lt;/p&gt;
&lt;/li&gt;

 &lt;li&gt;
  &lt;p&gt;Outline the desiderata for an RDF benchmark that is not just an RDF-ized relational workload, the Social Intelligence Benchmark.&lt;/p&gt;
&lt;/li&gt;

 &lt;li&gt;
  &lt;p&gt;Talk about &lt;a class=&quot;auto-href&quot; href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id0x1e730bc8&quot;&gt;BSBM&lt;/a&gt; specifically. What does it measure?&lt;/p&gt;
&lt;/li&gt;

 &lt;li&gt;
  &lt;p&gt;Discuss some experiments with the BI use case of BSBM.&lt;/p&gt;
&lt;/li&gt;

 &lt;li&gt;
  &lt;p&gt;Document how the results mentioned here were obtained and suggest practices for benchmark running and disclosure.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The background is that the LOD2 FP7 project is supposed to deliver a report about the state of the art and benchmark laboratory by March 1. The Berlin report is a part thereof. In the project proposal we talk about an ongoing benchmarking activity and about having up-to-date installations of the relevant RDF stores and &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id0x1c1551e0&quot;&gt;RDBMS&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Since this is taxpayer money for supposedly the common good, I see no reason why such a useful thing should be restricted to the project participants. On the other hand, running a display window of stuff for benchmarking when, at least in some cases, licenses prohibit unauthorized publishing of benchmark results might be seen to conflict with the spirit of the license if not its letter. We will see.&lt;/p&gt;

&lt;p&gt;For now, my take is that we want to run benchmarks of all interesting software, inviting the vendors to tell us how to do that if they will, and maybe even letting them perform those runs themselves. Then we promise not to disclose results without the vendor&amp;#39;s permission. Access to the installations is limited to whoever operates the equipment. Configuration files and detailed hardware specs and such on the other hand will be made public. If a run is published, it will be with permission and in a format that includes full information for replicating the experiment.&lt;/p&gt;

&lt;p&gt;In the LOD2 proposal we also in so many words say that we will stretch the limits of the state of the art. This stretching is surely not limited to the project&amp;#39;s own products but should also include the general benchmarking aspect. I will say with confidence that running single server benchmarks at a maximum of 200 Mtriples of &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x11327f10&quot;&gt;data&lt;/a&gt; is not stretching anything.&lt;/p&gt;

&lt;p&gt;So to ameliorate this situation, I thought to run the same at 10x the scale on a couple of large boxes we have access to. 1 and 2 billion triples are still comfortably single server scales. Then we could go, for example, to Giovanni&amp;#39;s cluster at &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Digital_Enterprise_Research_Institute&quot; id=&quot;link-id0x1bfaffa0&quot;&gt;DERI&lt;/a&gt; and do 10 and 20 billion triples; this should fly reasonably on 8 or 16 nodes of the DERI gear. Or we might talk to SEALS, who by now should have their own lab. Even Amazon &lt;a class=&quot;auto-href&quot; href=&quot;http://aws.amazon.com/ec2/&quot; id=&quot;link-id0x1bfafef8&quot;&gt;EC2&lt;/a&gt; might be an option, although not the preferred one.&lt;/p&gt;

&lt;p&gt;So I asked everybody about config instructions, which produced a certain amount of dismay as I might be said to be biased and to be skirting the edges of conflict of interest. The inquiry was not altogether negative though, since &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Ontotext&quot; id=&quot;link-id0x1eccc1e0&quot;&gt;Ontotext&lt;/a&gt; and &lt;a class=&quot;auto-href&quot; href=&quot;http://freebase.com/guid/9202a8c04000641f8000000005c908d6&quot; id=&quot;link-id0x1eccc208&quot;&gt;Garlik&lt;/a&gt; provided some information. We will look into these this week and next. We will not publish any information without asking first.&lt;/p&gt;

&lt;p&gt;In this series of posts I will only talk about &lt;a class=&quot;auto-href&quot; href=&quot;http://www.openlinksw.com/dataspace/organization/openlink#this&quot; id=&quot;link-id0x1bfa4030&quot;&gt;OpenLink Software&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
&lt;i&gt;Benchmarks, Redux&lt;/i&gt; Series&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Benchmarks, Redux (part 1): On RDF Benchmarks &lt;i&gt;(this post)&lt;/i&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1660&quot; id=&quot;link-id0x1b668d10&quot;&gt;Benchmarks, Redux (part 2): A Benchmarking Story&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1663&quot; id=&quot;link-id0x1b3a0c08&quot;&gt;Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1665&quot; id=&quot;link-id0x1f9f1740&quot;&gt;Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1667&quot; id=&quot;link-id0x1ad929f8&quot;&gt;Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs &lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1669&quot; id=&quot;link-id0x1db437c0&quot;&gt;Benchmarks, Redux (part 6): BSBM and I/O, continued&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1671&quot; id=&quot;link-id0x17138c38&quot;&gt;Benchmarks, Redux (part 7): What Does BSBM Explore Measure?&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
 &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1673&quot; id=&quot;link-id0x1c0e74f8&quot;&gt;Benchmarks, Redux (part 8): BSBM Explore and Update &lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1675&quot; id=&quot;link-id0x1f297d10&quot;&gt;Benchmarks, Redux (part 9): BSBM With Cluster&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
 &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1677&quot; id=&quot;link-id0x1e4994b8&quot;&gt;Benchmarks, Redux (part 10): LOD2 and the Benchmark Process&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
 &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1678&quot; id=&quot;link-id0x1ebea6d0&quot;&gt;Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=&quot; id=&quot;link-id0x1d5c86c0&quot;&gt;Benchmarks, Redux (part 12): Our Own BSBM Results Report&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=&quot; id=&quot;link-id0x1efec0e0&quot;&gt;Benchmarks, Redux (part 13): BSBM BI Modifications &lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=&quot; id=&quot;link-id0x1a9941f8&quot;&gt;Benchmarks, Redux (part 14): BSBM BI Mix &lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=&quot; id=&quot;link-id0x1ea26de8&quot;&gt;Benchmarks, Redux (part 15): BSBM Test Driver Enhancements &lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Virtuoso Directions for 2011</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-01-19#1650</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1650#comments</comments><pubDate>Wed, 19 Jan 2011 16:29:37 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2011-01-20T12:54:42.000002-05:00</n0:modified><description>&lt;p&gt;
&lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1603&quot; id=&quot;link-id0x1d584720&quot;&gt;At the start of 2010, I wrote&lt;/a&gt; that 2010 would be the year when &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x2007b778&quot;&gt;RDF&lt;/a&gt; became performance- and cost-competitive with relational technology for &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x7f5bf68&quot;&gt;data&lt;/a&gt; warehousing and analytics. More specifically, RDF would shine where data was heterogenous and/or where there was a high frequency of &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Database_schema&quot; id=&quot;link-id0x1ffa18b0&quot;&gt;schema&lt;/a&gt; change.&lt;/p&gt;

&lt;p&gt;I will now discuss what we have done towards this end in 2010 and how you will gain by this in 2011.&lt;/p&gt;

&lt;p&gt;At the start of 2010, we had internally demonstrated 4x space efficiency gains from column-wise compression and 3x loop join speed gains from vectored execution. To recap, &lt;i&gt;column-wise compression&lt;/i&gt; means a column-wise storage layout where values of consecutive rows of a single column are consecutive in memory/disk and are compressed in a manner that benefits from the homogenous data type and possible sort order of the column. &lt;i&gt;Vectored execution&lt;/i&gt; means passing large numbers of query variable bindings between query operators and possibly sorting inputs to joins for improving locality. Furthermore, always operating on large sets of values gives extra opportunities for parallelism, from instruction level to threads to scale out.&lt;/p&gt;

&lt;p&gt;So, during 2010, we integrated these technologies into &lt;a class=&quot;auto-href&quot; href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x1fdf3f90&quot;&gt;Virtuoso&lt;/a&gt;, for relational- and graph-based applications alike. Further, even if we say that RDF will be close to relational speed in Virtuoso, the point is moot if Virtuoso&amp;#39;s relational speed is not up there with the best of analytics-oriented &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id0x7bf0d40&quot;&gt;RDBMS&lt;/a&gt;. RDF performance does rest on the basis of general-purpose database performance; what is sauce for the goose is sauce for the gander. So we reimplemented &lt;code&gt;&lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Hash_join&quot; id=&quot;link-id0x7815c60&quot;&gt;HASH JOIN&lt;/a&gt;&lt;/code&gt; and &lt;code&gt;GROUP BY&lt;/code&gt;, and fine-tuned many of the tricks required by &lt;a class=&quot;auto-href&quot; href=&quot;http://www.tpc.org/&quot; id=&quot;link-id0x213d6de8&quot;&gt;TPC&lt;/a&gt;-H. &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id0x1fd92690&quot;&gt;TPC-H&lt;/a&gt; is not the sole final destination, but it is a step on the way and a valuable checklist for what a database ought to do.&lt;/p&gt;

&lt;p&gt;At the Semdata workshop of &lt;a class=&quot;auto-href&quot; href=&quot;http://www.vldb2010.org/&quot; id=&quot;link-id0x21178a50&quot;&gt;VLDB 2010&lt;/a&gt; &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1632&quot; id=&quot;link-id0x1de8fee8&quot;&gt;we presented some results&lt;/a&gt; of our column store applied to RDF and relational tasks. As noted in the paper, the implementation did demonstrate significant gains over the previous row-wise architecture but was not yet well optimized, so not ready to be compared with the best of the relational analytics world. A good part of the fall of 2010 went into optimizing the column store and completing functionality such as transaction support with columns.&lt;/p&gt;

&lt;p&gt;A lot of this work is not specifically RDF oriented, but all of this work is constantly informed by the specific requirements of RDF. For example, the general idea of vectored execution is to eliminate overheads and optimize &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Central_processing_unit&quot; id=&quot;link-id0x7ae0d58&quot;&gt;CPU&lt;/a&gt; &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Cache&quot; id=&quot;link-id0x7bb7150&quot;&gt;cache&lt;/a&gt; and other locality by doing single query operations on arrays of operands so that the whole batch runs more or less in CPU cache. Are the gains not lost if data is typed at run time, as in RDF? In fact, the cost of run-time-typing turns out to be small, since data in practice tends to be of homogenous type and with locality of reference in values. Virtuoso&amp;#39;s column store implementation resembles in broad outline other column stores like &lt;a class=&quot;auto-href&quot; href=&quot;http://www.vertica.com/&quot; id=&quot;link-id0x7f61080&quot;&gt;Vertica&lt;/a&gt; or &lt;a class=&quot;auto-href&quot; href=&quot;http://www.ingres.com/vectorwise/&quot; id=&quot;link-id0x2154ce38&quot;&gt;VectorWise&lt;/a&gt;, the main difference being the built-in support for run-time heterogenous types.&lt;/p&gt;

&lt;p&gt;The &lt;a class=&quot;auto-href&quot; href=&quot;http://lod2.eu/&quot; id=&quot;link-id0x755e668&quot;&gt;LOD2&lt;/a&gt; EU FP 7 project &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1630&quot; id=&quot;link-id0x1d8eaf28&quot;&gt;started in September 2010&lt;/a&gt;. In this project OpenLink and the celebrated heroes of the column store, &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/National_Research_Institute_for_Mathematics_and_Computer_Science&quot; id=&quot;link-id0x1feba470&quot;&gt;CWI&lt;/a&gt; of &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/MonetDB&quot; id=&quot;link-id0x223bbe70&quot;&gt;MonetDB&lt;/a&gt; and VectorWise fame, represent the database side.&lt;/p&gt;

&lt;p&gt;The first database task of LOD2 is making a survey of the state of the art and a round of benchmarking of RDF stores. The &lt;a class=&quot;auto-href&quot; href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id0x20f50c20&quot;&gt;Berlin SPARQL Benchmark&lt;/a&gt; (&lt;a class=&quot;auto-href&quot; href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id0x780c430&quot;&gt;BSBM&lt;/a&gt;) has accordingly evolved to include a business intelligence section and an update stream. Initial results from running these will become available in February/March, 2011. The specifics of this process merit another post; let it for now be said that benchmarking is making progress. In the end, it is our conviction that we need a situation where vendors may publish results as and when they are available and where there exists a well defined process for documenting and checking results.&lt;/p&gt;

&lt;p&gt;LOD2 will continue by &lt;i&gt;linking the universe,&lt;/i&gt; as I half-facetiously put it on a presentation slide. This means alignment of anything from schema to instance identifiers, with and without supervision, and always with provenance, summarization, visualization, and so forth. In fact, putting it this way, this gets to sound like the old chimera of generating applications from data or allowing users to derive actionable intelligence from data of which they do not even know the structure. No, we are not that unrealistic. But we are moving toward more ad-hoc discovery and faster time to answer. And since we provide an infrastructure element under all this, we want to do away with the &amp;quot;RDF tax,&amp;quot; by which we mean any significant extra cost of RDF compared to an alternate technology. To put it another way, you ought to pay for unpredictable heterogeneity or complex inference only when you actually use them, not as a fixed up-front overhead.&lt;/p&gt;

&lt;p&gt;So much for promises. When will you see something? It is safe to say that we cannot very well publish benchmarks of systems that are not generally available in some form. This places an initial technology preview cut of Virtuoso 7 with vectored execution somewhere in January or early February. The column store feature will be built in, but more than likely the row-wise compressed RDF format of Virtuoso 6 will still be the default. Version 6 and 7 databases will be interchangeable unless column-store structures are used.&lt;/p&gt;

&lt;p&gt;For now, our priority is to release the substantial gains that have already been accomplished.&lt;/p&gt;

&lt;p&gt;After an initial preview cut, we will return to the agenda of making sure Virtuoso is up there with the best in relational analytics, and that the equivalent workload with an RDF data model runs as close as possible to relational performance. As a first step this means taking TPC-H as is, and then converting the data and queries to the trivially equivalent RDF and &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x25716618&quot;&gt;SPARQL&lt;/a&gt; and seeing how it goes. In &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1627&quot; id=&quot;link-id0x1af60d40&quot;&gt;the September paper&lt;/a&gt; we dabbled a little with the data at a small scale but now we must run the full set of queries at 100GB and 300GB scales, which come to about 14 billion and 42 billion triples, respectively. A well done analysis of the issues encountered, covering similarities and dissimilarities of the implementation of the workload as &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x223b0a88&quot;&gt;SQL&lt;/a&gt; and SPARQL, should make a good VLDB paper.&lt;/p&gt;
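
&lt;p&gt;As a hedged illustration of what the trivially equivalent SPARQL might look like, here is a sketch modeled on the pricing summary query (Q1) of TPC-H, using SPARQL 1.1 aggregate syntax. The &lt;code&gt;tpch:&lt;/code&gt; namespace and its property names are hypothetical, and the sketch assumes that each column of a &lt;code&gt;LINEITEM&lt;/code&gt; row becomes one triple on a per-row subject; an actual mapping may well differ. The date filter also assumes that the store can order &lt;code&gt;xsd:date&lt;/code&gt; values, which the SPARQL operator table only guarantees for &lt;code&gt;xsd:dateTime&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical vocabulary: the tpch: namespace and property names are illustrative only.
PREFIX tpch: &amp;lt;http://example.org/tpch#&amp;gt;
PREFIX xsd:  &amp;lt;http://www.w3.org/2001/XMLSchema#&amp;gt;

SELECT ?returnflag ?linestatus
       (SUM(?quantity) AS ?sum_qty)
       (SUM(?extendedprice * (1 - ?discount)) AS ?sum_disc_price)
       (COUNT(*) AS ?count_order)
WHERE {
  # One triple pattern per column that the SQL query touches.
  ?li a tpch:Lineitem ;
      tpch:returnflag    ?returnflag ;
      tpch:linestatus    ?linestatus ;
      tpch:quantity      ?quantity ;
      tpch:extendedprice ?extendedprice ;
      tpch:discount      ?discount ;
      tpch:shipdate      ?shipdate .
  FILTER (?shipdate &amp;lt;= &amp;quot;1998-09-02&amp;quot;^^xsd:date)
}
GROUP BY ?returnflag ?linestatus
ORDER BY ?returnflag ?linestatus&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Each column referenced in the SQL version turns into a triple pattern on the same subject, so what is a single table scan in SQL becomes a multi-way self-join over the triples unless the store recognizes the pattern.&lt;/p&gt;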

&lt;p&gt;Database performance is an entirely open-ended quest and the bag of potentially applicable tricks is as good as infinite. Having said this, it seems that the scales comfortably reached in the TPC benchmarks are more than adequate for pretty much anything one is likely to encounter in real world applications involving comparable workloads. Businesses getting over 6 million new order transactions per minute (the high score of TPC-C) or analyzing a warehouse of 60 billion orders shipped to 6 billion customers over 7 years (10,000GB, i.e., 10TB, TPC-H) are not very common if they exist at all.&lt;/p&gt;

&lt;p&gt;The real world frontier has moved on. Scaling up the TPC workloads remains a generally useful exercise that continues to contribute to the state of the art but the applications requiring this advance are changing.&lt;/p&gt;

&lt;p&gt;Someone once said that for a new technology to become mainstream, it needs to solve a new class of problem. Yes, while it is a preparatory step to run TPC-H translated to SPARQL without dying of overheads, there is little point in doing this in production since SQL is anyway likely better and already known, proven, and deployed.&lt;/p&gt;

&lt;p&gt;The new class of problem, as LOD2 sees it, is the matter of web-wide cross-organizational data integration. Web-wide does not necessarily mean crawling the whole web, but does tend to mean running into significant heterogeneity of sources, both in terms of modeling and in terms of usage of more-or-less standard data models. Around this topic we hear two messages. The database people say that inference beyond what you can express in SQL views is theoretically nice but practically not needed; on the other side, we hear that the inference now being standardized in efforts like &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Rule_Interchange_Format&quot; id=&quot;link-id0x22b3ad68&quot;&gt;RIF&lt;/a&gt; and &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Web_Ontology_Language&quot; id=&quot;link-id0x22b3ad90&quot;&gt;OWL&lt;/a&gt; is not expressive enough for the real world. As one expert put it, &lt;i&gt;if enterprise data integration in the 1980s was between a few databases, today it is more like between 1000 databases,&lt;/i&gt; which makes this matter similar to searching the web. How can one know in such a situation that the data being aggregated is in fact meaningfully aggregate-able?&lt;/p&gt;

&lt;p&gt;Add to this the prevalence of unstructured data in the world and the need to mine it for actionable intelligence. Think of combining data from CRM, worldwide media coverage of own and competitive brands, and in-house emails for assessing organizational response to events on the market.&lt;/p&gt;

&lt;p&gt;These are the actual use cases for which we need RDF at relational DW performance and scale. This is not limited to RDF and OWL profiles, since we fully believe that inference needs are more diverse. The reason why this is RDF and not SQL plus some extension of &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Datalog&quot; id=&quot;link-id0x7ee5130&quot;&gt;Datalog&lt;/a&gt;, is the widespread adoption of RDF and &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id0x2111f968&quot;&gt;linked data&lt;/a&gt; as a data publishing format, with all the schema-last and &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Open_world_assumption&quot; id=&quot;link-id0x2111f990&quot;&gt;open world&lt;/a&gt; aspects that have been there from the start.&lt;/p&gt;

&lt;p&gt;Stay tuned for more news later this month!&lt;/p&gt;

&lt;h3&gt;Related&lt;/h3&gt;
&lt;ul&gt;
 &lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1603&quot; id=&quot;link-id0x1de6b370&quot;&gt;Linked Data and Virtuoso in 2010&lt;/a&gt;
 &lt;/li&gt;
 &lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1510&quot; id=&quot;link-id0x1b031180&quot;&gt;Linked Data &amp;amp; The Year 2009&lt;/a&gt;
 &lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1286&quot; id=&quot;link-id0x1a582d10&quot;&gt;Retrospective and Outlook for 2008&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>VLDB Semdata Workshop</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-09-21#1635</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1635#comments</comments><pubDate>Tue, 21 Sep 2010 21:14:14 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2010-09-21T16:22:18-04:00</n0:modified><description>&lt;p&gt;I will begin by extending my thanks to the organizers, in particular &lt;a href=&quot;http://members.deri.at/~retok&quot; id=&quot;link-id0x236ebfd0&quot;&gt;Reto Krummenacher&lt;/a&gt; of &lt;a href=&quot;http://www.sti-innsbruck.at/&quot; id=&quot;link-id0x2371aca8&quot;&gt;STI&lt;/a&gt; and Atanas Kiryakov of &lt;a href=&quot;http://dbpedia.org/resource/Ontotext&quot; id=&quot;link-id0x22e24190&quot;&gt;Ontotext&lt;/a&gt; for inviting me to give a position paper at the workshop. Indeed, it is the builders of bridges, the pontiffs (pontifices) amongst us who shall be remembered by history. The idea of organizing a semantic &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x23781ba8&quot;&gt;data&lt;/a&gt; management workshop at VLDB is a laudable attempt at rapprochement between two communities to the advantage of all concerned.&lt;/p&gt;

&lt;p&gt;
&lt;a href=&quot;http://semanticweb.org/id/Franz_Inc&quot; id=&quot;link-id0x22e09fa8&quot;&gt;Franz&lt;/a&gt;, Ontotext, and OpenLink were the vendors present at the workshop. To summarize very briefly, &lt;a href=&quot;http://data.semanticweb.org/person/jans-aasman&quot; id=&quot;link-id0x2380e7c8&quot;&gt;Jans Aasman&lt;/a&gt; of Franz talked about the telco call center automation solution by Amdocs, where the &lt;a href=&quot;http://semanticweb.org/id/AllegroGraph&quot; id=&quot;link-id0x237c9408&quot;&gt;AllegroGraph&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x236f96a8&quot;&gt;RDF&lt;/a&gt; store is integrated. On the technical side, AllegroGraph has JavaScript as a stored procedure language, which is certainly a good idea. Naso of Ontotext talked about the BBC FIFA World Cup site. The technical proposition was that materialization is good and data partitioning is not needed; a set of replicated read-only copies is good enough.&lt;/p&gt;

&lt;p&gt;I talked about making RDF cost competitive with relational for data integration and BI. The crux is space efficiency and column store techniques.&lt;/p&gt;

&lt;p&gt;One question that came up was that maybe RDF could approach relational in some things, but what about string literals being stored in a separate table? Or &lt;a href=&quot;http://dbpedia.org/resource/Uniform_Resource_Identifier&quot; id=&quot;link-id0x22ff2c78&quot;&gt;URI&lt;/a&gt; strings being stored in a separate table?&lt;/p&gt;

&lt;p&gt;The answer is that if one accesses a lot of these literals, the access will be local and fairly efficient. If one accesses just a few, it does not matter. For user-facing reports, there is no point in returning a million strings that the user will not read anyhow. But then it turned out that there in fact exist reports in bioinformatics where there are 100,000 strings. Now take the worst abuse of &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x236e43f8&quot;&gt;SPARQL&lt;/a&gt;: a regexp over all literals in a property of a given class. With a column store this is a scan of the column; with RDF, a three-table join. The join is about 10x slower than the column scan. This is quite OK, considering that a full text index is the likely solution for such workloads anyway. Besides, a sensible relational &lt;a href=&quot;http://dbpedia.org/resource/Database_schema&quot; id=&quot;link-id0x22e31050&quot;&gt;schema&lt;/a&gt; will also not use strings for foreign keys, and will therefore incur a similar burden from fetching the strings before returning the result.&lt;/p&gt;
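
&lt;p&gt;For concreteness, the kind of query meant here might look like the following sketch. The &lt;code&gt;ex:&lt;/code&gt; class and property names are made up for illustration and do not come from any of the reports mentioned.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical vocabulary: ex:Protein and ex:description are illustrative only.
PREFIX ex: &amp;lt;http://example.org/schema#&amp;gt;

SELECT ?s ?text
WHERE {
  ?s a ex:Protein ;
     ex:description ?text .
  # Without a full text index, every literal of the property is fetched and matched.
  FILTER regex(str(?text), &amp;quot;kinase.*inhibitor&amp;quot;, &amp;quot;i&amp;quot;)
}&lt;/code&gt;&lt;/pre&gt;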

&lt;p&gt;Another question was about whether the attitude was one of confrontation between RDF and relational and whether it would not be better to join forces. Well, as said in my talk, sauce for the goose is sauce for the gander and generally speaking relational techniques apply equally to RDF. There are a few RDB tricks that have no RDF equivalent, like clustering a fact table on dimension values, e.g., sales ordered by country, manufacturer, month. But by and large, column-store techniques apply. The execution engine can be essentially identical, just needing a couple of extra data types and some run-time typing and in some cases producing nulls instead of errors. Query &lt;a href=&quot;http://dbpedia.org/resource/Program_optimization&quot; id=&quot;link-id0x237d76e0&quot;&gt;optimization&lt;/a&gt; is much the same, except that RDB stats are not applicable as such; one needs to sample the data in the cost model. All in all, these adaptations to an RDB are not so large, even though they do require changes to source code.&lt;/p&gt;

&lt;p&gt;Another question was about combining data models, e.g., relational (rows and columns), RDF (graph), &lt;a href=&quot;http://dbpedia.org/resource/XML&quot; id=&quot;link-id0x23845418&quot;&gt;XML&lt;/a&gt; (tree), and full text. Here I would say that it is a fault of our messaging that we do not constantly repeat the necessity of this combining, as we take it for granted. Most RDF stores have a full text index on literal values. OWLIM and a &lt;a href=&quot;http://dbpedia.org/resource/National_Research_Institute_for_Mathematics_and_Computer_Science&quot; id=&quot;link-id0x22feefa0&quot;&gt;CWI&lt;/a&gt; prototype even have it for URIs. XML is a valid data type for an RDF literal, even though this does not get used very much. So doing SPARQL to select the values, and then doing &lt;a href=&quot;http://dbpedia.org/resource/XPath&quot; id=&quot;link-id0x235b5890&quot;&gt;XPath&lt;/a&gt; and XSLT on the values, is entirely possible, at least in &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x237f6428&quot;&gt;Virtuoso&lt;/a&gt; which has an XPath/XSLT engine built in. Same for invoking SPARQL from an XSLT sheet. Colocating a native &lt;a href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id0x238265a8&quot;&gt;RDBMS&lt;/a&gt; with local and federated &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x236f7bc8&quot;&gt;SQL&lt;/a&gt; is what Virtuoso has always done. One can, for example, map tables in heterogenous remote RDBs into tables in Virtuoso, then map these into RDF, and run SPARQL queries that get translated into SQL against the original tables, thereby getting SPARQL access without any materialization. Alongside this, one can ETL relational data into RDF via the same declarative mapping.&lt;/p&gt;

&lt;p&gt;Further, there are RDF extensions for geospatial queries in Virtuoso and AllegroGraph, and soon also in others.&lt;/p&gt;

&lt;p&gt;With all this cross-model operation, RDF is definitely not a closed island. We&amp;#39;ll have to repeat this more.&lt;/p&gt;

&lt;p&gt;Of the academic papers, SpiderStore (the &lt;a href=&quot;http://dbis-informatik.uibk.ac.at/5-1-Publications.html&quot; id=&quot;link-id0x19ecd3f0&quot;&gt;paper&lt;/a&gt; is not yet available at time of writing, but should be soon) and &lt;a href=&quot;http://www.few.vu.nl/~jui200/webpie.html&quot; id=&quot;link-id0x1d60a498&quot;&gt;Webpie&lt;/a&gt; should be specially noted.&lt;/p&gt;

&lt;p&gt;Let us talk about SpiderStore first.&lt;/p&gt;

&lt;h2&gt;SpiderStore&lt;/h2&gt;

&lt;p&gt;The SpiderStore from the University of Innsbruck is a main-memory-only system that has a record for each distinct IRI. The IRI record has one array of pointers to all IRI records that are objects where the referencing record is the subject, and a similar array of pointers to all records where the referencing record is the object. Both sets of pointers are clustered based on the predicate labeling the edge.&lt;/p&gt;

&lt;p&gt;According to the authors (Robert Binna, Wolfgang Gassler, Eva Zangerle, Dominic Pacher, and Günther Specht), a distinct IRI is 5 pointers and each triple is 3 pointers. With the per-IRI cost presumably amortized over the triples that mention the IRI, this would make about 4 pointers per triple, i.e., 32 bytes with 64-bit pointers.&lt;/p&gt;

&lt;p&gt;This is not particularly memory efficient, since one must also count unused space after growing the lists, fragmentation, etc., which will bring the space consumption closer to 40 bytes per triple. Should one add a graph to the mix, one would need another pointer per distinct predicate, adding another 1-4 bytes per triple. Supporting non-IRI types in the object position is not a problem, as long as all distinct values have a chunk of memory to them with a type &lt;a href=&quot;http://dbpedia.org/resource/Tag&quot; id=&quot;link-id0x236fe4d0&quot;&gt;tag&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We get a few times better memory efficiency with column compressed quads, plus we are not limited to main memory.&lt;/p&gt;

&lt;p&gt;But SpiderStore has a point. Making the traversal of an edge in the graph into a pointer dereference is not such a bad deal, especially if the data set is not that big. Furthermore, compiling the queries into &lt;a href=&quot;http://dbpedia.org/resource/C%2B%2B&quot; id=&quot;link-id0x235a2228&quot;&gt;C&lt;/a&gt; procedures playing with the pointers alone would give performance to match or exceed any hard coded graph traversal library and would not be very difficult. Supporting multithreaded updates would spoil much of the gain but allowing single threaded updates and forking read-only copies for reading would be fine.&lt;/p&gt;

&lt;p&gt;SpiderStore as such is not attractive for what we intend to do, namely aggregating RDF quads in volumes far exceeding main memory and scaling to clusters. We note that SpiderStore hits problems with distributed memory, since SpiderStore executes depth first, which is manifestly impossible if significant latencies are involved. In other words, if there can be latency, one must amortize by having a lot of other possible work available. Running with long vectors of values is one way, as in &lt;a href=&quot;http://dbpedia.org/resource/MonetDB&quot; id=&quot;link-id0x236e14a0&quot;&gt;MonetDB&lt;/a&gt; or Virtuoso Cluster. The other way is to have a massively multithreaded platform which favors code with few instructions but little memory locality. SpiderStore could be a good fit for massive multithreading, especially if queries were compiled to C, dramatically cutting down on the count of instructions to execute.&lt;/p&gt;

&lt;p&gt;We too could adopt some ideas from SpiderStore. Namely, if running vectored, one just in passing, without extra overhead, generates an array of links to the next IRI, a bit like the array that SpiderStore has for each predicate for the incoming and outgoing edges of a given IRI. Of course, here these would be persistent IDs and not pointers, but a hash from one to the other takes almost no time. So, while SpiderStore alone may not be what we are after for data warehousing, Spiderizing parts of the working set would not be so bad. This is especially so since the Spiderizable data structure almost gets made as a by-product of query evaluation.&lt;/p&gt;

&lt;p&gt;If an algorithm made several passes over a relatively small subgraph of the whole database, Spiderizing it would accelerate things. The memory overhead could have a fixed cap so as not to ruin the working set if locality happened not to hold.&lt;/p&gt;

&lt;p&gt;Running a SpiderStore-like execution model on vectors instead of single values would likely do no harm and might even result in better &lt;a href=&quot;http://dbpedia.org/resource/Cache&quot; id=&quot;link-id0x237eb508&quot;&gt;cache&lt;/a&gt; behavior. The exception is in the event of completely unpredictable patterns of connections which may only be amortized by massive multithreading.&lt;/p&gt;

&lt;h2&gt;Webpie&lt;/h2&gt;

&lt;p&gt;Webpie from &lt;a href=&quot;http://www.vu.nl/&quot; id=&quot;link-id0x23811bf8&quot;&gt;VU Amsterdam&lt;/a&gt; and the &lt;a href=&quot;http://www.larkc.eu/&quot; id=&quot;link-id0x22ff8fe8&quot;&gt;LarKC&lt;/a&gt; EU FP 7 project is, as it were, the opposite of SpiderStore. This is a map-reduce-based RDFS and &lt;a href=&quot;http://dbpedia.org/resource/Web_Ontology_Language&quot; id=&quot;link-id0x238482a0&quot;&gt;OWL&lt;/a&gt; Horst inference engine which is all about breadth-first passes over the data in a map-reduce framework with intermediate disk-based storage.&lt;/p&gt;

&lt;p&gt;Webpie is not, however, a database. After the inference result has been materialized, it must be loaded into a SPARQL engine in order to evaluate a query against the result.&lt;/p&gt;

&lt;p&gt;The execution plan of Webpie is made from the ontology whose consequences must be materialized. The steps are sorted and run until a fixed point is reached for each. This is similar to running SPARQL &lt;code&gt;INSERT … SELECT&lt;/code&gt; statements until no new inserts are produced. The only requirement is that the &lt;code&gt;INSERT&lt;/code&gt; statement should report whether new inserts were actually made. This is easy to do. In this way, a comparison between map-reduce plus memory-based joining and a parallel RDF database could be made.&lt;/p&gt;
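
&lt;p&gt;As a sketch of one such step, assuming standard SPARQL 1.1 Update syntax rather than any particular engine&amp;#39;s dialect, the RDFS rule that propagates class membership up the subclass hierarchy could be written as below. A driver, not shown, would re-run the statement until it adds nothing; how the number of inserted triples is reported back is engine-specific.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;PREFIX rdf:  &amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&amp;gt;
PREFIX rdfs: &amp;lt;http://www.w3.org/2000/01/rdf-schema#&amp;gt;

# One fixed-point step: if ?x is a ?sub and ?sub is a subclass of ?super, then ?x is a ?super.
INSERT { ?x rdf:type ?super }
WHERE {
  ?x   rdf:type        ?sub .
  ?sub rdfs:subClassOf ?super .
  # Skip triples that already exist, so an empty result signals the fixed point.
  FILTER NOT EXISTS { ?x rdf:type ?super }
}&lt;/code&gt;&lt;/pre&gt;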

&lt;p&gt;We have suggested such an experiment to the LarKC people. We will see.&lt;/p&gt;</description></item><item><title>LOD2 Kick Off</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-09-21#1633</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1633#comments</comments><pubDate>Tue, 21 Sep 2010 21:13:03 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2010-09-21T16:22:12-04:00</n0:modified><description>&lt;p&gt;The &lt;a href=&quot;http://lod2.eu/&quot; id=&quot;link-id0x22e06810&quot;&gt;LOD2&lt;/a&gt; &lt;a href=&quot;http://lod2.eu/BlogPost/9-press-release-lod2-project-launch.html&quot; id=&quot;link-id0x18c0c770&quot;&gt;kick off meeting&lt;/a&gt; was held in Leipzig on Sept 6-8. I will here talk about OpenLink plans as concerns LOD2; hence this is not to be taken as representative of the whole project. I will first discuss the immediate and conclude with the long term.&lt;/p&gt;

&lt;p&gt;As concerns OpenLink specifically, we have two short term activities, namely publishing the initial LOD2 repository in December and publishing a set of RDB and &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x22f9ba70&quot;&gt;RDF&lt;/a&gt; benchmarks in February.&lt;/p&gt;

&lt;p&gt;The LOD2 repository is a fusion of the OpenLink &lt;a href=&quot;http://community.linkeddata.org/dataspace/organization/lod#this&quot; id=&quot;link-id0x2378d288&quot;&gt;LOD&lt;/a&gt; &lt;a href=&quot;http://lod.openlinksw.com/&quot; id=&quot;link-id0x23908828&quot;&gt;Cloud&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/Cache&quot; id=&quot;link-id0x2378e6c8&quot;&gt;Cache&lt;/a&gt; (which includes &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x237d7d20&quot;&gt;data&lt;/a&gt; from &lt;a href=&quot;http://uriburner.com/&quot; id=&quot;link-id0x237c9408&quot;&gt;URIBurner&lt;/a&gt; and &lt;a href=&quot;http://www.pingthesemanticweb.com/&quot; id=&quot;link-id0x235b03b0&quot;&gt;PingTheSemanticWeb&lt;/a&gt;) and &lt;a href=&quot;http://sindice.com/&quot; id=&quot;link-id0x22e24190&quot;&gt;Sindice&lt;/a&gt;, both hosted at &lt;a href=&quot;http://dbpedia.org/resource/Digital_Enterprise_Research_Institute&quot; id=&quot;link-id0x237b80f8&quot;&gt;DERI&lt;/a&gt;. The value-add compared to Sindice or the &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x237b63c0&quot;&gt;Virtuoso&lt;/a&gt;-based LOD Cloud Cache alone is the merger of the timeliness and ping-ping crawling of Sindice with the &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x237f7568&quot;&gt;SPARQL&lt;/a&gt; of Virtuoso.&lt;/p&gt;

&lt;p&gt;Further down the road, after we migrate the system to the Virtuoso column store, we will also see gains in performance, primarily due to much better working set, as data is many times more compact than with the present row-wise &lt;a href=&quot;http://dbpedia.org/resource/Data_compression&quot; id=&quot;link-id0x235b0c38&quot;&gt;key compression&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Still further, but before next September, we will have dynamic repartitioning; the time of availability is set as this is part of the LOD2 project roadmap. The operational need for this is pushed back somewhat by the compression gains from column-wise storage.&lt;/p&gt;

&lt;p&gt;As for benchmarks, I just compiled &lt;a href=&quot;http://www.openlinksw.com/weblogs/oerling/&quot; id=&quot;link-id0x1c29e720&quot;&gt;a draft of suggested extensions to the BSBM&lt;/a&gt; (&lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id0x22e31050&quot;&gt;Berlin SPARQL Benchmark&lt;/a&gt;). I talked about this with &lt;a href=&quot;http://nl.linkedin.com/in/peterboncz&quot; id=&quot;link-id0x237c90b0&quot;&gt;Peter Boncz&lt;/a&gt; and &lt;a href=&quot;http://data.semanticweb.org/person/christian-bizer&quot; id=&quot;link-id0x23813340&quot;&gt;Chris Bizer&lt;/a&gt;, to the effect that some extensions of BSBM could be done but that the time was a bit short for making a RDF-specific benchmark. We do recall that BSBM is fully feasible with a relational &lt;a href=&quot;http://dbpedia.org/resource/Database_schema&quot; id=&quot;link-id0x236f7ef8&quot;&gt;schema&lt;/a&gt; and that RDF offers no fundamental edge for the workload.&lt;/p&gt;

&lt;p&gt;There was a graph benchmark talk at the &lt;a href=&quot;http://www.tpc.org/&quot; id=&quot;link-id0x236f8170&quot;&gt;TPC&lt;/a&gt; workshop at &lt;a href=&quot;http://www.vldb2010.org/&quot; id=&quot;link-id0x235c6b90&quot;&gt;VLDB 2010&lt;/a&gt;. There too, the authors were suggesting a social network use case for benchmarking anything from RDF stores to graph libraries. The presentation did not include any specification of test data, so it may be that some cooperation is possible there. The need for such a benchmark is well acknowledged. The final form of the benchmark is not yet set, but LOD2 will in time publish results from it.&lt;/p&gt;

&lt;p&gt;We did informally talk about a process for publishing with our colleagues from &lt;a href=&quot;http://semanticweb.org/id/Franz_Inc&quot; id=&quot;link-id0x23781d28&quot;&gt;Franz&lt;/a&gt; and &lt;a href=&quot;http://dbpedia.org/resource/Ontotext&quot; id=&quot;link-id0x23782740&quot;&gt;Ontotext&lt;/a&gt; at VLDB 2010. The idea is that vendors tune their own systems and do the runs and that the others check on this, preferably all using the same hardware.&lt;/p&gt;

&lt;p&gt;Now, the LOD2 benchmarks will also include relational-to-RDF comparisons, for example TPC-&lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id0x235a3568&quot;&gt;H&lt;/a&gt; in &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x22e67370&quot;&gt;SQL&lt;/a&gt; and SPARQL. The SQL will be Virtuoso, &lt;a href=&quot;http://dbpedia.org/resource/MonetDB&quot; id=&quot;link-id0x22e70db0&quot;&gt;MonetDB&lt;/a&gt;, and possibly &lt;a href=&quot;http://www.ingres.com/vectorwise/&quot; id=&quot;link-id0x2378f750&quot;&gt;VectorWise&lt;/a&gt; and others, depending on what legal restrictions apply at the time. This will give an RDF-to-SQL comparison of TPC-H at least on Virtuoso, later also on MonetDB, depending on the schedule for a MonetDB SPARQL front-end.&lt;/p&gt;

&lt;p&gt;In the immediate term, this of course focuses our efforts on productizing the Virtuoso column store extension and the optimizations that go with it.&lt;/p&gt;

&lt;p&gt;LOD2 is however about much more than database benchmarks. Over the longer term, we plan to apply suitable parts of the ground-breaking database research done at &lt;a href=&quot;http://dbpedia.org/resource/National_Research_Institute_for_Mathematics_and_Computer_Science&quot; id=&quot;link-id0x23911830&quot;&gt;CWI&lt;/a&gt; to RDF use cases.&lt;/p&gt;

&lt;p&gt;This involves anything from adaptive indexing, to reuse and caching of intermediate results, to adaptive execution. This is however more than just mapping column store concepts to RDF. New challenges are posed by running on clusters and dealing with more expressive queries than just SQL, in specific queries with Datalog-like rules and recursion.&lt;/p&gt;

&lt;p&gt;LOD2 is principally about integration and alignment, from the schema to the instance level. This involves complex batch processing, close to the data, on large volumes of data. Map-reduce is not the be-all-end-all of this. Of course, a parallel database like Virtuoso, &lt;a href=&quot;http://dbpedia.org/resource/Greenplum&quot; id=&quot;link-id0x22feb520&quot;&gt;Greenplum&lt;/a&gt;, or &lt;a href=&quot;http://www.vertica.com/&quot; id=&quot;link-id0x237f7428&quot;&gt;Vertica&lt;/a&gt; can do map-reduce style operations under control of the SQL engine. After all, the SQL engine needs to do map-reduce and a lot more to provide good throughput for parallel, distributed SQL. Something like the &lt;a href=&quot;http://www.eecs.berkeley.edu/Research/Projects/Data/105733.html&quot; id=&quot;link-id0x235c2e28&quot;&gt;Berkeley Orders Of Magnitude&lt;/a&gt; (&lt;a href=&quot;http://www.eecs.berkeley.edu/Research/Projects/Data/105733.html&quot; id=&quot;link-id0x2380e7c8&quot;&gt;BOOM&lt;/a&gt;) distributed Datalog implementation (Overlog, Dedalus, Bloom) could be a parallel computation framework that would subsume any map-reduce-style functionality under a more elegant declarative framework while still leaving control of execution to the developer for the cases where this is needed.&lt;/p&gt;

&lt;p&gt;From our viewpoint, the project&amp;#39;s gains include:&lt;/p&gt;

&lt;ul&gt;
 &lt;li&gt;
  &lt;p&gt;Significant narrowing of the RDB to RDF performance gap. RDF will be an option for large scale warehousing, cutting down on time to integration by providing greater schema flexibility.&lt;/p&gt;
 &lt;/li&gt;
&lt;li&gt;
  &lt;p&gt;Ready to use toolbox for data integration, including schema alignment and resolution of coreference.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;p&gt;Data discovery, summarization, and visualization.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Integrating this into a relatively unified stack of tools is possible, since these all cluster around the task of linking the universe with RDF and &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id0x236e14a0&quot;&gt;linked data&lt;/a&gt;. In this respect the integration of results may be stronger than often seen in European large scale integrating projects.&lt;/p&gt;

&lt;p&gt;The use cases fit the development profile well: &lt;/p&gt;
&lt;ul&gt;
 &lt;li&gt;
  &lt;p&gt;
    &lt;a href=&quot;http://dbpedia.org/resource/Wolters_Kluwer&quot; id=&quot;link-id0x23820568&quot;&gt;Wolters Kluwer&lt;/a&gt; will develop an application for integrating resources around law, from the actual laws to court cases to media coverage. The content is modeled in a fine grained legal ontology.&lt;/p&gt;
 &lt;/li&gt;
&lt;li&gt;
  &lt;p&gt;
    &lt;a href=&quot;http://dbpedia.org/resource/Exalead&quot; id=&quot;link-id0x22e50ba0&quot;&gt;Exalead&lt;/a&gt; will implement the linked data enterprise, addressing enterprise search and any typical enterprise data integration plus generating added value from open sources.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;p&gt;The Open &lt;a href=&quot;http://dbpedia.org/resource/Knowledge&quot; id=&quot;link-id0x236fb248&quot;&gt;Knowledge&lt;/a&gt; Foundation will create a portal of all government published data for easy access by citizens.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In all these cases, the integration requirements of schema alignment, resolution of identity, &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id0x2381ebb0&quot;&gt;information&lt;/a&gt; extraction, and efficient storage and retrieval play a significant role. The end user interfaces will be task-specific but developer interfaces around integration tools and query formulation may be quite generic and suited for generic RDF application development.&lt;/p&gt;</description></item><item><title>Fault Tolerance in Virtuoso Cluster Edition (Short Version)</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-04-07#1621</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1621#comments</comments><pubDate>Wed, 07 Apr 2010 16:40:02 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2010-04-14T19:12:47.000003-04:00</n0:modified><description>&lt;p&gt;We have for some time had the option of storing &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x28eb2178&quot;&gt;data&lt;/a&gt; in a cluster in multiple copies, in the Commercial Edition of &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x25178ed0&quot;&gt;Virtuoso&lt;/a&gt;. (This feature is not in and is not planned to be added to the Open Source Edition.)&lt;/p&gt;

&lt;p&gt;Based on some feedback from the field, we decided to make this feature more user friendly. The gist of the matter is that failure and recovery processes have been automated so that neither application developer nor operating personnel needs any &lt;a href=&quot;http://dbpedia.org/resource/Knowledge&quot; id=&quot;link-id0x21fea428&quot;&gt;knowledge&lt;/a&gt; of how things actually work.&lt;/p&gt;

&lt;p&gt;So I will here make a few high level statements about what we offer for fault tolerance. I will follow up with technical specifics in another post.&lt;/p&gt;

&lt;p&gt;Three types of individuals need to know about fault tolerance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Executives: What does it cost? Will it really eliminate downtime?&lt;/li&gt;
&lt;li&gt;System Administrators: Is it hard to configure? What do I do when I get an alert?&lt;/li&gt;
&lt;li&gt;Application Developers/Programmers: Will I need to write extra code? Can old applications get fault tolerance with no changes?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I will explain the matter to each of these three groups:&lt;/p&gt;

&lt;h2&gt;Executives&lt;/h2&gt;

&lt;p&gt;The value gained is elimination of downtime. The cost is in purchasing twice (or thrice) the hardware and software licenses. In reality, the cost is less since you get the whole money&amp;#39;s worth of read throughput and half the money&amp;#39;s worth of write throughput. Since most applications are about reading, this is a good deal. You do not end up paying for unused capacity.&lt;/p&gt;

&lt;p&gt;Server instances are grouped in &amp;quot;quorums&amp;quot; of two or, for extra safety, three; as long as one member of each quorum is available, the system keeps running and nobody sees a difference, except maybe for slower response. This does not protect against widespread power outage or the building burning down; the scope is limited to hardware and software failures at one site.&lt;/p&gt;
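
&lt;p&gt;The availability rule above can be sketched in a few lines of Python. This is an illustrative model only, not Virtuoso code: the function name, the instance labels, and the representation of quorums as tuples are all assumptions made for the example.&lt;/p&gt;

```python
def cluster_is_available(quorums, live_instances):
    """Return True when every quorum has at least one live member.

    quorums        -- iterable of tuples of instance labels (hypothetical names)
    live_instances -- iterable of labels of instances that are currently up
    """
    live = set(live_instances)
    return all(any(member in live for member in quorum) for quorum in quorums)


# Two quorums of two members each; every member sits on a different host.
EXAMPLE_QUORUMS = [
    ("host1:inst1", "host2:inst2"),
    ("host3:inst3", "host4:inst4"),
]

# Losing one member of each quorum still leaves the cluster running.
one_down_each = ["host2:inst2", "host3:inst3"]

# Losing both members of the first quorum takes its partition offline.
first_quorum_lost = ["host3:inst3", "host4:inst4"]
```

&lt;p&gt;Under this model, calling cluster_is_available with EXAMPLE_QUORUMS and one_down_each returns True, while first_quorum_lost returns False, which matches the statement that one surviving member per quorum is enough.&lt;/p&gt;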

&lt;p&gt;The most basic site-wide disaster recovery plan consists of constantly streaming updates off-site. Using an off-site backup plus update stream, one can reconstitute the failed data center on a cloud provider in a few hours. Details will vary; please &lt;a href=&quot;http://www.openlinksw.com/contact/&quot; id=&quot;link-id0x2bdb0db8&quot;&gt;contact us&lt;/a&gt; for specifics.&lt;/p&gt;

&lt;p&gt;Running multiple sites in parallel is also possible but specifics will depend on the application. Again, please contact us if you have a specific case in mind.&lt;/p&gt;

&lt;h2&gt;System Administrators&lt;/h2&gt;

&lt;p&gt;To configure, divide your server instances into quorums of 2 or 3, according to which will be mirrors of each other, with each quorum member on a different host from the others in its quorum. These things are declared in a configuration file. Table definitions do not have to be altered for fault tolerance. It is enough for tables and indices to specify partitioning. Use two switches, and two NICs per machine, and connect one of each server&amp;#39;s network cables to each switch, to cover switch failures.&lt;/p&gt;

&lt;p&gt;When things break, as long as there is at least one server instance up from each quorum, things will continue to work. Reboots and the like are handled without operator intervention; if there is a broken host, then remove it and put a spare in its place. If the disks are OK, put the old disks in the replacement host and start. If the disks are gone, then copy the database files from the live copy. Finally start the replacement database, and the system will do the rest. The system is online in read-write mode during all this time, including during copying.&lt;/p&gt;

&lt;p&gt;Having mirrored disks in individual hosts is optional since data will anyhow be in two copies. Mirrored disks will shorten the vulnerability window of running a partition on a single server instance since this will for the most part eliminate the need to copy many (hundreds of) GB of database files when recovering a failed instance.&lt;/p&gt;

&lt;h2&gt;Application Developers/Programmers&lt;/h2&gt;

&lt;p&gt;An application can connect to any server instance in the cluster and have access to the same data, with full &lt;a href=&quot;http://dbpedia.org/resource/ACID&quot; id=&quot;link-id0x6451870&quot;&gt;ACID&lt;/a&gt; properties.&lt;/p&gt;

&lt;p&gt;There are two types of errors that can occur in any database application: The database server instance may be offline or otherwise unreachable; and a transaction may be aborted due to a deadlock.&lt;/p&gt;

&lt;p&gt;For the missing server instance, the application should try to reconnect. An &lt;a href=&quot;http://dbpedia.org/resource/Open_Database_Connectivity&quot; id=&quot;link-id0x28e859b8&quot;&gt;ODBC&lt;/a&gt;/&lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id0x28e11940&quot;&gt;JDBC&lt;/a&gt; connect string can specify a list of alternate server instances; thus as long as the application is written to try to reconnect as best practices dictate, there is no new code needed.&lt;/p&gt;

&lt;p&gt;For the deadlock, the application is supposed to retry the transaction. Sometimes when a server instance drops out or rejoins a running cluster, some transactions will have to be retried. To the application, these conditions look like a deadlock. If the application handles deadlocks (&lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x2bda4e40&quot;&gt;SQL&lt;/a&gt; State 40001) as best practices dictate, there is no change needed.&lt;/p&gt;
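
&lt;p&gt;The two best practices mentioned above, trying alternate server instances and retrying on SQL State 40001, might look roughly like the following Python sketch. It is not tied to any real driver: DbError, connect_to_any, run_with_retry, and the connect callback are hypothetical names invented for the example, and a real ODBC/JDBC driver would surface the SQL State through its own exception type.&lt;/p&gt;

```python
import time


class DbError(Exception):
    """Hypothetical stand-in for a driver exception carrying a SQLSTATE."""

    def __init__(self, sqlstate, message=""):
        super().__init__(message)
        self.sqlstate = sqlstate


# SQL State 40001 is what the application sees for a deadlock, and, per the
# text above, also when an instance drops out of or rejoins the cluster.
RETRYABLE_STATES = {"40001"}


def connect_to_any(servers, connect):
    """Try each server in turn; return the first connection that succeeds.

    connect is a caller-supplied function (an assumption of this sketch)
    that raises ConnectionError when a server instance is unreachable.
    """
    last_error = None
    for server in servers:
        try:
            return connect(server)
        except ConnectionError as err:
            last_error = err
    raise ConnectionError("no server instance reachable") from last_error


def run_with_retry(transaction, max_attempts=5, delay_seconds=0.0):
    """Run transaction() and retry it when it aborts with a retryable state."""
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except DbError as err:
            # Give up on non-retryable errors or once attempts are exhausted.
            if err.sqlstate not in RETRYABLE_STATES or attempt == max_attempts:
                raise
            time.sleep(delay_seconds * attempt)  # simple linear backoff
```

&lt;p&gt;An application that already wraps its transactions this way needs no additional logic for fault tolerance; the retry path that handles ordinary deadlocks also covers cluster membership changes.&lt;/p&gt;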

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;In summary...&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Limited extra cost for fault tolerance; no equipment sitting idle.&lt;/li&gt;
&lt;li&gt;Easy operation: Replace servers when they fail; the cluster does the rest.&lt;/li&gt;
&lt;li&gt;No changes needed to most applications.&lt;/li&gt;
&lt;li&gt;No proprietary SQL APIs or special fault tolerance logic needed in applications.&lt;/li&gt;
&lt;li&gt;Fully transactional programming model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All the above applies to both the Graph Model (&lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x22606f10&quot;&gt;RDF&lt;/a&gt;) and Relational (SQL) sides of Virtuoso. These features will be in the commercial release of Virtuoso to be publicly available in the next 2-3 weeks. Please &lt;a href=&quot;http://www.openlinksw.com/contact/&quot; id=&quot;link-id0x24f35648&quot;&gt;contact OpenLink Software&lt;/a&gt; Sales for details of availability or for getting advance evaluation copies.&lt;/p&gt;

&lt;h3&gt;
&lt;a href=&quot;http://dbpedia.org/resource/Glossary&quot; id=&quot;link-id0x6648890&quot;&gt;Glossary&lt;/a&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
  &lt;b&gt;Virtuoso Cluster (VC)&lt;/b&gt; -- a collection of Virtuoso Cluster Nodes on one or more machines, working in parallel as part of a Virtuoso Cluster.&lt;/li&gt;
&lt;li&gt;
  &lt;b&gt;Virtuoso Cluster Node (VCN)&lt;/b&gt; -- a Virtuoso Server Instance (Non Fault-Tolerant Operations), or a Quorum of Server Instances (Fault Tolerant Operations), which is a member of a collection of Virtuoso Cluster Nodes working in parallel as part of a Virtuoso Cluster.&lt;/li&gt;
&lt;li&gt;
  &lt;b&gt;Virtuoso Host Cluster (VHC)&lt;/b&gt; -- a collection of machines, each hosting one or more Virtuoso Server Instances, making up a Virtuoso Cluster.&lt;/li&gt;
&lt;li&gt;
  &lt;b&gt;Virtuoso Host Cluster Node (VHCN)&lt;/b&gt; -- a machine hosting one or more Virtuoso Server Instances that are members of a Virtuoso Cluster.&lt;/li&gt;
&lt;li&gt;
  &lt;b&gt;Virtuoso Server Instance (VSI)&lt;/b&gt; -- a single Virtuoso process with exclusive access to its own permanent storage, consisting of database files and logs.  May comprise an entire Virtuoso Cluster Node (Non Fault-Tolerant Operations), or be one member of a quorum which comprises a Virtuoso Cluster Node (Fault Tolerant Operations).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Also see&lt;/h3&gt;
&lt;ul&gt;
 &lt;li&gt;
  &lt;a href=&quot;http://www.gbcacm.org/sites/www.gbcacm.org/files/slides/SpecialRelativity[1]_0.pdf&quot; id=&quot;link-id0x1320f1e8&quot;&gt;Special Relativity and the Problem of Database Scalability (PDF)&lt;/a&gt;, by James Starkey of &lt;a href=&quot;http://www.nimbusdb.com/&quot; id=&quot;link-id0x1320f2b0&quot;&gt;NimbusDB, Inc.&lt;/a&gt;
 &lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>SemData@Sofia Roundtable write-up</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-03-15#1615</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1615#comments</comments><pubDate>Mon, 15 Mar 2010 14:46:57 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2010-03-22T12:34:40.000010-04:00</n0:modified><description>&lt;p&gt;There was last week an &lt;a href=&quot;http://www.semdata.org/&quot; id=&quot;link-id11a83cf98&quot;&gt;invitation-based roundtable&lt;/a&gt; about semantic &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x1d37f598&quot;&gt;data&lt;/a&gt; management in &lt;a href=&quot;http://www.dbpedia.org/resource/Sofia&quot; id=&quot;link-id0x1ba4a208&quot;&gt;Sofia, Bulgaria&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Lots of smart people together. The meeting was hosted by &lt;a href=&quot;http://dbpedia.org/resource/Ontotext&quot; id=&quot;link-id0x1cfc83f8&quot;&gt;Ontotext&lt;/a&gt; and chaired by &lt;a href=&quot;http://www.dbpedia.org/resource/Dieter_Fensel&quot; id=&quot;link-id0x1dc6e0d0&quot;&gt;Dieter Fensel&lt;/a&gt;. On the database side we had Ontotext, &lt;a href=&quot;http://www.systap.com/&quot; id=&quot;link-id0x1cda77f0&quot;&gt;SYSTAP&lt;/a&gt; (&lt;a href=&quot;http://www.systap.com/bigdata.htm&quot; id=&quot;link-id0x1dba6a30&quot;&gt;Bigdata&lt;/a&gt;), &lt;a href=&quot;http://dbpedia.org/resource/National_Research_Institute_for_Mathematics_and_Computer_Science&quot; id=&quot;link-id0x1d8e1d88&quot;&gt;CWI&lt;/a&gt; (&lt;a href=&quot;http://dbpedia.org/resource/MonetDB&quot; id=&quot;link-id0x1d8cbcf0&quot;&gt;MonetDB&lt;/a&gt;), &lt;a href=&quot;http://www.dbpedia.org/resource/Karlsruhe_Institute_of_Technology&quot; id=&quot;link-id0x1e204cb0&quot;&gt;Karlsruhe Institute of Technology&lt;/a&gt; (YARS2/&lt;a href=&quot;http://swse.deri.ie/&quot; id=&quot;link-id0x1e653bf0&quot;&gt;SWSE&lt;/a&gt;). &lt;a href=&quot;http://www.larkc.eu/&quot; id=&quot;link-id0x1e6a4408&quot;&gt;LarKC&lt;/a&gt; was well represented, being our hosts, with STI, Ontotext, CYC, and &lt;a href=&quot;http://www.vu.nl/&quot; id=&quot;link-id0x1c8a6090&quot;&gt;VU Amsterdam&lt;/a&gt;. Notable absences were &lt;a href=&quot;http://dbpedia.org/resource/Oracle_Database&quot; id=&quot;link-id0x1e5ab690&quot;&gt;Oracle&lt;/a&gt;, &lt;a href=&quot;http://freebase.com/guid/9202a8c04000641f8000000005c908d6&quot; id=&quot;link-id0x1f5e5ff0&quot;&gt;Garlik&lt;/a&gt;, &lt;a href=&quot;http://semanticweb.org/id/Franz_Inc&quot; id=&quot;link-id0x1d9c08f0&quot;&gt;Franz&lt;/a&gt;, and &lt;a href=&quot;http://www.talis.com/&quot; id=&quot;link-id0x1d338b30&quot;&gt;Talis&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now of semantic data management... What is the difference between a relational database and a semantic repository, a triple/quad store, a whatever-you-call-them?&lt;/p&gt;

&lt;p&gt;I had last fall a meeting at CWI with Martin Kersten, Peter Boncz and Lefteris Sidirourgos from CWI, and Frank van Harmelen and Spiros Kotoulas of VU Amsterdam, to start a dialogue between semanticists and databasers. Here we were with many more people trying to discover what the case might be. What are the differences?&lt;/p&gt;

&lt;p&gt;Michael &lt;a href=&quot;http://dbpedia.org/resource/Michael_Stonebraker&quot; id=&quot;link-id0x1da55730&quot;&gt;Stonebraker&lt;/a&gt; and Martin Kersten have basically said that what is sauce for the goose is sauce for the gander, and that there is no real difference between relational DB and &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x1d828310&quot;&gt;RDF&lt;/a&gt; storage, except maybe for a little tuning in some data structures or parameters. Semantic repository implementors on the other hand say that when they tried putting triples inside an RDB it worked so poorly that they did everything from scratch. (It is a geekly penchant to do things from scratch, but then this is not always unjustified.)&lt;/p&gt;

&lt;p&gt;
&lt;a href=&quot;http://www.openlinksw.com/dataspace/organization/openlink#this&quot; id=&quot;link-id0x1cf1e620&quot;&gt;OpenLink Software&lt;/a&gt; and &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x1cfbc1d8&quot;&gt;Virtuoso&lt;/a&gt; are in agreement with both sides, contradictory as this might sound. We took our &lt;a href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id0x1e1f6a20&quot;&gt;RDBMS&lt;/a&gt; and added data types and structures and cost model alterations to an existing platform. Oracle did the same. MonetDB considers doing this and time will tell the extent of their RDF-oriented alterations. Right now the estimate is that this will be small and not in the kernel.&lt;/p&gt;

&lt;p&gt;I would say with confidence that without source code access to the RDB, RDF will not be particularly convenient or efficient to accommodate. With source access, we found that what serves RDB also serves RDF. For example, execution engine and data compression considerations are the same, with minimal tweaks for RDF&amp;#39;s run time typing needs.&lt;/p&gt;

&lt;p&gt;So now we are founding a platform for continuing this discussion. There will be workshops and calls for papers and the beginnings of a research community.&lt;/p&gt;

&lt;p&gt;After the initial meeting at CWI, I tried to figure what the difference was between the databaser and semanticist minds. Really, the things are close but there is still a disconnect. Database is about big sets and semantics is about individuals, maybe. The databaser discovers that the operation on each member of the set is not always the same, and the semanticist discovers that the operation on each member of the set is often the same.&lt;/p&gt;

&lt;p&gt;So the semanticist says that big joins take time. The databaser tells the semanticist not to repeat what&amp;#39;s been obvious for 40 years and for which there is anything from partitioned hashes to merges to various vectored execution models. Not to mention columns.&lt;/p&gt;

&lt;p&gt;Spiros of VU Amsterdam/LarKC says that map-reduce materializes inferential closure really fast. Lefteris of CWI says that while he is not a semantic person, he does not understand what the point of all this materializing is, nobody is asking the question, right? So why answer? I say that computing inferential closure is a semanticist tradition; this is just what they do. Atanas Kiryakov of Ontotext says that this is not just a tradition whose start and justification is in the forgotten mists of history, but actually a clear and present need; just look at all the joining you would need.&lt;/p&gt;

&lt;p&gt;Michael Witbrock of CYC says that it is not about forward or backward inference on toy rule sets, but that both will be needed and on massively bigger rule sets at that. Further, there can be machine learning to direct the inference, doing the meta-reasoning merged with the reasoning itself.&lt;/p&gt;

&lt;p&gt;I say that there is nothing wrong with materialization if it is guided by need, in the vein of memo-ization or cracking or recycling as is done in MonetDB. Do the work when it is needed, and do not do it again.&lt;/p&gt;
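
&lt;p&gt;As a toy illustration of need-driven materialization, consider the Python sketch below. It has nothing to do with how MonetDB or Virtuoso implement recycling; the class hierarchy, the SUBCLASS_OF table, and the superclasses function are made up for the example, and the hierarchy is assumed to be acyclic.&lt;/p&gt;

```python
from functools import lru_cache

# Hypothetical asserted subclass edges; only direct parents are stored.
SUBCLASS_OF = {
    "Laptop": ("Computer",),
    "Computer": ("Product",),
    "Product": (),
}


@lru_cache(maxsize=None)
def superclasses(cls):
    """Transitive superclasses of cls, computed on first request, then cached.

    Nothing is materialized up front; each closure is derived when a query
    first needs it and is reused by later calls. Assumes no cycles.
    """
    result = set()
    for parent in SUBCLASS_OF.get(cls, ()):
        result.add(parent)
        result |= superclasses(parent)
    return frozenset(result)
```

&lt;p&gt;Asking for the superclasses of Laptop derives and caches the closures of Computer and Product along the way, so a later question about Computer is answered from the cache. Classes nobody asks about are never expanded.&lt;/p&gt;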

&lt;p&gt;Brian Thompson of Systap/Bigdata asks whether it is not a contradiction in terms to both want pluggability and merging inference into the data, like LarKC would be doing. I say that this is difficult but not impossible and that when you run joins in a cluster database, as you decide based on the data where the next join step will be, so it will be with inference. Right there, between join steps, integrated with whatever data partitioning logic you have, for partitioning you &lt;i&gt;will&lt;/i&gt; have, data being bigger and bigger. And if you have reuse of intermediates and demand driven indexing &lt;i&gt;à la&lt;/i&gt; MonetDB, this too integrates and applies to inference results.&lt;/p&gt;


&lt;p&gt;So then, LarKC and CYC, can you picture a pluggable inference interface at this level of granularity? So far, I have received some more detail as to the needs of inference and database integration, essentially validating our previous intuitions and plans.&lt;/p&gt;


&lt;p&gt;Aside from talking of inference, we have the more immediate issue of creating an industry out of the semantic data management offerings of today.&lt;/p&gt;

&lt;p&gt;What do we need for this? We need close-to-parity with relational; doing your warehouse in RDF with the attendant agility thereof can&amp;#39;t cost 10x more to deploy than the equivalent relational solution.&lt;/p&gt;

&lt;p&gt;We also want to tell the key-value, anti-&lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x172e8c80&quot;&gt;SQL&lt;/a&gt; people, who throw away transactions and queries, that there is a better way. And for this, we need to improve our gig just a little bit. Then you have the union of some level of &lt;a href=&quot;http://dbpedia.org/resource/ACID&quot; id=&quot;link-id0x1e0de2e8&quot;&gt;ACID&lt;/a&gt;, at least consistent read, availability, complex query, large scale.&lt;/p&gt;

&lt;p&gt;And to do this, we need a benchmark. It needs a differentiation of online queries and browsing and analytics, graph algorithms and such. We are getting there. We will soon propose a social web benchmark for RDF which has both online and analytical aspects, a data generator, a test driver, and so on, with a &lt;a href=&quot;http://www.tpc.org/&quot; id=&quot;link-id0x1e3cb130&quot;&gt;TPC&lt;/a&gt;-style set of rules. If there is agreement on this, we will all get a few times faster. At this point, RDF will be a lot more competitive with mainstream and we will cross another qualitative threshold. &lt;/p&gt;</description></item><item><title>European Commission and the Data Overflow</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-10-27#1586</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1586#comments</comments><pubDate>Tue, 27 Oct 2009 18:29:51 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2009-10-27T14:57:31-04:00</n0:modified><description>&lt;p&gt;The European Commission recently circulated a questionnaire to selected experts on what could be done for the future of big &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x43bae00&quot;&gt;data&lt;/a&gt;.&lt;/p&gt;
 
&lt;p&gt;Since the &lt;a href=&quot;http://cordis.europa.eu/fp7/ict/content-knowledge/consultation_en.html&quot; id=&quot;link-id1191c0f8&quot;&gt;questionnaire is public&lt;/a&gt;, I am publishing my answers below.&lt;/p&gt;

&lt;ol type=&quot;1&quot; start=&quot;1&quot;&gt;
&lt;li&gt;
  &lt;p&gt;
    &lt;b&gt;Data and data types&lt;/b&gt;
  &lt;/p&gt;

&lt;ol type=&quot;a&quot; start=&quot;1&quot;&gt;
	&lt;li&gt;
    &lt;p&gt;
        &lt;b&gt;What volumes of data are we dealing with today? What is the growth rate? Where can we expect to be in 2015? &lt;/b&gt;
    &lt;/p&gt;

&lt;p&gt;Private data warehouses of corporations have more than doubled yearly for the past years; hundreds of TB is not exceptional.  This will continue. The real shift is in structured data being published in increasing quantities with a minimum level of integrate-ability through use of &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x5c7add0&quot;&gt;RDF&lt;/a&gt; and &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id0x5c7adb8&quot;&gt;linked data&lt;/a&gt; principles. There are rewards for use of standard vocabularies and identifiers through search engines recognizing such data.  There is convergence around &lt;a href=&quot;http://dbpedia.org/resource/DBpedia&quot; id=&quot;link-id0x5c7ada0&quot;&gt;DBpedia&lt;/a&gt; identifiers for real-world entities, e.g., most things that would be in the news.&lt;/p&gt;

&lt;p&gt;This also means that internal data processes and silos may be enriched with this content.  There is consequent pressure for accommodating more diversity of data, with more flexible &lt;a href=&quot;http://dbpedia.org/resource/Database_schema&quot; id=&quot;link-id0x7d87a88&quot;&gt;schema&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Ultimately, all content presently stored in RDBs and presented in publicly accessible dynamic web pages will end up on the web of linked data.  Examples are product catalogs, price lists, event schedules, and the like.&lt;/p&gt;

&lt;p&gt;The volume of the well known linked data sets is around 10 billion statements.  With the above mentioned trends, growth by two or three orders of magnitude by 2015 seems reasonable.  This is so especially if explicit semantics are extracted from the document web and if there is some further progress in the precision/recall of such extraction.&lt;/p&gt;

&lt;p&gt;Relevant sections of this mass of data are a potential addition to any present or future analytics application.&lt;/p&gt;

&lt;p&gt;Since arbitrary analytics over the database which is the web cannot be economically provided by a centralized search engine, a cloud model may be used for on-demand selection of relevant data and mixing it with private data.  This will drive database innovation for the next years even more than the continued classical warehouse growth.&lt;/p&gt;

&lt;p&gt;Science data is another driver of the data overflow.  For example, faster gene sequencing, more accurate measurements in high energy physics, better imaging, and remote sensing will produce large volumes of data.  This data has highly regular structure but labeling this data with source and lineage calls for a flexible, schema-last, self-describing model, such as RDF and linked data.  Data and &lt;a href=&quot;http://dbpedia.org/resource/Metadata&quot; id=&quot;link-id0x7a3fb40&quot;&gt;metadata&lt;/a&gt; should travel together but may have different data models.&lt;/p&gt;

&lt;p&gt;By and large, the metadata of science data will be another stream to the web of linked data, at least to the degree it is publicly accessible.  Restricted circles can and likely will implement similar ideas.&lt;/p&gt;
    &lt;/li&gt;

&lt;li&gt;
    &lt;p&gt;
        &lt;b&gt;What types of data can we deal with intelligently due to their inherent structure (geospatial, temporal, social or &lt;a href=&quot;http://dbpedia.org/resource/Knowledge&quot; id=&quot;link-id0x5a48058&quot;&gt;knowledge&lt;/a&gt; graphs, 3D, sensor streams...)?&lt;/b&gt;
    &lt;/p&gt;

&lt;p&gt;All the above types should be supported inside one DBMS so as to allow efficient querying combining conditions on all these types of data, e.g., &lt;i&gt;photos of sunsets taken last summer in Ibiza, with over 20 megapixels, by people I know.&lt;/i&gt;
      &lt;/p&gt;

&lt;p&gt;Note that the test for being a sunset is an operation on the image blob that should be taken to the data; the images cannot be economically transferred.&lt;/p&gt;

&lt;p&gt;Interleaving of all database functions and types becomes increasingly important.&lt;/p&gt;
&lt;/li&gt;
  &lt;/ol&gt;
&lt;/li&gt;


&lt;li&gt;
  &lt;p&gt;
    &lt;b&gt;Industries, communities&lt;/b&gt;
  &lt;/p&gt;

&lt;ol type=&quot;a&quot; start=&quot;1&quot;&gt;
	&lt;li&gt;
    &lt;p&gt;
        &lt;b&gt;Who is producing these data and why? Could they do it better? How?&lt;/b&gt;
    &lt;/p&gt;

&lt;p&gt;Right now, projects such as &lt;a href=&quot;http://www.bio2rdf.org/&quot; id=&quot;link-id0x2a29de8&quot;&gt;Bio2RDF&lt;/a&gt;, &lt;a href=&quot;http://neurocommons.org/page/Main_Page&quot; id=&quot;link-id0x7ddaed0&quot;&gt;Neurocommons&lt;/a&gt;, and DBPedia produce this data.  The processes are in place and are reasonable.  Incremental improvement is to be expected.  These processes, along with the &lt;a href=&quot;http://www.w3.org/DesignIssues/LinkedData.html&quot; id=&quot;link-id0xbab4dfd0&quot;&gt;linked data meme&lt;/a&gt; generally taking off, drive demand for better &lt;a href=&quot;http://dbpedia.org/resource/Natural_language_processing&quot; id=&quot;link-id0x51f4e0&quot;&gt;NLP&lt;/a&gt; (&lt;a href=&quot;http://dbpedia.org/resource/Natural_language_processing&quot; id=&quot;link-id0x51a1b48&quot;&gt;Natural Language Processing&lt;/a&gt;), e.g., &lt;a href=&quot;http://dbpedia.org/resource/Entity&quot; id=&quot;link-id0x956680&quot;&gt;entity&lt;/a&gt; and relationship extraction, especially extraction that can produce instance data in given ontologies (e.g., events) using common identifiers (e.g., DBPedia URIs).&lt;/p&gt;

&lt;p&gt;Mapping of RDBs to RDF is possible, and a W3C working group is developing standards for this.  The required baseline level has been reached; the rest is a matter of automating deployment.  Within the enterprise, there are advantages to be gained for &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id0x7da9e80&quot;&gt;information&lt;/a&gt; integration; e.g., all entities in the CRM space can be integrated with all email and support tickets through giving everything a &lt;a href=&quot;http://dbpedia.org/resource/Uniform_Resource_Identifier&quot; id=&quot;link-id0x71673f8&quot;&gt;URI&lt;/a&gt;.  Some of this information may even be published on an &lt;a href=&quot;http://dbpedia.org/resource/Extranet&quot; id=&quot;link-id0x9aa6e0&quot;&gt;extranet&lt;/a&gt; for self-service and web-service interfaces.  This has been done at small scales and the rest is a matter of spreading adoption and lowering the entry barrier.  Incremental progress will take place, eventually resulting in qualitatively better integration along the value chain when adoption is sufficiently widespread.&lt;/p&gt;
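
&lt;p&gt;As a rough illustration of giving everything a URI, the following Python sketch turns relational rows into subject-predicate-object statements.  It is a minimal sketch under stated assumptions and not the mapping language of the W3C working group; the namespace &lt;code&gt;http://example.com/crm/&lt;/code&gt;, the function name &lt;code&gt;row_to_triples&lt;/code&gt;, and the table and column names are all hypothetical.&lt;/p&gt;

&lt;pre&gt;
```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical namespace; a real deployment would use its own domain.
BASE = "http://example.com/crm/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def row_to_triples(table: str, key: str, row: Dict[str, object]) -> List[Triple]:
    """Map one relational row to triples: one URI per row, one predicate per column."""
    subject = f"{BASE}{table}/{row[key]}"
    triples = [(subject, RDF_TYPE, f"{BASE}schema/{table}")]
    for column, value in row.items():
        if column == key or value is None:
            continue  # the key is already in the URI; NULLs produce no statement
        triples.append((subject, f"{BASE}schema/{table}#{column}", str(value)))
    return triples


if __name__ == "__main__":
    for triple in row_to_triples("customer", "id", {"id": 42, "name": "Acme", "fax": None}):
        print(triple)
```
&lt;/pre&gt;

&lt;p&gt;Once a customer, an email, and a support ticket each have such a URI, statements from different systems about the same entity can be merged without first agreeing on a common schema.&lt;/p&gt;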

&lt;/li&gt;
	&lt;li&gt;
    &lt;p&gt;
        &lt;b&gt;Who is consuming these data and why? Could they do it better? How?&lt;/b&gt;
    &lt;/p&gt;

&lt;p&gt;Consumers are various.  The greatest need is for tools that summarize complex data and give a bird&amp;#39;s eye view of what data is available in the first place.  Consuming the data is hindered by the user not necessarily even knowing what data there is.  This is somewhat new, as traditionally the business analyst did know the schema of the warehouse and was proficient with &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x7f7b148&quot;&gt;SQL&lt;/a&gt; report generators and statistics packages.&lt;/p&gt;

&lt;p&gt;Where Web 2.0 made the &lt;i&gt;citizen journalist&lt;/i&gt;, the web of linked data will make the &lt;i&gt;citizen analyst&lt;/i&gt;.  For this to happen, with benefits for individuals, enterprises, and governments alike, more work in user interfaces, knowledge discovery, and query composition will be useful.  We may envision a &amp;quot;meshup economy&amp;quot; where data is plentiful, but the unit of value and exchange is the smart report that crystallizes actionable value from this ocean.&lt;/p&gt;

&lt;/li&gt;
	&lt;li&gt;
    &lt;p&gt;
        &lt;b&gt;What industrial sectors in Europe could become more competitive if they became much better at managing data?&lt;/b&gt;
    &lt;/p&gt;

&lt;p&gt;Any sector could benefit.  Early adopters are seen in the biomedical field and to an extent in media.  &lt;/p&gt;

&lt;/li&gt;
	&lt;li&gt;
    &lt;p&gt;
        &lt;b&gt;Is the regulation landscape imposing constraints (privacy, compliance ...) that don&amp;#39;t have today good tool support?&lt;/b&gt;
    &lt;/p&gt;

&lt;p&gt;The regulation landscape drives database demand through data retention requirements and the like.&lt;/p&gt;

&lt;p&gt;With data integration, especially with privacy-sensitive data (as in medicine), there are issues of whether one dares put otherwise-shareable information online.   Regulation is needed to protect individuals, but integration should still be possible for science.&lt;/p&gt;

&lt;p&gt;For this, we see a need for progress in applying policy-based approaches (e.g., row level security) to relatively schema-last data such as RDF.  This is possible but needs some more work.  Also, creating on-the-fly-anonymizing views on data might help.&lt;/p&gt;
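
&lt;p&gt;To make the idea of statement-level policies and on-the-fly anonymizing views more concrete, here is a minimal Python sketch over an in-memory list of statements.  It is an illustration only and not how any particular RDF store implements security; the names &lt;code&gt;policy_view&lt;/code&gt; and &lt;code&gt;pseudonym&lt;/code&gt;, the use of named graphs as the unit of policy, and the salted hash are assumptions made for the example.  Note that a salted hash only pseudonymizes identifiers; real anonymization needs considerably more care.&lt;/p&gt;

&lt;pre&gt;
```python
import hashlib
from typing import Iterable, Iterator, Set, Tuple

# A statement plus the named graph it belongs to (plain strings for simplicity).
Quad = Tuple[str, str, str, str]  # subject, predicate, object, graph


def pseudonym(uri: str, salt: str) -> str:
    """Replace an identifier with a stable, salted hash (pseudonymization only)."""
    digest = hashlib.sha256((salt + uri).encode("utf-8")).hexdigest()
    return "urn:anon:" + digest[:16]


def policy_view(
    quads: Iterable[Quad],
    allowed_graphs: Set[str],
    hidden_predicates: Set[str],
    salt: str,
) -> Iterator[Tuple[str, str, str]]:
    """Yield only the statements the caller may see, with subjects pseudonymized."""
    for subject, predicate, obj, graph in quads:
        if graph not in allowed_graphs:
            continue  # statement-level policy: silently leave the statement out
        if predicate in hidden_predicates:
            continue  # directly identifying attributes are never exposed
        yield (pseudonym(subject, salt), predicate, obj)


if __name__ == "__main__":
    data = [
        ("http://example.org/patient/1", "ex:diagnosis", "flu", "ex:research"),
        ("http://example.org/patient/1", "ex:name", "Alice", "ex:research"),
        ("http://example.org/patient/2", "ex:diagnosis", "cold", "ex:private"),
    ]
    for statement in policy_view(data, {"ex:research"}, {"ex:name"}, "demo-salt"):
        print(statement)
```
&lt;/pre&gt;

&lt;p&gt;Because the same subject always maps to the same pseudonym, statements about one individual can still be joined for research purposes without exposing who that individual is.&lt;/p&gt;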

&lt;p&gt;More research is needed for reconciling the need for security with the advantages of broad-based &lt;i&gt;ad hoc&lt;/i&gt; integration.  Ideally, data should be intelligent, aware of its origins and classification and cautious of whom it interacts with, all of this supported under the covers so that the user could ask anything but the data might refuse to answer or might restrict answers according to the user&amp;#39;s profile.  This is a tall order and implementing something of the sort is an open question.&lt;/p&gt;


&lt;/li&gt;
	&lt;li&gt;
    &lt;p&gt;
        &lt;b&gt;What are the main practical problem identified for individuals and organizations? Please give examples and tell us about the main obstacles and barriers.&lt;/b&gt;
    &lt;/p&gt;

&lt;p&gt;We have come across the following:&lt;/p&gt;

&lt;ul&gt;
        &lt;li&gt;Knowing that the data exists in the first place.&lt;/li&gt;
&lt;li&gt;If the data is found, figuring out the provenance, units and precision of measurement, identifiers, and the like.&lt;/li&gt;
&lt;li&gt;Compatible subject matter but incompatible representation:  For example, one has numbers on a map with different maps for different points in time; another has time series of instrument data with geo-location for the instrument.  It is only to be expected that the time interval between measurements is not the same.  So there is need for a lot of one-off programming to align data.&lt;/li&gt;
      &lt;/ul&gt;
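
&lt;p&gt;The one-off alignment programming mentioned in the last item often amounts to something like the following Python sketch, which resamples two time series with different measurement intervals onto a common grid by linear interpolation.  The function names &lt;code&gt;interpolate&lt;/code&gt; and &lt;code&gt;align&lt;/code&gt; and the choice of linear interpolation are assumptions for illustration; real data may call for a different resampling method and for unit conversion first.&lt;/p&gt;

&lt;pre&gt;
```python
from bisect import bisect_right
from typing import List, Optional, Sequence, Tuple

Series = Sequence[Tuple[float, float]]  # (time, value) pairs sorted by time


def interpolate(series: Series, t: float) -> Optional[float]:
    """Linearly interpolate a time-sorted series at time t; None outside its range."""
    if not series:
        return None
    times = [ts for ts, _ in series]
    if times[0] > t or t > times[-1]:
        return None
    i = bisect_right(times, t)
    if i == len(series):
        return series[-1][1]  # t is exactly the last timestamp
    t0, v0 = series[i - 1]
    t1, v1 = series[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def align(a: Series, b: Series, step: float) -> List[Tuple[float, Optional[float], Optional[float]]]:
    """Resample two non-empty series onto a shared grid covering their overlap."""
    start = max(a[0][0], b[0][0])
    end = min(a[-1][0], b[-1][0])
    rows = []
    t = start
    while end >= t:
        rows.append((t, interpolate(a, t), interpolate(b, t)))
        t += step
    return rows


if __name__ == "__main__":
    coarse = [(0, 0.0), (10, 10.0)]              # one reading every 10 time units
    fine = [(0, 100.0), (4, 104.0), (8, 108.0)]  # one reading every 4 time units
    for row in align(coarse, fine, 2):
        print(row)
```
&lt;/pre&gt;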

&lt;p&gt;Other problems have to do with sheer volume, i.e., transfer of data even in a local area network is too slow, let alone over a wide area network.  Computation needs to go to the data, and databases need to support this.&lt;/p&gt;

&lt;/li&gt;
  &lt;/ol&gt;
&lt;/li&gt;

&lt;li&gt;
  &lt;p&gt;
    &lt;b&gt;Services, software stacks, protocols, standards, benchmarks&lt;/b&gt;
  &lt;/p&gt;

&lt;ol type=&quot;a&quot; start=&quot;1&quot;&gt;
	&lt;li&gt;
    &lt;p&gt;
        &lt;b&gt;What combinations of components are needed to deal with these problems?&lt;/b&gt;
    &lt;/p&gt;

&lt;p&gt;Recent times have seen a proliferation of special-purpose databases.  Since the data needs of the future are about combining data with maximum agility and minimum performance hit, there is a need to gather the currently-separate functionality into an integrated system with sufficient flexibility.  We see some of this in the integration of map-reduce and scale-out databases.  The former antagonists have become partners. Vertica, &lt;a href=&quot;http://dbpedia.org/resource/Greenplum&quot; id=&quot;link-id0x7a94e70&quot;&gt;Greenplum&lt;/a&gt;, and OpenLink &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x2ab2868&quot;&gt;Virtuoso&lt;/a&gt; are examples of DBMSs featuring work in this direction.&lt;/p&gt;

&lt;p&gt;Interoperability and at least &lt;i&gt;de facto&lt;/i&gt; standards in ways of doing this will emerge.&lt;/p&gt;

&lt;/li&gt;
	&lt;li&gt;
    &lt;p&gt;
        &lt;b&gt;What data exchange and processing mechanisms will be needed to work across platforms and programming languages?&lt;/b&gt;
    &lt;/p&gt;

&lt;p&gt;
        &lt;a href=&quot;http://dbpedia.org/resource/Hypertext_Transfer_Protocol&quot; id=&quot;link-id0x78a0458&quot;&gt;HTTP&lt;/a&gt;, &lt;a href=&quot;http://dbpedia.org/resource/XML&quot; id=&quot;link-id0x7ff2360&quot;&gt;XML&lt;/a&gt;, and RDF are in fact very verbose, yet these are the formats and models that have uptake.  Thus, these will continue to be used even though one might think binary formats to be more efficient.&lt;/p&gt;

&lt;p&gt;There are of course science data set standards that are more compressed and these will continue, hopefully adding a practice of rich metadata in RDF.&lt;/p&gt;

&lt;p&gt;For internals of systems, MPI and TCP/IP with proprietary optimized wire formats will continue.  Inter-system communication will likely continue to be HTTP, XML, and RDF as appropriate.&lt;/p&gt;


&lt;/li&gt;
	&lt;li&gt;
    &lt;p&gt;
        &lt;b&gt;What data environments are today so wastefully messy that they would benefit from the development of standards?&lt;/b&gt;
    &lt;/p&gt;


&lt;p&gt;RDF and &lt;a href=&quot;http://dbpedia.org/resource/Web_Ontology_Language&quot; id=&quot;link-id0x5643d70&quot;&gt;OWL&lt;/a&gt; are not messy but they could use some more performance; we are working on this.  &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x152ab18&quot;&gt;SPARQL&lt;/a&gt; is finally acquiring the capabilities of a serious query language, so things are slowly coming together.&lt;/p&gt;

&lt;p&gt;Community process for developing application domain specific vocabularies works quite well, even though one could argue it is &lt;i&gt;ad hoc&lt;/i&gt; and not up to what a modeling purist might wish.&lt;/p&gt;

&lt;p&gt;Top-down imposition of standards has a mixed history, with long and expensive development and sometimes little or no uptake; consider some WS-* standards, for example.&lt;/p&gt;

&lt;/li&gt;
	&lt;li&gt;
    &lt;p&gt;
        &lt;b&gt;What kind of performance is expected or required of these systems? Who will measure it reliably? How?&lt;/b&gt;
    &lt;/p&gt;

&lt;p&gt;Relational databases have a history of substantial investment in &lt;a href=&quot;http://dbpedia.org/resource/Program_optimization&quot; id=&quot;link-id0xecc100&quot;&gt;optimization&lt;/a&gt; and some of them are very good for what they do, e.g., the newer generation of analytics databases.&lt;/p&gt;

&lt;p&gt;The very large schema-last, no-SQL, sometimes eventually consistent key-value stores have a somewhat shorter history but do fill a real need.&lt;/p&gt;

&lt;p&gt;These trends will merge:  Extreme scale, schema-last, complex queries, even more complex inference, custom code for in-database machine learning and other bulk processing.&lt;/p&gt;

&lt;p&gt;We find RDF augmented with some binary types at this crossroads.  This point of the design space will have to provide performance roughly on the level of today&amp;#39;s best relational solution for workloads that fit the relational model.  The added cost of schema-last and inference must come down.  We are working on this.  Research work such as carried out with &lt;a href=&quot;http://dbpedia.org/resource/MonetDB&quot; id=&quot;link-id0x7ae2890&quot;&gt;MonetDB&lt;/a&gt; gives clues as to how these aims can be reached.&lt;/p&gt;

&lt;p&gt;The separation of query language and inference is artificial.  After the concepts are mature, these functions will merge and execute close to the data; there are clear evolutionary pressures in this direction.&lt;/p&gt;

&lt;p&gt;Benchmarks are key.  Some gain can be had even from repurposing standard relational benchmarks like &lt;a href=&quot;http://www.tpc.org/&quot; id=&quot;link-id0x71eb528&quot;&gt;TPC&lt;/a&gt;-&lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id0x5e16a40&quot;&gt;H&lt;/a&gt;.  But the TPC-H rules do not allow official reporting of such.&lt;/p&gt;

&lt;p&gt;Development of benchmarks for RDF, complex queries, and inference is needed.  A bold challenge to the community, it should be rooted in real-life integration needs and involve high heterogeneity.  A key-value store benchmark might also be conceived.  A transaction benchmark like TPC-&lt;a href=&quot;http://dbpedia.org/resource/C%2B%2B&quot; id=&quot;link-id0x78562d0&quot;&gt;C&lt;/a&gt; might be the basis, maybe augmented with massive user-generated content like reviews and blogs.&lt;/p&gt;

&lt;p&gt;If benchmarks exist and are not too easy nor inaccessibly difficult nor too expensive to run (think of the high-end TPC-C results), then TPC-style rules and processes would be quite adequate.  The threshold to publish should be lowered:  Everybody runs the TPC workloads internally but few publish.&lt;/p&gt;

&lt;p&gt;Some EC initiative for benchmarking could make sense, similar to the TREC initiative of the US government.  Industry should be consulted for the specific content; possibly the answers to the present questionnaire can provide an approximate direction.&lt;/p&gt;

&lt;p&gt;Benchmarks should be run by software vendors on their own systems, tuned by themselves.  But there should be a process of disclosure and auditing; the TPC rules give an example.  Compliance should not be too expensive or time consuming.  Some community development for automating these things would be a worthwhile target for EC funding.&lt;/p&gt;

&lt;/li&gt;
  &lt;/ol&gt;
&lt;/li&gt;

&lt;li&gt;
  &lt;p&gt;
    &lt;b&gt;Usability and training&lt;/b&gt;
  &lt;/p&gt;

&lt;ol type=&quot;a&quot; start=&quot;1&quot;&gt;

	&lt;li&gt;
    &lt;p&gt;
        &lt;b&gt;How difficult will it be for a developer of average competence to deploy components whose core is based on rather deep computer science? Do we all need to understand Monads and Continuations? What can be done to make it ever easier?&lt;/b&gt;
    &lt;/p&gt;

&lt;p&gt;In the database world, huge advances in technology have taken place behind a relatively simple and stable interface: SQL.  For the linked data &lt;a href=&quot;http://dbpedia.org/resource/Giant_Global_Graph&quot; id=&quot;link-id0x7761e50&quot;&gt;web&lt;/a&gt;, the same will take place behind SPARQL.&lt;/p&gt;

&lt;p&gt;Beyond these, things get harder.  For example, programming an arbitrary algorithm with MPI so that it makes good use of a cluster platform is quite difficult.  The casual amateur is hereby warned.&lt;/p&gt;

&lt;p&gt;There is no single solution.  Explicit, programmatic parallelization, with MPI for example, scales very poorly in terms of required skill; we should therefore favor declarative and/or functional approaches that lend themselves to automatic parallelization.&lt;/p&gt;

&lt;p&gt;Developing a debugger and explanation engine for rule-based and description-logics-based inference would be an idea.&lt;/p&gt;

&lt;p&gt;For procedural workloads, things like Erlang may be good in cases and are not overly difficult in principle, especially if there are good debugging facilities.&lt;/p&gt;

&lt;p&gt;For shipping functions in a cluster or cloud, the &lt;a href=&quot;http://www.eecs.berkeley.edu/Research/Projects/Data/105733.html&quot; id=&quot;link-id0x5494b0&quot;&gt;BOOM&lt;/a&gt; (&lt;a href=&quot;http://www.eecs.berkeley.edu/Research/Projects/Data/105733.html&quot; id=&quot;link-id0x7f1f148&quot;&gt;Berkeley Orders Of Magnitude&lt;/a&gt;) approach or logic programming with explicit specification of compute location seem promising, surely more flexible than map-reduce.  The question is whether a &lt;a href=&quot;http://dbpedia.org/resource/PHP&quot; id=&quot;link-id0x5c758c8&quot;&gt;PHP&lt;/a&gt; developer can be made to do logic programming.&lt;/p&gt;

&lt;p&gt;This bridge will be crossed only with actual need and even then reluctantly.  We may look at the Web 2.0 practice of sharding &lt;a href=&quot;http://dbpedia.org/resource/MySQL&quot; id=&quot;link-id0x432f868&quot;&gt;MySQL&lt;/a&gt;, inconvenient as this may be, for an example.  There is inertia and thus re-architecting is a constant process that is generally in reaction to facts, &lt;i&gt;post hoc&lt;/i&gt;, often a point solution.  One could argue that planning ahead would be smarter but by and large the world does not work so.&lt;/p&gt;

&lt;p&gt;One part of the answer is an infinitely-scalable SQL database that expands and shrinks in the clouds, with the usual semantics, maybe optional eventual consistency and built-in map-reduce.  If such a thing is inexpensive enough and syntax-level-compatible with the present installed base, many developers will not have to learn very much more.&lt;/p&gt;

&lt;p&gt;This is maybe good for the bread-and-butter IT, but European competitiveness should not rest on this.  Therefore we wish to go for bold new application types for which the client-server database application is not the model.  Data-centric languages like BOOM, if they can be made very efficient and have good debugging support, are attractive there.  These do require more intellectual investment but that is not a problem since the less-inquisitive part of the developer community is served by the first part of the answer.&lt;/p&gt;

&lt;/li&gt;
	&lt;li&gt;
    &lt;p&gt;
        &lt;b&gt;How is a developer of average skills going to learn about these new advanced tools? How can we plan for excellent documentation and training, community mentoring, exchange of good practices, etc... across all EU countries?&lt;/b&gt;
    &lt;/p&gt;

&lt;p&gt;For the most part, developers do not learn things for the sake of learning.  When they have learned something and it is adequate, they mostly stay with it and are even reluctant to engage in cross-camp interaction.  The research world is often similarly insular.  A new inflection in the application landscape is needed to drive learning.  This inflection is provided by the &lt;a href=&quot;https://wiki.mozilla.org/Labs/Ubiquity&quot; id=&quot;link-id0x7f051c8&quot;&gt;ubiquity&lt;/a&gt; of mobile devices, sensor data, explicit semantics, NLP concept extraction, web of linked data, and such factors.&lt;/p&gt;

&lt;p&gt;RDFa is a good example of a new technique piggybacking on something everybody uses, namely HTML.  These new things should, within possibility, be deployed in the usual technology stack, &lt;a href=&quot;http://en.wikipedia.org/wiki/LAMP_%28software_bundle%29&quot; id=&quot;link-id0x77151e0&quot;&gt;LAMP&lt;/a&gt; or Java.  Of course these do not have to be LAMP or Java or HTML or HTTP themselves but they must manifest through these.&lt;/p&gt;

&lt;p&gt;A lot of the &lt;a href=&quot;http://dbpedia.org/resource/Semantic_Web&quot; id=&quot;link-id0x7940cd0&quot;&gt;semantic web&lt;/a&gt; potential can be realized within the client-server database application model, thus no fundamental re-architecting, just some new data types and queries.&lt;/p&gt;

&lt;p&gt;For data- or processing-intensive tasks, an on-demand hookup to cloud-based servers with Erlang and/or BOOM as the programming model would be easy enough to learn and utilize.&lt;/p&gt;

&lt;p&gt;The question is one of providing challenges.  Addressing actual challenges with these techniques will lead to maturity, documentation, examples, and training.  With virtual, Europe-wide distributed teams a reality in many places, Europe-wide dissemination is no longer insurmountable.&lt;/p&gt;

&lt;p&gt;As the data overflow proceeds, its victims will multiply and create demand for solutions.  The EC could here encourage research project use cases gaining an extended life past the end of research projects, possibly being maintained and multiplied and spun off.&lt;/p&gt;

&lt;p&gt;If such things could be mutated into self-sustaining service businesses with pay-per-use revenue, say through a cloud SaaS business model, still primarily leveraging an open source technology stack, we could have self-propagating and self-supporting models for exploiting advanced IT.  This would create interest, and interest would drive training and dissemination.&lt;/p&gt;

&lt;p&gt;The problem is creating the pull.&lt;/p&gt;
&lt;/li&gt;
  &lt;/ol&gt;
&lt;/li&gt;

&lt;li&gt;
  &lt;p&gt;
    &lt;b&gt;Challenges&lt;/b&gt;
  &lt;/p&gt;
&lt;ol type=&quot;a&quot; start=&quot;1&quot;&gt;

	&lt;li&gt;
    &lt;p&gt;
        &lt;b&gt;What should be, in this domain, the equivalent of the Netflix challenge, Ansari X Prize, &lt;a href=&quot;http://dbpedia.org/resource/Google&quot; id=&quot;link-id0x7e72f40&quot;&gt;Google&lt;/a&gt; Lunar X Prize, etc. ... ?&lt;/b&gt;
    &lt;/p&gt;

&lt;p&gt;The EC itself no doubt suffers from data overflow in one function or another.  Unless security/secrecy prohibits, simply publishing a large data set and a description of what operations should be done on it would be a start.  The more real the data, the better; reality is consistently more complex and surprising than imagination.  Since many interesting problems touch on fraud detection and law enforcement, there may be some security obstacles for using these application domains as subject matters of open challenges.&lt;/p&gt;

&lt;p&gt;Once there is a good benchmark, as discussed above, there can be some prize money allocated for the winners, especially if the race is tight.&lt;/p&gt;

&lt;p&gt;The Semantic Web Challenge and the Billion Triples Challenge exist and are useful as such, but do not seem to have any huge impact.&lt;/p&gt;

&lt;p&gt;The incentives should be sufficient and part of the expenses arising from running for such challenges could be funded.  Otherwise investing in existing business development will be more interesting to industry.  Some industry participation seems necessary; we would wish academia and industry to work closer.  Also, having industry supply the baseline guarantees that academia actually does further the state of the art.  This is not always certain.&lt;/p&gt;

&lt;p&gt;If challenges are based on actual problems, whether of the EC, its member governments, or private entities, and winning the challenge may lead to a contract for supplying an actual solution, these will naturally become more interesting for consortia involving integrators, specialist software vendors, and academia.  Such a model would build actual capacity to deploy leading edge technologies in production, which is sorely needed.&lt;/p&gt;


&lt;/li&gt;
	&lt;li&gt;
    &lt;p&gt;
        &lt;b&gt;What should one do  to set up such a challenge, administer, and monitor it?&lt;/b&gt;
    &lt;/p&gt;

&lt;p&gt;The EC should probably circulate a call for actual problem scenarios involving big data.  If the matter of the overflow is as dire as represented, cases should be easy to find.  A few should be selected and then anonymized if needed.&lt;/p&gt;

&lt;p&gt;The party with the use case would benefit by having, hopefully, the best people work on it.  The contestants would benefit from having real world needs guide R&amp;amp;D.  The EC would not have to do very much, except possibly use some money for funding the best proposals.  The winner would possibly get a large account and related sales and service income.  The contestants would have to be teams possibly involving many organizations; for example, development and first-line services and support could come from different companies along a systems integrator model such as is widely used in the US.&lt;/p&gt;

&lt;p&gt;There may be a good benchmark at the time, possibly resulting from FP7 itself.  In such a case, the EC could offer a prize for winners.  Details would have to be worked out case by case.  Such a challenge could be repeated a few times, as benchmark-driven progress, in databases or TREC for example, has taken some years to reach the point where it slows down.&lt;/p&gt;

&lt;p&gt;Administering such an activity should not be prohibitive, as most of the expertise can be found with the stakeholders.&lt;/p&gt;

&lt;/li&gt;
  &lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
</description></item><item><title>Social Web Camp (#5 of 5)</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-04-30#1555</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1555#comments</comments><pubDate>Thu, 30 Apr 2009 16:14:02 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2009-04-30T12:51:54-04:00</n0:modified><description>&lt;p&gt;(Last of five posts related to the &lt;a href=&quot;http://www2009.org/&quot; id=&quot;link-id0x112efd58&quot;&gt;WWW 2009&lt;/a&gt; conference, held the week of April 20, 2009.)

&lt;/p&gt;
&lt;p&gt;The social networks camp was interesting, with a special meeting around Twitter. Half jokingly, we (that is, the OpenLink folks attending) concluded that societies would never be completely classless, although mobility between, as well as criteria for membership in, given classes would vary with time and circumstance. Now, there would be a new class division between people for whom micro-blogging is obligatory and those for whom it is an option.&lt;/p&gt;

&lt;p&gt;In my experience, a great deal is possible in a short time, but this possibility depends on focus and concentration. These are increasingly rare. I am a great believer in core competence and focus. This is not only for geeks; one can have a lot of breadth-of-scope but this too depends on not getting sidetracked by constant &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id0x14e380b8&quot;&gt;information&lt;/a&gt; overload.&lt;/p&gt;

&lt;p&gt;Insofar as personal success depends on constant reaction to online social media, this comes at a cost in time and focus and this cost will have to be managed somehow, for example by automation or outsourcing. But if the social media is only automated fronts tweeting and retweeting among themselves, a bit like electronic trading systems do with securities, with or without human operators, the value of the medium decreases.&lt;/p&gt;

&lt;p&gt;There are contradictory requirements. On one hand, what is said in electronic media is essentially permanent, so one had best only say things that are well considered. On the other hand, one must say these things without adequate time for reflection or analysis. To cope with this, one must have a well-rehearsed position that is compacted so that it fits in a short format and is easy to remember and unambiguous to express. A culture of pre-cooked fast-food advertising cuts down on depth. Real-world things are complex and multifaceted. Besides, prevalent patterns of communication train the brain for a certain mode of functioning. If we train for rapid-fire 140-character messaging, we optimize one side but probably at the expense of another. In the meantime, the world continues developing increased complexity by all kinds of emergent effects. Connectivity is good but don&amp;#39;t get lost in it.&lt;/p&gt;

&lt;p&gt;There is &lt;a href=&quot;https://www.cia.gov/library/center-for-the-study-of-intelligence/csi-publications/books-and-monographs/psychology-of-intelligence-analysis/index.html&quot; id=&quot;link-id170cb010&quot;&gt;a CIA memorandum about how analysts misinterpret data and see what they want to see&lt;/a&gt;. This is a relevant resource for understanding some psychology of perception and memory. With the information overload, largely driven by user generated content, interpreting fragmented and variously-biased real-time information is not only for the analyst but for everyone who needs to intelligently function in cyber-social space.&lt;/p&gt;

&lt;p&gt;I participated in discussions on security and privacy and on mobile social networks and context.&lt;/p&gt;

&lt;p&gt;For privacy, the main thing turned out to be whether people should be protected from themselves. Should information expire? Will it get buried by itself under huge volumes of new content? Well, for purposes of visibility, it will certainly get buried and will require constant management to stay visible. But for purposes of future finding of dirt, it will stay findable for those who are looking.&lt;/p&gt;

&lt;p&gt;There is also the corollary of setting security for resources, like documents, versus setting security for statements, i.e., structured data like social networks. As I have blogged before, policies &lt;a id=&quot;link-id14aaff90&quot;&gt;à la&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x13d77830&quot;&gt;SQL&lt;/a&gt; do not work well when schema is fluid and end-users can&amp;#39;t be expected to formulate or understand these. Remember &lt;a href=&quot;http://dbpedia.org/resource/Ted_Nelson&quot; id=&quot;link-id0x156ceae0&quot;&gt;Ted Nelson&lt;/a&gt;? A user interface should be such that a beginner understands it in 10 seconds in an emergency. The user interaction question is how to present things so that the user understands who will have access to what content. Also, users should themselves be able to check what potentially sensitive information can be found out about them. A service along the lines of Garlic&amp;#39;s Data Patrol should be a part of the social web infrastructure of the future.&lt;/p&gt;

&lt;p&gt;People at MIT have developed AIR (Accountability In &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x14e2abc0&quot;&gt;RDF&lt;/a&gt;) for expressing policies about what can be done with data and for explaining why access is denied if it is denied. However, if we look at the history of secrets at all, it is rather seldom that one hears that access to information about X is restricted to compartment so-and-so; it is much more common to hear that there is no X. I would say that a policy system that just leaves out information that is not supposed to be available will please the users more. This is not only so for organizations; it is fully plausible that an individual might not wish to expose even the existence of some selected inner circle of friends, their parties together, or whatever.&lt;/p&gt;

&lt;p&gt;In conclusion, there is no self-evident solution for careless use of social media. A site that requires people to confirm multiple times that they know what they are doing when publishing a photo will not get much use. We will see.&lt;/p&gt;

&lt;p&gt;For mobility, there was some talk about the context of usage. Again, this is difficult. For different contexts, one would for example disclose one&amp;#39;s location at the granularity of the city; for some other purposes, one would say which conference room one is in.&lt;/p&gt;

&lt;p&gt;Embarrassing social situations may arise if mobile devices are too clever: If information about travel is pushed into the social network, one would feel like having to explain why one does not call on such-and-such a person and so on. Too much initiative in the mobile phone seems like a recipe for problems.&lt;/p&gt;

&lt;p&gt;There is a thin line between convenience and having IT infrastructure rule one&amp;#39;s life. The complexities and subtleties of social situations ought not to be reduced to the level of if-then rules. People and their interactions are more complex than they themselves often realize. A system is not its own metasystem, as Gödel put it. Similarly, human self-&lt;a href=&quot;http://dbpedia.org/resource/Knowledge&quot; id=&quot;link-id0x70d82ff8&quot;&gt;knowledge&lt;/a&gt;, let alone knowledge about another, is by this very principle only approximate. Not to forget what psychology tells us about state-dependent recall and of how circumstance can evoke patterns of behavior before one even notices. The history of expert systems did show that people do not do very well at putting their skills in the form of if-then rules. Thus automating sociality past a certain point seems a problematic proposition.&lt;/p&gt;</description></item><item><title>Linked Data &amp; The Year 2009 (updated)</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-01-02#1511</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1511#comments</comments><pubDate>Fri, 02 Jan 2009 16:17:06 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2009-01-02T13:26:42.000003-05:00</n0:modified><description>&lt;p&gt;As is fitting for the season, I will editorialize a bit about what has gone before and what is to come.&lt;/p&gt;

&lt;p&gt;
&lt;a href=&quot;http://www.w3.org/People/Berners-Lee/card#i&quot; id=&quot;link-id1119f250&quot;&gt;Sir Tim&lt;/a&gt; said it at WWW08 in &lt;a href=&quot;http://www2008.org/&quot; id=&quot;link-id0x1dcb93a0&quot;&gt;Beijing&lt;/a&gt;: &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id0x13a3efb8&quot;&gt;linked data&lt;/a&gt; and the linked data &lt;a href=&quot;http://dbpedia.org/resource/Giant_Global_Graph&quot; id=&quot;link-id0x13a44cd0&quot;&gt;web&lt;/a&gt; are the &lt;a href=&quot;http://dbpedia.org/resource/Semantic_Web&quot; id=&quot;link-id0x10d25788&quot;&gt;semantic web&lt;/a&gt; and the Web done right.&lt;/p&gt;

&lt;p&gt;The grail of &lt;i&gt;ad hoc&lt;/i&gt; analytics on infinite &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0xa201d518&quot;&gt;data&lt;/a&gt; has lost none of its appeal.  We have seen fresh evidence of this in the realm of data warehousing products, as well as storage in general.&lt;/p&gt;

&lt;p&gt;The benefits of a data model more abstract than the relational are increasingly appreciated outside data web circles as well. Microsoft&amp;#39;s &lt;a href=&quot;http://dbpedia.org/resource/Entity&quot; id=&quot;link-id0x12fa4e40&quot;&gt;Entity&lt;/a&gt; Framework technology is an example.  Agility has been a buzzword for a long time.  Everything should be offered in a service-based business model and should interoperate and integrate with everything else: business needs first, schema last.&lt;/p&gt;

&lt;p&gt;Not to forget that when money is tight, reuse of existing assets and paying on a usage basis are naturally emphasized.  &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id0x175b32e8&quot;&gt;Information&lt;/a&gt;, being the asset it is, becomes no less important for this; quite the contrary.  But even with information, value should be realized economically, which, among other things, entails not reinventing the wheel.&lt;/p&gt;

&lt;p&gt;It is against this backdrop that this year will play out.&lt;/p&gt;

&lt;p&gt;As concerns research, I will &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1374&quot; id=&quot;link-id1151b128&quot;&gt;again quote&lt;/a&gt; &lt;a href=&quot;http://www.ibiblio.org/hhalpin/#&quot; id=&quot;link-id141cb740&quot;&gt;Harry Halpin&lt;/a&gt; at &lt;a href=&quot;http://www.eswc2008.org/&quot; id=&quot;link-id0x18a8a858&quot;&gt;ESWC 2008&lt;/a&gt;: &amp;quot;Men will fight in a war, and even lose a war, for what they believe just.  And it may come to pass that later, even though the war were lost, the things then fought for will emerge under another name and establish themselves as the prevailing reality&amp;quot; [or words to this effect].&lt;/p&gt;

&lt;p&gt;Something like the data web, and even the semantic web, will happen. Harry&amp;#39;s question was whether this would be the descendant of what is today called semantic web research.&lt;/p&gt;

&lt;p&gt;I heard in conversation about a project for making a very large metadata store.  I also heard that the makers did not particularly insist on this being &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x3c39ed80&quot;&gt;RDF&lt;/a&gt;-based, though.&lt;/p&gt;

&lt;p&gt;Why should such a thing be RDF-based?  If it is already accepted that there will be &lt;i&gt;ad hoc&lt;/i&gt; schema and that queries ought to be able to view the data from all angles, not be limited by having indices one way and not another way, then why not RDF?&lt;/p&gt;

&lt;p&gt;The justification of RDF is in reusing and linking-to data and terminology out there.  Another justification is that by using an RDF store, one is spared a lot of work and tons of compromises which attend making an &lt;a href=&quot;http://dbpedia.org/resource/Entity-attribute-value_model&quot; id=&quot;link-id0x14a77880&quot;&gt;entity&lt;/a&gt;-attribute-value (&lt;a href=&quot;http://dbpedia.org/resource/Entity-attribute-value_model&quot; id=&quot;link-id0x5f978e88&quot;&gt;EAV&lt;/a&gt;, i.e., triple) store on a generic &lt;a href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id0x391bdcd8&quot;&gt;RDBMS&lt;/a&gt;.  The sem-web world has been there, trust me.  We came out well because we put everything inside the RDBMS, at the lowest level, which you can&amp;#39;t do unless you own the RDBMS.  Source access is not enough; you also need the &lt;a href=&quot;http://dbpedia.org/resource/Knowledge&quot; id=&quot;link-id0x138a3a00&quot;&gt;knowledge&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Technicalities aside, the question is one of proprietary vs. standards-based.  This is not only so with software components, where standards have consistently demonstrated benefits, but now also with the data. &lt;a href=&quot;http://www.zemanta.com/&quot; id=&quot;link-id0x5f92cb38&quot;&gt;Zemanta&lt;/a&gt; and &lt;a href=&quot;http://www.opencalais.com/&quot; id=&quot;link-id0x139c3200&quot;&gt;OpenCalais&lt;/a&gt; serving &lt;a href=&quot;http://dbpedia.org/resource/DBpedia&quot; id=&quot;link-id0x1731dc78&quot;&gt;DBpedia&lt;/a&gt; URIs are examples.  Even in entirely closed applications, there is benefit in reusing open vocabularies and identifiers: One does not need to create a secret language for writing a secret memo.&lt;/p&gt;

&lt;p&gt;Where data is a carrier of value, its value is enhanced by it being easy to repurpose (i.e., standard vocabularies) and to discover (i.e., data set metadata).  As on the web, so on the enterprise &lt;a href=&quot;http://dbpedia.org/resource/Intranet&quot; id=&quot;link-id0x1324ada8&quot;&gt;intranet&lt;/a&gt;.  In this lies the strength of RDF as opposed to proprietary flexible database schemes.  This is a qualitative distinction.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt;
 &lt;a href=&quot;http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData&quot; id=&quot;link-id117178a8&quot;&gt;&lt;img src=&quot;http://www.openlinksw.com/images/logos/LoDLogo.gif&quot; alt=&quot;Linking Open Data project logo&quot; /&gt;
 &lt;/a&gt;
&lt;br /&gt;
 &lt;a href=&quot;http://dbpedia.org/resource/In_hoc_signo_vinces&quot; id=&quot;link-id115f47e8&quot;&gt;&lt;i&gt;In hoc signo vinces.&lt;/i&gt;
 &lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;In this light, we welcome the &lt;a href=&quot;http://semanticweb.org/wiki/VoiD&quot; id=&quot;link-id0x67cf560&quot;&gt;voiD&lt;/a&gt; (&lt;a href=&quot;http://semanticweb.org/wiki/VoiD&quot; id=&quot;link-id0x1898c908&quot;&gt;VOcabulary of Interlinked Data&lt;/a&gt;), which is the first promise of making federatable data discoverable. Now that there is a point of focus for these efforts, the needed expressivity will no doubt accrete around the voiD core.&lt;/p&gt;

&lt;p&gt;For data as a service, we clearly see the value of open terminologies as prerequisites for service interchangeability, i.e., creating a marketplace.  &lt;a href=&quot;http://dbpedia.org/resource/XML&quot; id=&quot;link-id0x1588d6a8&quot;&gt;XML&lt;/a&gt; is for the transaction; RDF is for the discovery, query, and analytics.  As with databases in general, first there was the transaction; then there was the query.  Same here.  For monetizing the query, there are models ranging from renting data sets and server capacity in the clouds to hosted services where one pays for processing past a certain quota.  For the hosted case, we just removed a major barrier to offering unlimited query against unlimited data when we completed the &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1374&quot; id=&quot;link-id110b8668&quot;&gt;Virtuoso Anytime&lt;/a&gt; feature.  With this, the user gets what is found within a set time, which is already something, and in case of needing more, one can pay for the usage.  Of course, we do not forget advertising.  When data has explicit semantics, contextuality is better than with keywords.&lt;/p&gt;

&lt;p&gt;For these visions to materialize on top of the linked data platform, linked data must join the world of data.  This means messaging that is geared towards the database public.  They know the problem, but the RDF proposition is still not well enough understood for it to connect.&lt;/p&gt;

&lt;p&gt;For the relational IT world, we offer passage to the data web and its promise of integration through RDF mapping.  We are also bringing out new Microsoft Entity &lt;a href=&quot;http://dbpedia.org/resource/ADO.NET_Entity_Framework&quot; id=&quot;link-id0x13a50fd8&quot;&gt;Framework&lt;/a&gt; components.  This goes in the direction of defining a unified database frontier with RDF and non-RDF entity models side by side.&lt;/p&gt;

&lt;p&gt;For &lt;a href=&quot;http://www.openlinksw.com/dataspace/organization/openlink#this&quot; id=&quot;link-id0x1d2ea7f0&quot;&gt;OpenLink Software&lt;/a&gt;, 2008 was about developing technology for scale, RDF as well as generic relational.  We did show a tiny preview with the &lt;a href=&quot;http://challenge.semanticweb.org/&quot; id=&quot;link-id0x658fbc8&quot;&gt;Billion Triples Challenge&lt;/a&gt; demo.  Now we are set to come out with the real thing, featuring, among other things, faceted search at the billion triple scale.  We &lt;a href=&quot;http://www.openlinksw.com/blog/kidehen@openlinksw.com/blog/?id=1489&quot; id=&quot;link-id150c6090&quot;&gt;started offering ready-to-go Virtuoso-hosted linked open data sets&lt;/a&gt; on Amazon EC2 in December.  Now we continue doing this based on our next-generation server, as well as make Virtuoso 6 Cluster commercially available.  Technical specifics are amply discussed on this &lt;a href=&quot;http://dbpedia.org/resource/Blog&quot; id=&quot;link-id0x1424ec20&quot;&gt;blog&lt;/a&gt;.  There are still some new technology things to be developed this year; first among these are strong &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x14b8ca88&quot;&gt;SPARQL&lt;/a&gt; federation, and on-the-fly resizing of server clusters.  On the research partnerships side, we have an EU grant for working with the OntoWiki project from the University of Leipzig, and we are partners in DERI&amp;#39;s &lt;a href=&quot;https://lion.deri.ie/&quot; id=&quot;link-id115c02f8&quot;&gt;Líon project&lt;/a&gt;.  These will provide platforms for further demonstrating the &amp;quot;web&amp;quot; in data web, as in web-scale smart databasing.&lt;/p&gt;

&lt;p&gt;2009 will see change through scale.  The things that exist will start interconnecting and there will be emergent value.  Deployments will be larger and scale will be readily available through a services model or by installation at one&amp;#39;s own facilities.  We may see the start of Search becoming Find, as &lt;a href=&quot;http://myopenlink.net/dataspace/person/kidehen#this&quot; id=&quot;link-id14e43050&quot;&gt;Kingsley&lt;/a&gt; says, meaning semantics of data guiding search.  Entity extraction will multiply data volumes and bring parts of the data web to real time.&lt;/p&gt;

&lt;p&gt;Exciting 2009 to all.&lt;/p&gt;</description></item><item><title>Virtuoso Anytime:  No Query Is Too Complex (updated)</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-12-11#1495</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1495#comments</comments><pubDate>Thu, 11 Dec 2008 16:13:10 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-12-12T10:29:23-05:00</n0:modified><description>
&lt;p&gt;A persistent argument against the &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id1199d5f8&quot;&gt;linked data&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/Giant_Global_Graph&quot; id=&quot;link-id116f2730&quot;&gt;web&lt;/a&gt; has been the cost, scalability, and vulnerability of &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id14e423c0&quot;&gt;SPARQL&lt;/a&gt; end points, should the linked data web gain serious mass and traffic.&lt;/p&gt;

&lt;p&gt;As we are on the brink of hosting the whole &lt;a href=&quot;http://dbpedia.org/resource/DBpedia&quot; id=&quot;link-id1376a8b0&quot;&gt;DBpedia&lt;/a&gt; &lt;a href=&quot;http://community.linkeddata.org/dataspace/organization/lod#this&quot; id=&quot;link-id113c8d20&quot;&gt;Linked Open Data&lt;/a&gt; cloud in &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id11425a78&quot;&gt;Virtuoso&lt;/a&gt; Cluster, we have had to think of what we&amp;#39;ll do if, for example, somebody decides to count all the triples in the set.&lt;/p&gt;

&lt;p&gt;How can we encourage clever use of &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id116f1210&quot;&gt;data&lt;/a&gt;, yet not die if somebody, whether through malice, lack of understanding, or simple bad luck, submits impossible queries?&lt;/p&gt;

&lt;p&gt;Restricting the language is not the way; any language beyond text search can express queries that will take forever to execute.  Also, just returning a timeout after the first second (or whatever arbitrary time period) leaves people in the dark and does not produce an impression of responsiveness.  So we decided to allow arbitrary queries, and if a quota of time or resources is exceeded, we return partial results and indicate how much processing was done.&lt;/p&gt;

&lt;p&gt;Here we are looking for the top 10 people whom people claim to know without being known in return, like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;SQL&amp;gt; sparql 
SELECT ?celeb, 
       COUNT (*)
WHERE { ?claimant foaf:knows ?celeb .
        FILTER (!bif:exists ( SELECT (1) 
                              WHERE { ?celeb foaf:knows ?claimant }
                            )
               )
      } 
GROUP BY ?celeb 
ORDER BY DESC 2 
LIMIT 10;&lt;br /&gt;
celeb                                      callret-1
VARCHAR                                    VARCHAR
________________________________________   _________&lt;br /&gt;
http://twitter.com/BarackObama             252
http://twitter.com/brianshaler             183
http://twitter.com/newmediajim             101
http://twitter.com/HenryRollins            95
http://twitter.com/wilw                    81
http://twitter.com/stevegarfield           78
http://twitter.com/cote                    66
mailto:adam.westerski@deri.org             66
mailto:michal.zaremba@deri.org             66
http://twitter.com/dsifry                  65&lt;br /&gt;
*** Error S1TAT: [Virtuoso Driver][Virtuoso Server]RC...: Returning incomplete 
results, query interrupted by result timeout.  
Activity:      1R rnd      0R seq      0P disk  1.346KB /      3 messages&lt;br /&gt;
SQL&amp;gt; sparql 
SELECT ?celeb, 
       COUNT (*)
WHERE { ?claimant foaf:knows ?celeb .
        FILTER (!bif:exists ( SELECT (1) 
                              WHERE { ?celeb foaf:knows ?claimant }
                            )
               )
      } 
GROUP BY ?celeb 
ORDER BY DESC 2 
LIMIT 10;&lt;br /&gt;
celeb                                      callret-1
VARCHAR                                    VARCHAR
________________________________________   _________&lt;br /&gt;
http://twitter.com/JasonCalacanis          496
http://twitter.com/Twitterrific            466
http://twitter.com/ev                      442
http://twitter.com/BarackObama             356
http://twitter.com/laughingsquid           317
http://twitter.com/gruber                  294
http://twitter.com/chrispirillo            259
http://twitter.com/ambermacarthur          224
http://twitter.com/t                       219
http://twitter.com/johnedwards             188&lt;br /&gt;
*** Error S1TAT: [Virtuoso Driver][Virtuoso Server]RC...: Returning incomplete 
results, query interrupted by result timeout.  
Activity:    329R rnd   44.6KR seq    342P disk  638.4KB /     46 messages&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;The first query read all data from disk; the second run had the working set from the first and could read some more before time ran out, hence the results were better.  But the response time was the same.&lt;/p&gt;

&lt;p&gt;If one has a query that just loops over consecutive joins, like in basic SPARQL, interrupting the processing after a set time period is simple.  But such queries are not very interesting.  To give meaningful partial answers with nested aggregation and sub-queries requires some more tricks.  The basic idea is to terminate the innermost active sub-query/aggregation at the first timeout, and extend the timeout a bit so that accumulated results get fed to the next aggregation, like from the &lt;code&gt;GROUP BY&lt;/code&gt; to the &lt;code&gt;ORDER BY&lt;/code&gt;.  If this again times out, we continue with the next outer layer.  This guarantees that results are delivered if there were any results found for which the query pattern is true.  False results are not produced, except in cases where there is comparison with a count and the count is smaller than it would be with the full evaluation.&lt;/p&gt;

&lt;p&gt;One can also use this as a basis for paid services.  The cutoff does not have to be time; it can also be in other units, making it insensitive to concurrent usage and variations of working set.&lt;/p&gt;
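
&lt;p&gt;The cutoff logic described above can be illustrated with a minimal sketch. This is not the Virtuoso implementation; the function name, the in-memory edge list, and the budget counted in rows scanned (rather than in seconds, which keeps the example deterministic) are all assumptions made for illustration. The inner scan stops when the budget runs out, but the groups accumulated so far are still handed to the outer ordering step, and the caller is told that the answer is incomplete and how much work was done.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;
```python
from collections import Counter


def anytime_top_k(edges, k, budget):
    """Top-k 'known but does not know back' counts under a work budget.

    Hypothetical illustration only. Returns (rows, complete, work_done).
    """
    # Stands in for the index of the store; its build cost is not
    # charged to the budget in this sketch.
    known = set(edges)
    counts = Counter()
    work_done = 0
    complete = True
    for claimant, celeb in edges:
        if work_done == budget:
            # Budget exhausted: stop the innermost scan but keep
            # whatever the aggregation has accumulated so far.
            complete = False
            break
        work_done += 1
        # The FILTER step: count only claims that are not reciprocated.
        if (celeb, claimant) not in known:
            counts[celeb] += 1
    # The outer stage (ordering and limit) still runs on the partial
    # groups, so some rows come back whenever any were found.
    rows = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    return rows, complete, work_done


edges = [("a", "x"), ("b", "x"), ("x", "b"), ("c", "y"), ("d", "x"), ("e", "y")]
full = anytime_top_k(edges, k=2, budget=100)
partial = anytime_top_k(edges, k=2, budget=4)
```
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;In this sketch the partial run can only report counts that are lower than or equal to those of the full run, which is the same caveat as noted above for comparisons against a count.&lt;/p&gt;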

&lt;p&gt;This system will be deployed on our &lt;a href=&quot;http://challenge.semanticweb.org/&quot; id=&quot;link-id11500a58&quot;&gt;Billion Triples Challenge&lt;/a&gt; &lt;a href=&quot;http://b3s.openlinksw.com/&quot; id=&quot;link-id11683120&quot;&gt;demo instance&lt;/a&gt; in a few days, after some more testing.  When Virtuoso 6 ships, all &lt;a href=&quot;http://community.linkeddata.org/dataspace/organization/lod#this&quot; id=&quot;link-id1157a500&quot;&gt;LOD&lt;/a&gt; Cloud AMIs and OpenLink-hosted LOD Cloud SPARQL endpoints will have this enabled by default.  (AMI users will be able to disable the feature, if desired.)  The feature works with Virtuoso 6 in both single server and cluster deployment.&lt;/p&gt;</description></item><item><title>ISWC 2008: RDB2RDF Face-to-Face</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-11-04#1477</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1477#comments</comments><pubDate>Tue, 04 Nov 2008 13:26:19 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-11-04T17:20:35-05:00</n0:modified><description>&lt;p&gt;The W3C&amp;#39;s RDB-to-&lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x153bdcf8&quot;&gt;RDF&lt;/a&gt; mapping incubator group (&lt;a href=&quot;http://www.w3.org/2005/Incubator/rdb2rdf/&quot; id=&quot;link-id0x13e3e6b8&quot;&gt;RDB2RDF XG&lt;/a&gt;) met in &lt;a href=&quot;http://dbpedia.org/resource/Karlsruhe&quot; id=&quot;link-id0x15236b08&quot;&gt;Karlsruhe&lt;/a&gt; after &lt;a href=&quot;http://iswc2008.semanticweb.org/&quot; id=&quot;link-id0x2450fba8&quot;&gt;ISWC 2008&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The meeting was about writing a charter for a working group that would define a standard for mapping relational databases to RDF, either for purposes of import into RDF stores or of query mapping from &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x14c84338&quot;&gt;SPARQL&lt;/a&gt; to &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x146db368&quot;&gt;SQL&lt;/a&gt;. There was a lot of agreement and the meeting even finished ahead of the allotted time.&lt;/p&gt;

&lt;h2&gt;Whose Identifiers?&lt;/h2&gt;

&lt;p&gt;There was discussion concerning using the &lt;a href=&quot;http://dbpedia.org/resource/Entity&quot; id=&quot;link-id0x12c15e58&quot;&gt;Entity&lt;/a&gt; Name Service from the Okkam project for assigning URIs to entities mapped from relational databases. This makes sense when talking about long-lived, legal entities, such as people or companies or geography. Of course, there are cases where this makes no sense; for example, a purchase order or maintenance call hardly needs an identifier registered with the ENS. The problem is, in practice, a CRM could mention customers that have an ENS registered ID (or even several such IDs) and others that have none. Of course, the CRM&amp;#39;s reference cannot depend on any registration. Also, even when there is a stable &lt;a href=&quot;http://dbpedia.org/resource/Uniform_Resource_Identifier&quot; id=&quot;link-id0x12b7b5c0&quot;&gt;URI&lt;/a&gt; for the entity, a CRM may need a key that specifies some administrative subdivision of the customer.&lt;/p&gt;

&lt;p&gt;Also we note that an on-demand RDB-to-RDF mapping may have some trouble dealing with &amp;quot;same as&amp;quot; assertions. If names that are anything other than string forms of the keys in the system must be returned, there will have to be a lookup added to the RDB. This is an administrative issue. Certainly going over the network to ask for names of items returned by queries has a prohibitive cost. It would be good for ad hoc integration to use shared URIs when possible. The trouble of adding and maintaining lookups for these, however, makes this more expensive than just mapping to RDF and using literals for joining between independently maintained systems.&lt;/p&gt;

&lt;h2&gt;
&lt;a href=&quot;http://dbpedia.org/resource/XML&quot; id=&quot;link-id0x14bf7da0&quot;&gt;XML&lt;/a&gt; or RDF?&lt;/h2&gt;

&lt;p&gt;We talked about having a language for human consumption and another for discovery and machine processing of mappings. Would this latter be XML or RDF based? Describing every detail of syntax for a mapping as RDF is really tedious. Also such descriptions are very hard to query, just as &lt;a href=&quot;http://dbpedia.org/resource/Web_Ontology_Language&quot; id=&quot;link-id0x1493ffc0&quot;&gt;OWL&lt;/a&gt; ontologies are. One solution is to have opaque strings embedded into RDF, just like XSLT has &lt;a href=&quot;http://dbpedia.org/resource/XPath&quot; id=&quot;link-id0x1400fe98&quot;&gt;XPath&lt;/a&gt; in string form embedded into XML. Maybe it will end up this way here as well. Having a complete XML mapping of the parse tree for mappings, XQueryX-style, could be nice for automatic generation of mappings with XSLT from an XML view of the &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id0x14c846d8&quot;&gt;information&lt;/a&gt; schema. But then XSLT can also produce text, so an XML syntax that has every detail of a mapping language as distinct elements is not really necessary for this.&lt;/p&gt;

&lt;p&gt;Another matter is then describing the RDF generated by the mapping in terms of RDFS or OWL. This would be a by-product of declaring the mapping. Most often, I would presume the target ontology to be given, though, reducing the need for this feature. But if RDF mapping is used for discovery of &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x14f6f128&quot;&gt;data&lt;/a&gt;, such a description of the exposed data is essential.&lt;/p&gt;

&lt;h2&gt;Interoperability&lt;/h2&gt;

&lt;p&gt;We agreed with &lt;a href=&quot;http://www.informatik.uni-leipzig.de/~auer/foaf.rdf#me&quot; id=&quot;link-id0x1e776730&quot;&gt;Sören Auer&lt;/a&gt; that we could make &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x1477ad18&quot;&gt;Virtuoso&lt;/a&gt;&amp;#39;s mapping language compatible with &lt;a href=&quot;http://triplify.org/&quot; id=&quot;link-id0x15514388&quot;&gt;Triplify&lt;/a&gt;. Triplify is very simple, extraction only, no SPARQL, but does have the benefit of expressing everything in SQL. As it happens, I would be the last person to tell a web developer what language to program in. So if it is SQL, then let it stay SQL. Technically, a lot of the information the Virtuoso mapping expresses is contained in the Triplify SQL statements, but not all. Some extra declarations are needed still but can have reasonable defaults.&lt;/p&gt;

&lt;p&gt;There are two ways of stating a mapping. Virtuoso starts with the triple and says which tables and columns will produce the triple. Triplify starts with the SQL statement and says what triples it produces. These are fairly equivalent. For the web developer, the latter is likely more self-evident, while the former may be more compact and have less repetition.&lt;/p&gt;
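
&lt;p&gt;To make the two directions concrete, here is a small sketch using SQLite. It uses neither the Virtuoso mapping language nor the actual Triplify configuration format; the function names, the tuple layout of the patterns, the sample table, and the convention that the first selected column is the key while the remaining column aliases act as predicates are all assumptions made for illustration.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;
```python
import sqlite3


def triples_from_sql(conn, class_name, sql):
    """SQL-first: start from a statement and say what triples it produces.

    Assumed convention: first column is the key, every other column
    name (alias) is used as the predicate.
    """
    cur = conn.execute(sql)
    predicates = [col[0] for col in cur.description][1:]
    triples = []
    for row in cur:
        subject = f"{class_name}/{row[0]}"
        for predicate, value in zip(predicates, row[1:]):
            if value is not None:
                triples.append((subject, predicate, value))
    return triples


def triples_from_patterns(conn, patterns):
    """Triple-first: start from the triple and name the table and columns behind it."""
    triples = []
    for subject_template, predicate, table, key_col, value_col in patterns:
        # Identifiers are interpolated directly; acceptable only because
        # the patterns here are hard-coded illustration data.
        query = f"SELECT {key_col}, {value_col} FROM {table}"
        for key, value in conn.execute(query):
            if value is not None:
                triples.append((subject_template.format(key), predicate, value))
    return triples


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, title TEXT, author TEXT)")
conn.execute("INSERT INTO post VALUES (1, 'Hello', 'alice'), (2, 'Again', NULL)")

sql_first = triples_from_sql(
    conn,
    "post",
    'SELECT id, title AS "dc:title", author AS "dc:creator" FROM post',
)
triple_first = triples_from_patterns(conn, [
    ("post/{}", "dc:title", "post", "id", "title"),
    ("post/{}", "dc:creator", "post", "id", "author"),
])
```
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;Both calls yield the same set of triples for this table. The SQL-first form reads like an ordinary query, whereas the triple-first form repeats less once several predicates draw on the same table.&lt;/p&gt;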

&lt;p&gt;Virtuoso and Triplify alone would give us the two interoperable implementations required from a working group, supposing the language were annotations on top of SQL. This would be a guarantee of delivery, as we would be close enough to the result from the get go.&lt;/p&gt;

&lt;h2&gt;Related Web resources&lt;/h2&gt;
&lt;ul&gt;
 &lt;li&gt;
  &lt;a href=&quot;http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VOSSQL2RDF&quot; id=&quot;link-id14e27040&quot;&gt;OpenLink Virtuoso: Open-Source Edition: Mapping SQL Data to RDF&lt;/a&gt;
 &lt;/li&gt;
&lt;li&gt;
  &lt;a href=&quot;http://virtuoso.openlinksw.com/Whitepapers/pdf/Virtuoso_SQL_to_RDF_Mapping.pdf&quot; id=&quot;link-id1baad3a8&quot;&gt;Virtuoso RDF Views: Getting Started Guide (PDF)&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Virtuoso - Are We Too Clever for Our Own Good? (updated)</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-10-26#1467</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1467#comments</comments><pubDate>Sun, 26 Oct 2008 12:15:35 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-10-27T12:07:58-04:00</n0:modified><description>&lt;p&gt;&amp;quot;Physician, heal thyself,&amp;quot; it is said. We profess to say what the messaging of the &lt;a href=&quot;http://dbpedia.org/resource/Semantic_Web&quot; id=&quot;link-id0x1b4a25f0&quot;&gt;semantic web&lt;/a&gt; ought to be, but is our own perfect?&lt;/p&gt;

&lt;p&gt;I will here engage in some critical introspection as well as amplify on some answers given to &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x1e4f9928&quot;&gt;Virtuoso&lt;/a&gt;-related questions in recent times.&lt;/p&gt;

&lt;p&gt;I use some conversations from the &lt;a href=&quot;http://dbpedia.org/resource/Vienna&quot; id=&quot;link-id0x1e6c0ca8&quot;&gt;Vienna&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id0x1e56df88&quot;&gt;Linked Data&lt;/a&gt; Practitioners meeting as a starting point. These views are mine and are limited to the Virtuoso server. These do not apply to the &lt;a href=&quot;http://dbpedia.org/resource/OpenLink_Data_Spaces&quot; id=&quot;link-id0x1e680440&quot;&gt;ODS&lt;/a&gt; (&lt;a href=&quot;http://dbpedia.org/resource/OpenLink_Data_Spaces&quot; id=&quot;link-id0x1e140068&quot;&gt;OpenLink Data Spaces&lt;/a&gt;) applications line, &lt;a href=&quot;http://oat.openlinksw.com/&quot; id=&quot;link-id0x1f4ba630&quot;&gt;OAT&lt;/a&gt; (&lt;a href=&quot;http://oat.openlinksw.com/&quot; id=&quot;link-id0x1ba4bac8&quot;&gt;OpenLink Ajax Toolkit&lt;/a&gt;), or &lt;a href=&quot;http://ode.openlinksw.com/&quot; id=&quot;link-id0x1d4159b0&quot;&gt;ODE&lt;/a&gt; (&lt;a href=&quot;http://ode.openlinksw.com/&quot; id=&quot;link-id0x1e973c80&quot;&gt;OpenLink Data Explorer&lt;/a&gt;).&lt;/p&gt;

&lt;h3&gt;&amp;quot;It is not always clear what the main thrust is, we get the impression that you are spread too thin,&amp;quot; said &lt;a href=&quot;http://www.informatik.uni-leipzig.de/~auer/foaf.rdf#me&quot; id=&quot;link-id0x1f8bafe0&quot;&gt;Sören Auer&lt;/a&gt;.&lt;/h3&gt;

&lt;p&gt;Well, personally, I am all for core competence. This is why I do not participate in all the online conversations and groups as much as I could, for example. Time and energy are critical resources and must be invested where they make a difference. In this case, the real core competence is running in the database race. This in itself, come to think of it, is a pretty broad concept.&lt;/p&gt;

&lt;p&gt;This is why we put a lot of emphasis on Linked Data and the &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x200bd1f0&quot;&gt;Data&lt;/a&gt; Web for now, as this is the emerging game. This is a deliberate choice, not an outside imperative or built-in limitation. More specifically, this means exposing any pre-existing relational data as linked data plus being the definitive &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x1fb03528&quot;&gt;RDF&lt;/a&gt; store.&lt;/p&gt;

&lt;p&gt;We can do this because we own our database and &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x1e7dcc70&quot;&gt;SQL&lt;/a&gt; and data access middleware and have a history of connecting to any &lt;a href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id0x1e9baf18&quot;&gt;RDBMS&lt;/a&gt; out there.&lt;/p&gt;

&lt;p&gt;The principal message we have been hearing from the RDF field is the call for scale of triple storage. This is even louder than the call for relational mapping. We believe that in time mapping will exceed triple storage as such, once we get some real production strength mappings deployed, enough to outperform RDF warehousing.&lt;/p&gt;

&lt;p&gt;There are also RDF middleware things like RDF-ization and demand-driven web harvesting (i.e., the so-called Sponger). These are &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x1f5f6b78&quot;&gt;SPARQL&lt;/a&gt; options, thus accessed via standard interfaces. We have little desire to create our own languages or APIs, or to tell people how to program. This is why we recently introduced &lt;a href=&quot;http://sourceforge.net/projects/sesame/&quot; id=&quot;link-id0x206818c8&quot;&gt;Sesame&lt;/a&gt;- and &lt;a href=&quot;http://jena.sourceforge.net/&quot; id=&quot;link-id0x202b3348&quot;&gt;Jena&lt;/a&gt;-compatible APIs to our RDF store. From what we hear, these work. On the other hand, we do not hesitate to move beyond the standards when there is obvious value or necessity. This is why we brought SPARQL up to and beyond SQL expressivity. It is not a case of E3 (Embrace, Extend, Extinguish).&lt;/p&gt;

&lt;p&gt;Now, this message could be better reflected in our material on the web. This &lt;a href=&quot;http://dbpedia.org/resource/Blog&quot; id=&quot;link-id0x1c82e508&quot;&gt;blog&lt;/a&gt; is a rather informal step in this direction; more is to come. For now we concentrate on delivering.&lt;/p&gt;

&lt;p&gt;The conventional communications wisdom is to split the message by target audience. For this, we should split the RDF, relational, and web services messages from each other. We believe that a challenger, like the semantic web technology stack, must have a compelling message to tell for it to be interesting. This is not a question of research prototypes. The new technology cannot lack something the installed technology takes for granted.&lt;/p&gt;

&lt;p&gt;This is why we do not tend to show things like how to insert and query a few triples: No business out there will insert and query triples for the sake of triples. There must be a more compelling story: for example, turning the whole world into a database. This is why our examples start with things like turning the &lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id0x20832510&quot;&gt;TPC-H&lt;/a&gt; database into RDF, queries and all. Anything less is not interesting. Why would an enterprise that has business intelligence and integration issues way more complex than the rather stereotypical TPC-H even look at a technology that pretends to be all for integration and all for expressivity of queries, yet cannot answer the first question of the entry exam?&lt;/p&gt;

&lt;p&gt;The world out there is complex. But maybe we ought to make some simple tutorials? So, as a call to the people out there, tell us what a good tutorial would be. The question is more one of figuring out what is already out there, adapting it, and making a sort of compatibility list.  Jena and Sesame stuff ought to run as is. We could offer a webinar to all the data web luminaries showing how to promote the data web message with Virtuoso. After all, why not show it on the best platform?&lt;/p&gt;

&lt;h3&gt;&amp;quot;You are arrogant. When I read your papers or documentation, the impression I get is that you say you are smart and the reader is stupid.&amp;quot;&lt;/h3&gt;

&lt;p&gt;We should answer in multiple  parts.&lt;/p&gt;

&lt;p&gt;For general collateral, like web sites and documentation:&lt;/p&gt;

&lt;p&gt;The web site gives a confused product image.  For the Virtuoso product, we should divide at the top into&lt;/p&gt;

&lt;ul&gt;  
&lt;li&gt; Data web and RDF - Host linked data, expose relational assets as linked data;&lt;/li&gt;
&lt;li&gt; Relational Database - Full function, high performance, open source, Federated/Virtual Relational DBMS, expose heterogeneous RDB assets through one point of contact for integration;&lt;/li&gt;
&lt;li&gt; Web Services - access all the above over standard protocols, dynamic web pages, web hosting.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For each point, one simple statement.  We all know what the above things mean?&lt;/p&gt;

&lt;p&gt;Then we add a new point about scalability that impacts all the above, namely the Virtuoso version 6 Cluster, meaning that you can do all these things at 10 to 1000 times the scale. This means that much more data or, in some cases, that many more requests per second. This too is clear.&lt;/p&gt;

&lt;p&gt;As far as I am concerned, hosting Java or .&lt;a href=&quot;http://dbpedia.org/resource/.NET_Framework&quot; id=&quot;link-id0x20283a88&quot;&gt;NET&lt;/a&gt; does not have to be on the front page. Also, we have no great interest in going against &lt;a href=&quot;http://dbpedia.org/resource/Apache&quot; id=&quot;link-id0x2024a068&quot;&gt;Apache&lt;/a&gt; when it comes to a web-server-only situation. The fact that we have a web listener is important for some things but our claim to fame does not rest on this.&lt;/p&gt;

&lt;p&gt;Then for documentation and training materials: The documentation should be better. Specifically it should have more of a how-to dimension since nobody reads the whole thing anyhow. About online tutorials, the order of presentation should be different. They do not really reflect what is important at the present moment either.&lt;/p&gt;

&lt;p&gt;Now for conference papers: Since taking the data web as a focus area, we have submitted some papers and had some rejected because these do not have enough references and do not explain what is obvious to ourselves.&lt;/p&gt;

&lt;p&gt;I think that the communications failure in this case is that we want to talk about end to end solutions and the reviewers expect research. For us, the solution is interesting and exists only if there is an adequate functionality mix for addressing a specific use case. This is why we do not make a paper about query cost model alone because the cost model, while indispensable, is a thing that is taken for granted where we come from. So we mention RDF adaptations to cost model, as these are important to the whole but do not find these to be the justification for a whole paper. If we made papers on this basis, we would have to make five times as many. Maybe we ought to.&lt;/p&gt;

&lt;h3&gt;&amp;quot;Virtuoso is very big and very difficult&amp;quot;&lt;/h3&gt;

&lt;p&gt;One thing that is not obvious from the Virtuoso packaging is that the minimum installation is an executable under 10MB and a config file. Two files.&lt;/p&gt;

&lt;p&gt;This gives you SQL and SPARQL out of the box.  Adding &lt;a href=&quot;http://dbpedia.org/resource/Open_Database_Connectivity&quot; id=&quot;link-id0x1ee61058&quot;&gt;ODBC&lt;/a&gt; and &lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id0x1b8c31c0&quot;&gt;JDBC&lt;/a&gt; clients is as simple as it gets. After this, there is basic database functionality. Tuning is a matter of a few parameters that are explained on this blog and elsewhere. Also, the full scale installation is available as an Amazon EC2 image, so no installation required.&lt;/p&gt;

&lt;p&gt;Now for the difficult side:&lt;/p&gt;

&lt;p&gt;Use SQL and SPARQL; use stored procedures whenever there is server side business logic. For some time critical web pages, use VSP. Do not use VSPX. Otherwise, use whatever you are used to: &lt;a href=&quot;http://dbpedia.org/resource/PHP&quot; id=&quot;link-id0x20a13c00&quot;&gt;PHP&lt;/a&gt; or Java or anything else. For web services, simple is best. Stick to basics. &amp;quot;The engineer is one who can invent a simple thing.&amp;quot; Use SQL statements rather than admin UI.&lt;/p&gt;

&lt;p&gt;Know that you can start a server with no database file and you get an initial database with nothing extra. The demo database, the way it is produced by installers, is cluttered.&lt;/p&gt;

&lt;p&gt;We should put this into a couple of use case oriented how-tos.&lt;/p&gt;

&lt;p&gt;Also, we should create a network of &amp;quot;friendly local Virtuoso geeks&amp;quot; for providing basic training and services so we do not have to explain these things all the time. To all you data-web-ers out there: please sign up and we will provide instructions, etc. Contact Yrjänä Rankka (ghard[at-sign]openlinksw.com), or go through the mailing lists; do not contact me directly.&lt;/p&gt;

&lt;h3&gt;&amp;quot;OK, we understand that you may be good at the large end of the spectrum but how do you reconcile this with the lightweight or embedded end, like the semantic desktop?&amp;quot;&lt;/h3&gt;

&lt;p&gt;Now, what is good for one end is usually good for the other. Namely, a database, no matter the scale, needs to have space efficient storage, fast index lookup, and correct query plans. Then there are things that occur only at the high-end, like clustering, but these are separate things. For embedding, the initial memory footprint needs to be small. With Virtuoso, this is accomplished by leaving out some 200 built-in tables and 100,000 lines of SQL procedures that are normally in by default, supporting things such as DAV and diverse other protocols. After all, if SPARQL is all one wants these are not needed.&lt;/p&gt;

&lt;p&gt;If one really wants to do one&amp;#39;s server logic (like web listener and thread dispatching) oneself, this is not impossible but requires some advice from us. On the other hand, if one wants to have logic for security close to the data, then using stored procedures is recommended; these execute right next to the data, and support inline SPARQL and SQL. Depending on the license status of the other code, some special licensing arrangements may apply.&lt;/p&gt;

&lt;p&gt;We are talking about such things with different parties at present.&lt;/p&gt;
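
&lt;p&gt;As a rough illustration of logic that runs next to the data, here is a minimal sketch of a stored procedure with inline SPARQL. The procedure name count_labelled and its parameter are made up for this example, and the ?:lbl notation for passing a procedure variable into the query is used on the assumption that it behaves as described in the Virtuoso documentation; treat it as a starting point rather than a reference implementation.&lt;/p&gt;

&lt;pre&gt;-- Hypothetical example: count the subjects whose rdfs:label equals a given string.
create procedure count_labelled (in lbl varchar)
{
  declare n integer;
  n := 0;
  -- Inline SPARQL; ?:lbl is assumed to refer to the procedure parameter lbl.
  for (sparql select ?s where { ?s rdfs:label ?o . filter (str (?o) = ?:lbl) }) do
    {
      n := n + 1;
    }
  return n;
}
;
&lt;/pre&gt;

&lt;p&gt;A client would then call something like select count_labelled (&amp;#39;Berlin&amp;#39;); so that only the final count leaves the server, and any access checks can live in the same procedure.&lt;/p&gt;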

&lt;h3&gt;&amp;quot;How webby are you?  What is webby?&amp;quot;&lt;/h3&gt;

&lt;p&gt;&amp;quot;Webby means distributed, heterogeneous, open; not monolithic consolidation of everything.&amp;quot;&lt;/p&gt;

&lt;p&gt;We are philosophically webby. We come from open standards; we are after all called OpenLink; our history consists of connecting things. We believe in choice: the user should be able to pick the best of breed for components and have them work together. We cannot and do not wish to force replacement of existing assets. Transforming data on the fly and connecting systems, leaving data where it originally resides, is the first preference. For the data web, the first preference is a federation of independent SPARQL end points. When there is harvesting, we prefer to do it on demand, as with our Sponger. With the immense amount of data out there we believe in finding what is relevant &lt;i&gt;when&lt;/i&gt; it is relevant, preferably close at hand, leveraging things like social networks. With a data web, many things which are now siloized, such as marketplaces and social networks, will return to the open.&lt;/p&gt;

&lt;p&gt;Google-style crawling of everything becomes less practical if one needs to run complex &lt;i&gt;ad hoc&lt;/i&gt; queries against the mass of data. For these types of scenarios, if one needs to warehouse, the data cloud will offer solutions where one pays for database on demand. While we believe in loosely coupled federation where possible, we have serious work on the scalability side for the data center and the compute-on-demand cloud.&lt;/p&gt;

&lt;h3&gt;&amp;quot;How does OpenLink see the next five years unfolding?&amp;quot;&lt;/h3&gt;

&lt;p&gt;Personally, I think we have the basics for the birth of a new inflection in the &lt;a href=&quot;http://dbpedia.org/resource/Knowledge&quot; id=&quot;link-id0x1fb9ae58&quot;&gt;knowledge&lt;/a&gt; economy. The &lt;a href=&quot;http://dbpedia.org/resource/Uniform_Resource_Identifier&quot; id=&quot;link-id0x1f07c648&quot;&gt;URI&lt;/a&gt; is the unit of exchange; its value and competitive edge lie in the data it links you with. A name without context is worth little, but as a name gets more use, more &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id0x1f007d60&quot;&gt;information&lt;/a&gt; can be found through that name. This is anything from financial statistics, to legal precedents, to news reporting or government data. Right now, if the SEC just added one line of markup to the XBRL template, this would instantaneously make all SEC-mandated reporting into linked data via GRDDL.&lt;/p&gt;

&lt;p&gt;The URI is a carrier of brand. An information brand gets traffic and references, and this can be monetized in diverse ways. The key word is &lt;i&gt;context&lt;/i&gt;. Information overload is here to stay, and only better context offers the needed increase in productivity to stay ahead of the flood.&lt;/p&gt;

&lt;p&gt;Semantic technologies on the whole can help with this. Why these should be semantic web or data web technologies as opposed to just semantic is the linked data value proposition. Even smart islands are still islands. Agility, scale, and scope depend on the possibility of combining things. Therefore common terminologies and dereferenceability and discoverability are important. Without these, we are at best dealing with closed systems even if they were smart. The expert systems of the 1980s are a case in point.&lt;/p&gt;

&lt;p&gt;Ever since the .com era, the &lt;a href=&quot;http://dbpedia.org/resource/Uniform_Resource_Locator&quot; id=&quot;link-id0x2048e670&quot;&gt;URL&lt;/a&gt; has been a brand. Now it becomes a URI. Thus, entirely hiding the URI from the user experience is not always desirable. The URI is a sort of handle on the provenance and where more can be found; besides, people are already used to these.&lt;/p&gt;

&lt;p&gt;With linked data, information value-add products become easy to build and deploy. They can be basically just canned SPARQL queries combining data in a useful and insightful manner. And where there is traffic there can be monetization, whether by advertising, subscription, or other means. Such possibilities are a natural adjunct to the blogosphere. To publish analysis, one no longer needs to be a think tank or media company. We could call this scenario the birth of a meshup economy.&lt;/p&gt;

&lt;p&gt;For OpenLink itself, this is our roadmap. The immediate future is about getting our high end offerings like clustered RDF storage generally available, both on the cloud and for private data centers. Ourselves, we will offer the whole &lt;a href=&quot;http://community.linkeddata.org/dataspace/organization/lod#this&quot; id=&quot;link-id0x1c696170&quot;&gt;Linked Open Data&lt;/a&gt; cloud as a database. The single feature to come in version 2 of this is fully automatic partitioning and repartitioning for on-demand scale; now, you have to choose how many partitions you have.&lt;/p&gt;

&lt;p&gt;This makes some things possible that were hard thus far.&lt;/p&gt;

&lt;p&gt;On the mapping front, we go for real-scale data integration scenarios where we can show that SPARQL can unify terms and concepts across databases, yet bring no added cost for complex queries. Enterprises can use their existing warehouses and have an added level of abstraction, the possibility of cross systems interlinking, the advantages of using the same taxonomies and ontologies across systems, and so forth.&lt;/p&gt;

&lt;p&gt;Then there will be developments in the direction of smarter web harvesting on demand with the Virtuoso &lt;a href=&quot;http://virtuoso.openlinksw.com/Whitepapers/html/VirtSpongerWhitePaper.html&quot; id=&quot;link-id0x206ab780&quot;&gt;Sponger&lt;/a&gt;, and federation of heterogeneous SPARQL end points. The federation is not so unlike clustering, except the time scales are 2 orders of magnitude longer. The work on SPARQL end point statistics and data set description and discovery is a good development in the community.&lt;/p&gt;

&lt;p&gt;Then there will be NLP integration, as exemplified by the Open Calais linked data wrapper and more.&lt;/p&gt;

&lt;p&gt;Can we pull this off or is this being spread too thin? We know from experience that all this can be accomplished. Scale is already here; we show it with the billion triples set. Mapping is here; we showed it last in the Berlin Benchmark. We will also show some TPC-H results after we get a little quiet after the ISWC event.  Then there is ongoing maintenance but with this we have shown a steady turnaround and quick time to fix for pretty much anything.&lt;/p&gt;</description></item><item><title>State of the Semantic Web, Part 2 - The Technical Questions (updated)</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-10-26#1466</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1466#comments</comments><pubDate>Sun, 26 Oct 2008 12:02:43 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-10-27T11:28:14-04:00</n0:modified><description>&lt;p&gt;Here I will talk about some more technical questions that came up.  This is mostly general; &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x205901a0&quot;&gt;Virtuoso&lt;/a&gt; specific questions and answers are separate.
&lt;/p&gt;

&lt;h3&gt;&amp;quot;How to Bootstrap?  Where will the triples come from?&amp;quot;&lt;/h3&gt;

&lt;p&gt;There are already wrappers producing &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x13519ac8&quot;&gt;RDF&lt;/a&gt; from many applications. Since any structured or semi-structured &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x1c93b418&quot;&gt;data&lt;/a&gt; can be converted to RDF and often there is even a pre-existing terminology for the application domain, the availability of the data &lt;i&gt;per se&lt;/i&gt; is not the concern.&lt;/p&gt;

&lt;p&gt;The triples may come from any application or database, but they will not come from the end user directly.  There was a good talk about photograph annotation in &lt;a href=&quot;http://dbpedia.org/resource/Vienna&quot; id=&quot;link-id0x1ea9d150&quot;&gt;Vienna&lt;/a&gt;, describing many ways of deriving metadata for photos.  The essential wisdom is annotating on the spot and wherever possible doing so automatically.  The consumer is very unlikely to go annotate  photos after the fact.  Further, one can infer that photos made with the same camera around the same time are from the same location.  There are other such heuristics.  In this use case, the end user does not need to see triples.  There is some benefit though in using commonly used geographical terminology for linking to other data sources.&lt;/p&gt;

&lt;h3&gt;&amp;quot;How will one develop applications?&amp;quot;&lt;/h3&gt;

&lt;p&gt;I&amp;#39;d say one will develop them much the same way as thus far.  In &lt;a href=&quot;http://dbpedia.org/resource/PHP&quot; id=&quot;link-id0x207fca00&quot;&gt;PHP&lt;/a&gt;, for example.  Whether one&amp;#39;s query language is &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x20a5fde0&quot;&gt;SPARQL&lt;/a&gt; or &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x1a0bb5e0&quot;&gt;SQL&lt;/a&gt; does not make a large difference in how basic web UI is made.&lt;/p&gt;

&lt;p&gt;A SPARQL end-point is no more an end-user item than a SQL command-line is.&lt;/p&gt;

&lt;p&gt;A common mistake among techies is that they think the data structure and user experience can or ought to be of the same structure.  The UI dialogs do not, for example, have to have a 1:1 correspondence with SQL tables.&lt;/p&gt;

&lt;p&gt;The idea of generating UI from data, whether relational or data-web, is so seductive that generation upon generation of developers fall for it, repeatedly.  Even I, at OpenLink, after supposedly having been around the block a couple of times made some experiments around the topic.  What does make sense is putting a thin wrapper or HTML around the application, using XSLT and such for formatting.  Since the model does allow for unforeseen properties of data, one can build a viewer for these alongside the regular forms.  For this, Ajax technologies like &lt;a href=&quot;http://oat.openlinksw.com/&quot; id=&quot;link-id0x1e91d118&quot;&gt;OAT&lt;/a&gt; (the &lt;a href=&quot;http://oat.openlinksw.com/&quot; id=&quot;link-id0x174b7950&quot;&gt;OpenLink AJAX Toolkit&lt;/a&gt;) will be good.&lt;/p&gt;

&lt;p&gt;The UI ought not to completely hide the URIs of the data from the user.  It should offer a drill down to faceted views of the triples for example.  Remember when Xerox talked about graphical user interfaces in 1980? &amp;quot;Don&amp;#39;t mode me in&amp;quot; was the slogan, as I recall.&lt;/p&gt;

&lt;p&gt;Since then, we have vacillated between modal and non-modal interaction models.  Repetitive workflows like order entry go best modally and are anyway being replaced by web services.  Also workflows that are very infrequent benefit from modality; take personal network setup wizards, for example.  But enabling the &lt;a href=&quot;http://dbpedia.org/resource/Knowledge&quot; id=&quot;link-id0x1ea14610&quot;&gt;knowledge&lt;/a&gt; worker is a domain that by its nature must retain some respect for human intelligence and not kill this by denying access to the underlying data, including provenance and URIs.  Face it: the world is not getting simpler.  It is increasingly data dependent and when this is so, having semantics and flexibility of access for the data is important.&lt;/p&gt;

&lt;p&gt;For a real-time task-oriented user interface like a fighter plane cockpit, one will not show URIs unless specifically requested.  For planning fighter sorties though, there is some potential benefit in having all data such as friendly and hostile assets, geography, organizational structure, etc., as &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id0x207bcd20&quot;&gt;linked data&lt;/a&gt;.  It makes for more flexible querying.  Linked data does not &lt;i&gt;per se&lt;/i&gt; mean open, so one can be joinable with open data through using the same identifiers even while maintaining arbitrary levels of security and compartmentalization.&lt;/p&gt;

&lt;p&gt;For automating tasks that every time involve the same data and queries, RDF has no intrinsic superiority.  Thus the user interfaces in places where RDF will have real edge must be more capable of &lt;i&gt;ad hoc&lt;/i&gt; viewing and navigation than regular real-time or line of business user interfaces.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;http://ode.openlinksw.com/&quot; id=&quot;link-id0x2083a6f0&quot;&gt;OpenLink Data Explorer&lt;/a&gt; idea of a &amp;quot;data behind the web page&amp;quot; view goes in this direction. Read the web as before, then hit a switch to go to the data view.  There are and will be separate clarifications and demos about this.&lt;/p&gt;

&lt;h3&gt;&amp;quot;What of the proliferation of standards?  Does this not look too tangled, no clear identity?  How would one know where to begin?&amp;quot;&lt;/h3&gt;

&lt;p&gt;When &lt;a href=&quot;http://www.w3.org/2001/sw/sweo/&quot; id=&quot;link-id0x1e8eac68&quot;&gt;SWEO&lt;/a&gt; was beginning, there was an endlessly protracted discussion of the so-called layer cake. This acronym jungle is not good messaging. Just say linked, flexibly repurpose-able data, and rich vocabularies and structure.  Just the right amount of structure for the application, less rigid and easier to change than relational.&lt;/p&gt;

&lt;p&gt;Do not even mention the different serialization formats.  Just say that it fits on top of the accepted web infrastructure: &lt;a href=&quot;http://dbpedia.org/resource/Hypertext_Transfer_Protocol&quot; id=&quot;link-id0x1e3806b8&quot;&gt;HTTP&lt;/a&gt;, URIs, and &lt;a href=&quot;http://dbpedia.org/resource/XML&quot; id=&quot;link-id0x1f547288&quot;&gt;XML&lt;/a&gt; where desired.&lt;/p&gt;

&lt;p&gt;It is misleading to say inference is a box at some specific place in the diagram.  Inference of different types may or may not take place at diverse points, whether presentation or storage, on demand or as a preprocessing step.  Since there is structure and semantics, inference is possible if desired.&lt;/p&gt;

&lt;h3&gt;&amp;quot;Can I make a social network application in RDF only, with no &lt;a href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id0x20553ee0&quot;&gt;RDBMS&lt;/a&gt;?&amp;quot;&lt;/h3&gt;

&lt;p&gt;Yes, in principle, but what do you have in mind?  The answer is very context dependent.  The person posing the question had an E-learning system in mind, with things such as course catalogues, course material, etc.  In such a case, RDF is a great match, especially since the user count will not be in the millions.  No university has that many students and anyway they do not hang online browsing the course catalogue.&lt;/p&gt;

&lt;p&gt;On the other hand, if I think of making a social network site with RDF as the exclusive data model, I see things that would be very inefficient. For example, keeping a count of logins or the last time of login would be by default several times less efficient than with a RDBMS.&lt;/p&gt;

&lt;p&gt;If some application is really large scale and has a knowable workload profile, like any social network does, then some task-specific data structure is simply economical.  This does not mean that the application language cannot be SPARQL but this means that the storage format must be tuned to favor some operations over others, relational style.  This is a matter of cost more than of feasibility.  Ten servers cost less than a hundred and have failures ten times less frequently.&lt;/p&gt;
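
&lt;p&gt;To make the cost argument concrete, here is a minimal sketch of a task-specific relational structure for the login bookkeeping mentioned above. The table and column names are invented for illustration and do not come from any actual application.&lt;/p&gt;

&lt;pre&gt;-- Hypothetical table: one row per user holds both the counter and the timestamp.
create table user_login (
  ul_user       varchar,
  ul_count      int,
  ul_last_login datetime,
  primary key (ul_user));

-- A login updates a single row in place.
update user_login
  set ul_count = ul_count + 1,
      ul_last_login = now ()
  where ul_user = &amp;#39;alice&amp;#39;;
&lt;/pre&gt;

&lt;p&gt;With triples only, the same login would typically mean removing and re-adding two statements (the count and the timestamp), each of which is maintained in every index of the triple store. This is one plausible source of the several-fold overhead.&lt;/p&gt;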

&lt;p&gt;In the near term we will see the birth of an application paradigm for the data web. The data will be open, exposed, first-class citizen; yet the user experience will not have to be in a 1:1 image of the data.&lt;/p&gt;</description></item><item><title>OpenLink Software&#39;s Virtuoso Submission to the Billion Triples Challenge</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-09-30#1446</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1446#comments</comments><pubDate>Tue, 30 Sep 2008 16:24:34 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-10-03T06:20:48.000094-04:00</n0:modified><description>&lt;div&gt;
&lt;h2&gt;Introduction&lt;/h2&gt; 

&lt;p&gt;We use &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0xb03e418&quot;&gt;Virtuoso&lt;/a&gt; 6 Cluster Edition to demonstrate the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Text and structured &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id0xbd9dae8&quot;&gt;information&lt;/a&gt; based lookups&lt;/li&gt;
&lt;li&gt;Analytics queries&lt;/li&gt;
&lt;li&gt;Analysis of co-occurrence of features like interests and tags.&lt;/li&gt;
&lt;li&gt;Dealing with identity of multiple IRIs (&lt;a href=&quot;http://dbpedia.org/resource/Web_Ontology_Language&quot; id=&quot;link-id0xb383dd8&quot;&gt;owl&lt;/a&gt;:sameAs)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The demo is based on a set of canned &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0xbda6298&quot;&gt;SPARQL&lt;/a&gt; queries that can be invoked using the &lt;a href=&quot;http://ode.openlinksw.com/&quot; id=&quot;link-id0xbb292f0&quot;&gt;OpenLink Data Explorer&lt;/a&gt; (&lt;a href=&quot;http://ode.openlinksw.com/&quot; id=&quot;link-id0xc263528&quot;&gt;ODE&lt;/a&gt;) Firefox extension.&lt;/p&gt;
&lt;p&gt;The demo queries can also be run directly against the SPARQL end point.&lt;/p&gt;

&lt;p&gt;The demo is being worked on at the time of submission and may be shown online by appointment.&lt;/p&gt;

&lt;p&gt;Automatic annotation of the &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0xa173378&quot;&gt;data&lt;/a&gt; based on &lt;a href=&quot;http://dbpedia.org/resource/Named_entity_recognition&quot; id=&quot;link-id0xbdda558&quot;&gt;named entity extraction&lt;/a&gt; is
being worked on at the time of this submission.  By the time of ISWC
2008 the set of sample queries will be enhanced with queries based on
extracted &lt;a href=&quot;http://dbpedia.org/resource/Named_entity_recognition&quot; id=&quot;link-id0xa66fbe0&quot;&gt;named entities&lt;/a&gt; and their relationships in the &lt;a href=&quot;http://umbel.org/about/&quot; id=&quot;link-id0xa06e2c8&quot;&gt;UMBEL&lt;/a&gt; and Open
CYC ontologies.
&lt;/p&gt;

&lt;p&gt;Also examples involving owl:sameAs are being added, likewise  with similarity metrics and search hit scores.&lt;/p&gt;

&lt;h2&gt;The Data&lt;/h2&gt;

&lt;p&gt;The database consists of the billion triples data sets and some additions like UMBEL.  Also, the Freebase extract is newer than the challenge original.&lt;/p&gt;
&lt;p&gt;The triple count is 1115 million.&lt;/p&gt;
&lt;p&gt;In the case of web harvested resources, the data is loaded in one graph per resource.&lt;/p&gt;
&lt;p&gt;In the case of larger data sets like &lt;a href=&quot;http://dbpedia.org/resource/DBpedia&quot; id=&quot;link-id0xc2bf770&quot;&gt;DBpedia&lt;/a&gt; or the US census, all triples of the same provenance share a data-set-specific graph.&lt;/p&gt;
&lt;p&gt;All string literals are additionally indexed in a full text index.  No stop words are used.&lt;/p&gt;

&lt;p&gt;Most queries do not specify a graph.  Thus they are evaluated against the union of all the graphs in the database.
The indexing scheme is SPOG, GPOS, POGS, OPGS.  All indices ending in S are bitmap indices.
&lt;/p&gt;
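
&lt;p&gt;Since each larger data set sits in its own graph, a query can be narrowed to one source by naming that graph. The following is only a sketch: the graph IRI http://dbpedia.org is an assumption made for illustration, as the graph names used in this load are not listed here.&lt;/p&gt;

&lt;pre&gt;sparql 
select ?s ?o 
from &amp;lt;http://dbpedia.org&amp;gt; 
where 
  {
    ?s rdfs:label ?o . 
    filter (bif:contains (?o, &amp;quot;&amp;#39;semantic web&amp;#39;&amp;quot;)) 
  } 
limit 10
;
&lt;/pre&gt;

&lt;p&gt;Leaving out the from clause gives the default behavior described above, with the pattern evaluated against the union of all graphs.&lt;/p&gt;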

&lt;h2&gt;The Queries &lt;/h2&gt;


&lt;p&gt;The demo uses Virtuoso SPARQL extensions  in most queries.  These
extensions consist on one hand of well known &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0xaf8cb40&quot;&gt;SQL&lt;/a&gt; features like
aggregation with grouping and existence and value subqueries and on
the other of &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0xafdceb8&quot;&gt;RDF&lt;/a&gt; specific features.
The latter include  run time RDFS and OWL inferencing support  and backward
chaining subclasses and transitivity.  
&lt;/p&gt;


&lt;h3&gt;Simple Lookups&lt;/h3&gt; 

&lt;pre&gt;sparql 
select ?s ?p (bif:search_excerpt (bif:vector (&amp;#39;semantic&amp;#39;, &amp;#39;web&amp;#39;), ?o)) 
where 
  {
    ?s ?p ?o . 
    filter (bif:contains (?o, &amp;quot;&amp;#39;semantic web&amp;#39;&amp;quot;)) 
  } 
limit 10
;
&lt;/pre&gt;

&lt;p&gt;This looks up triples with semantic web in the object and makes a search hit summary of the literal, 
highlighting the search terms.
&lt;/p&gt;

&lt;pre&gt;sparql 
select ?tp count(*) 
where 
  { 
    ?s ?p2 ?o2 . 
    ?o2 a ?tp . 
    ?s foaf:nick ?o . 
    filter (bif:contains (?o, &amp;quot;plaid_skirt&amp;quot;)) 
  } 
group by ?tp
order by desc 2
limit 40
;
&lt;/pre&gt;

&lt;p&gt;This looks at what sorts of things are referenced by the properties of the foaf handle plaid_skirt.&lt;/p&gt;
&lt;p&gt;What are these things called?&lt;/p&gt;

&lt;pre&gt;sparql 
select ?lbl count(*) 
where 
  { 
    ?s ?p2 ?o2 . 
    ?o2 rdfs:label ?lbl . 
    ?s foaf:nick ?o . 
    filter (bif:contains (?o, &amp;quot;plaid_skirt&amp;quot;)) 
  } 
group by ?lbl
order by desc 2
;
&lt;/pre&gt;

&lt;p&gt;Many of these things do not have an rdfs:label.  Let us use a more general concept of label 
which groups dc:title, foaf:name, and other name-like properties together.  The subproperties are 
resolved at run time; there is no materialization.
&lt;/p&gt;

&lt;pre&gt;sparql 
define input:inference &amp;#39;b3s&amp;#39;
select ?lbl count(*) 
where 
  { 
    ?s ?p2 ?o2 . 
    ?o2 b3s:label ?lbl . 
    ?s foaf:nick ?o . 
    filter (bif:contains (?o, &amp;quot;plaid_skirt&amp;quot;)) 
  } 
group by ?lbl
order by desc 2
;
&lt;/pre&gt;
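
&lt;p&gt;For readers wondering where an inference context such as &amp;#39;b3s&amp;#39; comes from, the following sketch shows one way such a rule set could be declared. The namespace and graph IRIs under example.org are placeholders, not the ones used in the demo, and the list of subproperties is only a guess based on the description above.&lt;/p&gt;

&lt;pre&gt;-- Hypothetical schema graph: declare name-like properties as subproperties of b3s:label.
sparql 
prefix b3s: &amp;lt;http://example.org/b3s#&amp;gt; 
prefix dc: &amp;lt;http://purl.org/dc/elements/1.1/&amp;gt; 
prefix foaf: &amp;lt;http://xmlns.com/foaf/0.1/&amp;gt; 
insert into graph &amp;lt;http://example.org/b3s-rules&amp;gt; 
  { 
    rdfs:label rdfs:subPropertyOf b3s:label . 
    dc:title rdfs:subPropertyOf b3s:label . 
    foaf:name rdfs:subPropertyOf b3s:label . 
  } 
;

-- Bind the schema graph to the rule set name used in define input:inference.
rdfs_rule_set (&amp;#39;b3s&amp;#39;, &amp;#39;http://example.org/b3s-rules&amp;#39;);
&lt;/pre&gt;

&lt;p&gt;With a rule set like this in place, a pattern on b3s:label also matches triples stated with any of the listed subproperties, without copying those triples.&lt;/p&gt;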

&lt;p&gt;We can list sources by the topics they contain.  
Below we look for graphs that mention terrorist bombing.
&lt;/p&gt;

&lt;pre&gt;sparql 
select ?g count(*) 
where 
  { 
    graph ?g 
      {
        ?s ?p ?o . 
        filter (bif:contains (?o, &amp;quot;&amp;#39;terrorist bombing&amp;#39;&amp;quot;)) 
      }
  } 
group by ?g 
order by desc 2
;
&lt;/pre&gt;

&lt;p&gt;Now some web 2.0 tagging of search results.  The &lt;a href=&quot;http://dbpedia.org/resource/Tag&quot; id=&quot;link-id0xa8b89f8&quot;&gt;tag&lt;/a&gt; cloud of &amp;quot;computer&amp;quot;:&lt;/p&gt;

&lt;pre&gt;sparql 
select ?lbl count (*) 
where 
  { 
    ?s ?p ?o . 
    ?o bif:contains &amp;quot;computer&amp;quot; . 
    ?s sioc:topic ?tg .
    optional 
      {
        ?tg rdfs:label ?lbl
      }
  }
group by ?lbl 
order by desc 2 
limit 40
;
&lt;/pre&gt;

&lt;p&gt;This query will find the posters who talk the most about sex.&lt;/p&gt;

&lt;pre&gt;sparql 
select ?auth count (*) 
where 
  { 
    ?d dc:creator ?auth .
    ?d ?p ?o
    filter (bif:contains (?o, &amp;quot;sex&amp;quot;)) 
  } 
group by ?auth
order by desc 2
;
&lt;/pre&gt;

&lt;h3&gt;Analytics &lt;/h3&gt;

&lt;p&gt;We look for people who are joined by having relatively uncommon interests but do not know each other.&lt;/p&gt;

&lt;pre&gt;sparql select ?i ?cnt ?n1 ?n2 ?p1 ?p2 
where 
  {
    {
      select ?i count (*) as ?cnt 
      where 
        { ?p foaf:interest ?i } 
      group by ?i
    }
    filter ( ?cnt &amp;gt; 1 &amp;amp;&amp;amp; ?cnt &amp;lt; 10) .
    ?p1 foaf:interest ?i .
    ?p2 foaf:interest ?i .
    filter  (?p1 != ?p2 &amp;amp;&amp;amp; 
             !bif:exists ((select (1) where {?p1 foaf:knows ?p2 })) &amp;amp;&amp;amp; 
             !bif:exists ((select (1) where {?p2 foaf:knows ?p1 }))) .
    ?p1 foaf:nick ?n1 .
    ?p2 foaf:nick ?n2 .
  } 
order by ?cnt 
limit 50
;
&lt;/pre&gt;

&lt;p&gt;The query takes a fairly long time, mostly spent counting the people per interest across 25M interest triples.  
It then takes people that share the interest and checks that neither claims to know the other.  
It then sorts the results, rarest interest first.  The query can be written more efficiently but is 
here just to show that database-wide scans of the population are possible ad hoc.
&lt;/p&gt;

&lt;p&gt;Now we go to SQL to make a tag co-occurrence matrix. This can be used for showing a Technorati-style
related tags line at the bottom of a search result page.  This showcases the use of SQL together 
with SPARQL.  The half-matrix of tags t1, t2 with the co-occurrence count at the intersection is 
much more efficiently done in SQL, especially since it gets updated as the data changes.  
This is an example of materialized intermediate results based on warehoused RDF.
&lt;/p&gt;

&lt;pre&gt;create table 
tag_count (tcn_tag iri_id_8, 
           tcn_count int, 
           primary key (tcn_tag));
           
alter index 
tag_count on tag_count partition (tcn_tag int (0hexffff00));

create table 
tag_coincidence (tc_t1 iri_id_8, 
                 tc_t2 iri_id_8, 
                 tc_count int, 
                 tc_t1_count int, 
                 tc_t2_count int, 
                 primary key  (tc_t1, tc_t2));

alter index 
tag_coincidence on tag_coincidence partition (tc_t1 int (0hexffff00));

create index 
tc2 on tag_coincidence (tc_t2, tc_t1) partition (tc_t2 int (0hexffff00));
&lt;/pre&gt;

&lt;p&gt;How many times is each topic mentioned?&lt;/p&gt;

&lt;pre&gt;
insert into tag_count 
  select * 
    from (sparql define output:valmode &amp;quot;LONG&amp;quot; 
                 select ?t count (*) as ?cnt 
                 where 
                   {
                     ?s sioc:topic ?t
                   } 
                 group by ?t) 
    xx option (quietcast);
&lt;/pre&gt;

&lt;p&gt;Take all t1, t2 where t1 and t2 are tags of the same subject, store only the permutation where the internal id of t1 &amp;lt; that of t2.&lt;/p&gt;

&lt;pre&gt;insert into tag_coincidence  (tc_t1, tc_t2, tc_count)
  select &amp;quot;t1&amp;quot;, &amp;quot;t2&amp;quot;, cnt 
    from 
      (select  &amp;quot;t1&amp;quot;, &amp;quot;t2&amp;quot;, count (*) as cnt 
         from 
           (sparql define output:valmode &amp;quot;LONG&amp;quot;
                   select ?t1 ?t2 
                     where 
                       {
                         ?s sioc:topic ?t1 . 
                         ?s sioc:topic ?t2 
                       }) tags
         where &amp;quot;t1&amp;quot; &amp;lt; &amp;quot;t2&amp;quot; 
         group by &amp;quot;t1&amp;quot;, &amp;quot;t2&amp;quot;) xx
    where isiri_id (&amp;quot;t1&amp;quot;) and 
          isiri_id (&amp;quot;t2&amp;quot;) 
    option (quietcast); 
&lt;/pre&gt;

&lt;p&gt;Now put the individual occurrence counts into the same table with the co-occurrence.  This 
denormalization makes the related tags lookup faster.
&lt;/p&gt;


&lt;pre&gt;update tag_coincidence 
  set tc_t1_count = (select tcn_count from tag_count where tcn_tag = tc_t1),
      tc_t2_count = (select tcn_count from tag_count where tcn_tag = tc_t2);
&lt;/pre&gt;

&lt;p&gt;Now each tag_coincidence row has the joint occurrence count and individual occurrence counts.  
A single select will return a Technorati-style related tags listing.
&lt;/p&gt;

&lt;p&gt;To show the &lt;a href=&quot;http://dbpedia.org/resource/Uniform_Resource_Identifier&quot; id=&quot;link-id0x9d4bc60&quot;&gt;URI&lt;/a&gt;&amp;#39;s of the tags:
&lt;/p&gt;

&lt;pre&gt;select top 10 id_to_iri (tc_t1), id_to_iri (tc_t2), tc_count 
  from tag_coincidence 
  order by tc_count desc;
&lt;/pre&gt;
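
&lt;p&gt;Because only the half-matrix is stored, the related tags of one given tag sit partly in rows where it is tc_t1 and partly in rows where it is tc_t2; the second half is what the tc2 index serves.  A minimal sketch of such a lookup follows.  The tag IRI is a made-up placeholder, and we assume that iri_to_id returns an internal id comparable with the stored iri_id_8 values.&lt;/p&gt;

&lt;pre&gt;-- related tags of one tag, most frequent co-occurrence first
-- http://example.org/tag/semweb is a hypothetical tag IRI
select top 10 id_to_iri (related), cnt 
  from 
    (select tc_t2 as related, tc_count as cnt 
       from tag_coincidence 
       where tc_t1 = iri_to_id (&amp;#39;http://example.org/tag/semweb&amp;#39;) 
     union all 
     select tc_t1 as related, tc_count as cnt 
       from tag_coincidence 
       where tc_t2 = iri_to_id (&amp;#39;http://example.org/tag/semweb&amp;#39;)) rel 
  order by cnt desc;
&lt;/pre&gt;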

&lt;h3&gt;Social Networks &lt;/h3&gt;

&lt;p&gt;We look at what interests people have.&lt;/p&gt;

&lt;pre&gt;sparql 
select ?o ?cnt  
where 
  {
    {
      select ?o count (*) as ?cnt 
        where 
          {
            ?s foaf:interest ?o
          } 
        group by ?o
    } 
    filter (?cnt &amp;gt; 100) 
  } 
order by desc 2 
limit 100
;
&lt;/pre&gt;

&lt;p&gt;Now the same for the Harry Potter fans.&lt;/p&gt;

&lt;pre&gt;sparql 
select ?i2 count (*) 
where 
  { 
    ?p foaf:interest &amp;lt;http://www.livejournal.com/interests.bml?int=harry+potter&amp;gt; .
    ?p foaf:interest ?i2 
  } 
group by ?i2 
order by desc 2 
limit 20
;
&lt;/pre&gt;

&lt;p&gt;We see whether knows relations are symmetrical.  We return the top n people that others claim to know without being reciprocally known.&lt;/p&gt;

&lt;pre&gt;sparql 
select ?celeb count (*) 
where 
  { 
    ?claimant foaf:knows ?celeb . 
    filter (!bif:exists ((select (1) 
                          where 
                            {
                              ?celeb foaf:knows ?claimant 
                            }))) 
  } 
group by ?celeb 
order by desc 2 
limit 10
;
&lt;/pre&gt;

&lt;p&gt;We look for a well connected person to start from.&lt;/p&gt;

&lt;pre&gt;sparql 
select ?p count (*) 
where 
  {
    ?p foaf:knows ?k 
  } 
group by ?p 
order by desc 2 
limit 50
;
&lt;/pre&gt;

&lt;p&gt;We look for the most connected of the many online identities of Stefan Decker.&lt;/p&gt;

&lt;pre&gt;sparql 
select ?sd count (distinct ?xx) 
where 
  { 
    ?sd a foaf:Person . 
    ?sd ?name ?ns . 
    filter (bif:contains (?ns, &amp;quot;&amp;#39;Stefan Decker&amp;#39;&amp;quot;)) . 
    ?sd foaf:knows ?xx 
  } 
group by ?sd 
order by desc 2
;
&lt;/pre&gt;

&lt;p&gt;We count the transitive closure of Stefan Decker&amp;#39;s connections.&lt;/p&gt;

&lt;pre&gt;sparql 
select count (*) 
where 
  { 
    {
      select * 
      where 
        { 
          ?s foaf:knows ?o 
        }
    }
    option (transitive, t_distinct, t_in(?s), t_out(?o)) . 
    filter (?s = &amp;lt;mailto:stefan.decker@deri.org&amp;gt;)
  }
;
&lt;/pre&gt;

&lt;p&gt;Now we do the same while following owl:sameAs links.&lt;/p&gt;

&lt;pre&gt;sparql 
define input:same-as &amp;quot;yes&amp;quot;
select count (*) 
where 
  { 
    {
      select * 
      where 
        { 
          ?s foaf:knows ?o 
        }
    }
    option (transitive, t_distinct, t_in(?s), t_out(?o)) . 
    filter (?s = &amp;lt;mailto:stefan.decker@deri.org&amp;gt;)
  }
;
&lt;/pre&gt;

&lt;h2&gt;Demo System&lt;/h2&gt; 

&lt;p&gt;The system runs on Virtuoso 6 Cluster Edition.  The database is partitioned into 12 partitions, 
each served by a distinct server process. The system demonstrated hosts these 12 servers on 2 
machines, each with 2 x Xeon 5345 and 16GB memory and 4 SATA disks. For scaling, the processes 
and corresponding partitions can be spread over a larger number of machines.  If each ran on its 
own server with 16GB RAM, the whole data set could be served from memory. This is desirable for 
search engine or fast analytics applications. Most of the demonstrated queries run in memory on 
second invocation. The timing difference between first and second run is easily an order of 
magnitude.
&lt;/p&gt;
&lt;/div&gt;</description></item><item><title>BSBM With Triples and Mapped Relational Data</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-08-06#1410</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1410#comments</comments><pubDate>Wed, 06 Aug 2008 19:41:50 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-08-06T16:29:44.000003-04:00</n0:modified><description>&lt;div&gt;
&lt;div style=&quot;display:none;&quot;&gt;BSBM With Triples and Mapped Relational Data&lt;/div&gt;
&lt;p&gt;The special contribution of the &lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id10039db0&quot;&gt;Berlin SPARQL Benchmark&lt;/a&gt; (&lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id106b2538&quot;&gt;BSBM&lt;/a&gt;) to the &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id101a75f8&quot;&gt;RDF&lt;/a&gt; world is to raise the question of doing OLTP with &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0xae54170&quot;&gt;RDF&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Of course, here we immediately hit the question of comparisons with relational databases.  To this effect, &lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id0x1e847b08&quot;&gt;BSBM&lt;/a&gt; also specifies a relational schema and can generate the &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id1206c378&quot;&gt;data&lt;/a&gt; as either triples or &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id1667f040&quot;&gt;SQL&lt;/a&gt; inserts.&lt;/p&gt;

&lt;p&gt;The benchmark effectively simulates the case of exposing an existing &lt;a href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id10a93518&quot;&gt;RDBMS&lt;/a&gt; as RDF.  &lt;a href=&quot;http://www.openlinksw.com/dataspace/organization/openlink#this&quot; id=&quot;link-id13e46d80&quot;&gt;OpenLink Software&lt;/a&gt; calls this &lt;i&gt;RDF Views&lt;/i&gt;.  &lt;a href=&quot;http://dbpedia.org/resource/Oracle_Database&quot; id=&quot;link-id12027578&quot;&gt;Oracle&lt;/a&gt; is beginning to call this &lt;i&gt;semantic covers&lt;/i&gt;.  The &lt;a href=&quot;http://www.w3.org/2005/Incubator/rdb2rdf/&quot; id=&quot;link-id161dc678&quot;&gt;RDB2RDF XG&lt;/a&gt;, a W3C incubator group, has been active in this area since Spring, 2008.&lt;/p&gt;

&lt;h3&gt;But why an OLTP workload with RDF to begin with?&lt;/h3&gt;

&lt;p&gt;We believe this is relevant because RDF promises to be the interoperability factor between potentially all of traditional IS.  If &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x1e7119d8&quot;&gt;data&lt;/a&gt; is online for human consumption, it may be online via a &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id106a8908&quot;&gt;SPARQL&lt;/a&gt; end-point as well.  The economic justification will come from discoverability and from applications integrating multi-source structured data.  Online shopping is a fine use case.&lt;/p&gt;

&lt;p&gt;Warehousing all the world&amp;#39;s publishable data as RDF is not our first preference, nor would it be the publisher&amp;#39;s.  Considerations of duplicate infrastructure and maintenance are reason enough.  Consequently, we need to show that mapping can outperform an RDF warehouse, which is what we&amp;#39;ll do here.&lt;/p&gt;

&lt;h3&gt;What We Got &lt;/h3&gt;

&lt;p&gt;First, we found that &lt;a href=&quot;http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1400&quot; id=&quot;link-id150ea748&quot;&gt;making the query plan took much too long&lt;/a&gt; in proportion to the run time.  With BSBM this is an issue because the queries have lots of joins but access relatively little data.  So we made a faster compiler and along the way retouched the cost model a bit.&lt;/p&gt;

&lt;p&gt;But the really interesting part with BSBM is mapping relational data to RDF.  For us, BSBM is a great way of showing that mapping can outperform even the best triple store.  A relational row store is as good as unbeatable with the query mix.  And when there is a clear mapping, there is no reason the &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0xae5aff0&quot;&gt;SPARQL&lt;/a&gt; could not be directly translated.&lt;/p&gt;

&lt;p&gt;If Chris Bizer et al launched the mapping ship, we will be the ones to pilot it to harbor!&lt;/p&gt;

&lt;p&gt;We filled two &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id12dbdc70&quot;&gt;Virtuoso&lt;/a&gt; instances with a BSBM200000 data set, for 100M triples.  One was filled with physical triples; the other was filled with the equivalent relational data plus mapping to triples.  Performance figures are given in &amp;quot;query mixes per hour&amp;quot;.  (An update or follow-on to this post will provide elapsed times for each test run.)&lt;/p&gt;

&lt;p&gt;With the unmodified benchmark we got:&lt;/p&gt;
&lt;blockquote&gt;
&lt;table&gt;
&lt;tr&gt;
   &lt;td&gt;&lt;i&gt;Physical Triples:&lt;/i&gt;
   &lt;/td&gt;
    &lt;td&gt;&amp;#160;&amp;#160;&amp;#160;&lt;/td&gt;
    &lt;td&gt;1297 qmph&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr&gt;
   &lt;td&gt;&lt;i&gt;Mapped Triples:&lt;/i&gt;
   &lt;/td&gt;
    &lt;td&gt;&amp;#160;&amp;#160;&amp;#160;&lt;/td&gt;
   &lt;td&gt;&lt;b&gt;3144 qmph&lt;/b&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/blockquote&gt;
&lt;p&gt;In both cases, most of the time was spent on Q6, which looks for products with one of three words in the label.  We altered Q6  to use text index for the mapping, and altered the databases accordingly. (There is no such thing as an e-commerce site without a text index, so we are amply justified in making this change.)&lt;/p&gt;
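
&lt;p&gt;As a rough sketch of what a text-index form of such a query can look like on the triples side: we assume here that the label is held in rdfs:label, and the three search words are placeholders.  This is not the exact query text we ran; it only shows the same bif:contains predicate used for free-text matching elsewhere on this blog.&lt;/p&gt;

&lt;pre&gt;sparql 
select ?product ?label 
where 
  { 
    ?product rdfs:label ?label . 
    filter (bif:contains (?label, &amp;quot;word1 OR word2 OR word3&amp;quot;)) 
  } 
;
&lt;/pre&gt;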

&lt;p&gt;The following were measured on the second run of a 100 query mix series, single test driver, warm cache.&lt;/p&gt;
&lt;blockquote&gt;
&lt;table&gt;
&lt;tr&gt;
   &lt;td&gt;&lt;i&gt;Physical Triples:&lt;/i&gt;
   &lt;/td&gt;
    &lt;td&gt;&amp;#160;&amp;#160;&amp;#160;&lt;/td&gt;
    &lt;td&gt; 5746 qmph&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr&gt;
   &lt;td&gt;&lt;i&gt;Mapped Triples:&lt;/i&gt;
   &lt;/td&gt;
    &lt;td&gt;&amp;#160;&amp;#160;&amp;#160;&lt;/td&gt;
   &lt;td&gt; &lt;b&gt;7525 qmph&lt;/b&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/blockquote&gt;
&lt;p&gt;We then ran the same with 4 concurrent instances of the test driver. The qmph here is the 400 query mixes run in total (4 drivers x 100 mixes) divided by the longest run time in hours.&lt;/p&gt;
&lt;blockquote&gt;
&lt;table&gt;
&lt;tr&gt;
   &lt;td&gt;&lt;i&gt;Physical Triples:&lt;/i&gt;
   &lt;/td&gt;
    &lt;td&gt;&amp;#160;&amp;#160;&amp;#160;&lt;/td&gt;
    &lt;td&gt; 19459 qmph&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr&gt;
   &lt;td&gt;&lt;i&gt;Mapped Triples:&lt;/i&gt;
   &lt;/td&gt;
    &lt;td&gt;&amp;#160;&amp;#160;&amp;#160;&lt;/td&gt;
   &lt;td&gt; &lt;b&gt;24531 qmph&lt;/b&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/blockquote&gt;

&lt;p&gt;The system used was 64-bit Linux, 2GHz dual-Xeon 5130 (8 cores) with 8G RAM.  The concurrent throughputs are a little under 4 times the single thread throughput, which is normal for SMP due to memory contention.  The numbers do not evidence significant overhead from thread synchronization.&lt;/p&gt;
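
&lt;p&gt;The scaling can be read straight off the tables above.  As a quick check, dividing the 4-driver figures by the single-driver figures gives:&lt;/p&gt;

&lt;pre&gt;-- concurrent qmph / single driver qmph, figures taken from the tables above
select 19459.0 / 5746 as physical_scaling,   -- about 3.39
       24531.0 / 7525 as mapped_scaling;     -- about 3.26
&lt;/pre&gt;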

&lt;p&gt;The query compilation represents about 1/3 of total server side CPU. In an actual online application of this type, queries would be parameterized, so the throughputs would be accordingly higher.  We used the &lt;code&gt;StopCompilerWhenXOverRunTime = 1&lt;/code&gt; option here to cut needless compiler overhead, the queries being straightforward enough.&lt;/p&gt;

&lt;p&gt;We also see that the advantage of mapping can be further increased by more compiler optimizations, so we expect in the end mapping will lead RDF warehousing by a factor of 4 or so.&lt;/p&gt;

&lt;h3&gt;Suggestions for BSBM&lt;/h3&gt;

&lt;ul&gt;
 &lt;li&gt;
  &lt;p&gt;
    &lt;b&gt;Reporting Rules.&lt;/b&gt; The benchmark spec should specify a form for disclosure of test run data, TPC style.  This includes things like configuration parameters and exact text of queries.  There should be accepted variants of query text, as with the TPC.&lt;/p&gt;
 &lt;/li&gt;

&lt;li&gt;
  &lt;p&gt;
    &lt;b&gt;Multiuser operation.&lt;/b&gt;  The test driver should get a stream number as parameter, so that each client makes a different query sequence. Also, disk performance in this type of benchmark can only be reasonably assessed with a naturally parallel multiuser workload.&lt;/p&gt;
&lt;/li&gt;

&lt;li&gt;
  &lt;p&gt;
    &lt;b&gt;Add business intelligence.&lt;/b&gt;  SPARQL has aggregates now, at least with &lt;a href=&quot;http://jena.sourceforge.net/&quot; id=&quot;link-id11a25ac0&quot;&gt;Jena&lt;/a&gt; and &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0xb003180&quot;&gt;Virtuoso&lt;/a&gt;, so let&amp;#39;s use these.  The BSBM business intelligence metric should be a separate metric off the same data.  Adding synthetic sales figures would make more interesting queries possible.  For example, producing recommendations like &amp;quot;customers who bought this also bought xxx.&amp;quot;&lt;/p&gt;
&lt;/li&gt;

&lt;li&gt;
  &lt;p&gt;
    &lt;b&gt;For the SPARQL community&lt;/b&gt;, BSBM sends the message that one ought to support parameterized queries and stored procedures.  This would be a &lt;a href=&quot;http://www.w3.org/TR/rdf-sparql-protocol/&quot; id=&quot;link-id109e2448&quot;&gt;SPARQL protocol&lt;/a&gt; extension; the SPARUL syntax should also have a way of calling a procedure.  Something like &lt;code&gt;select proc (??, ??)&lt;/code&gt; would be enough, where &lt;code&gt;??&lt;/code&gt; is a parameter marker, like &lt;code&gt;?&lt;/code&gt; in &lt;a href=&quot;http://dbpedia.org/resource/Open_Database_Connectivity&quot; id=&quot;link-id13febf48&quot;&gt;ODBC&lt;/a&gt;/&lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id120416a8&quot;&gt;JDBC&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;

&lt;li&gt;
  &lt;p&gt;
    &lt;b&gt;Add transactions.&lt;/b&gt;  Especially if we are contrasting mapping vs. storing triples, having an update flow is relevant.  In practice, this could be done by having the test driver send web service requests for order entry and the SUT could implement these as updates to the triples or a mapped relational store.  This could use stored procedures or logic in an app server.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Comments on Query Mix&lt;/h3&gt;

&lt;p&gt;The time of most queries is less than linear to the scale factor.  Q6 is an exception if it is not implemented using a text index.  Without the text index, Q6 will inevitably come to dominate query time as the scale is increased, and thus will make the benchmark less relevant at larger scales.&lt;/p&gt;

&lt;h2&gt;Next&lt;/h2&gt;

&lt;p&gt;We include the sources of our RDF view definitions and other material for running BSBM with our forthcoming Virtuoso Open Source 5.0.8 release.  This also includes all the query optimization work done for BSBM.  This will be available in the coming days.&lt;/p&gt;
&lt;/div&gt;</description></item><item><title>ESWC 2008</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-06-09#1379</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1379#comments</comments><pubDate>Mon, 09 Jun 2008 14:02:16 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-06-11T13:15:33-04:00</n0:modified><description>&lt;div&gt;
&lt;div style=&quot;display:none;&quot;&gt;ESWC 2008&lt;/div&gt;
&lt;p&gt;Yrjänä Rankka and I attended &lt;a href=&quot;http://www.eswc2008.org/&quot; id=&quot;link-id10b7a038&quot;&gt;ESWC2008&lt;/a&gt; on behalf of OpenLink.&lt;/p&gt;
&lt;p&gt;We were invited at the last minute to give a &lt;a href=&quot;http://community.linkeddata.org/dataspace/organization/lod#this&quot; id=&quot;link-id105df758&quot;&gt;Linked Open Data&lt;/a&gt; talk at Paolo Bouquet&amp;#39;s Identity and Reference workshop. We also had a demo of &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id12eacca0&quot;&gt;SPARQL&lt;/a&gt; BI (&lt;a href=&quot;http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations/ESWC2008%20SPARQL%20BI%20OpenLink.ppt&quot; id=&quot;link-id10b43e58&quot;&gt;PPT&lt;/a&gt;; &lt;a href=&quot;http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations&quot; id=&quot;link-id1116d8f0&quot;&gt;other formats coming soon&lt;/a&gt;), our business intelligence extensions to &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x16c9bfc8&quot;&gt;SPARQL&lt;/a&gt;, as well as of joining between relational &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id10badc40&quot;&gt;data&lt;/a&gt; mapped to &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id108edaf8&quot;&gt;RDF&lt;/a&gt; and native &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x181a5ed8&quot;&gt;RDF&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x17e69910&quot;&gt;data&lt;/a&gt;. I also spoke at the social networks panel chaired by Harry Halpin.&lt;/p&gt;
&lt;p&gt;I have gathered a few impressions that I will share in the next few posts (&lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1375&quot; id=&quot;link-id107298e0&quot;&gt;1 - RDF Mapping&lt;/a&gt;, &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1376&quot; id=&quot;link-id10b3a530&quot;&gt;2 - DARQ&lt;/a&gt;, &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1377&quot; id=&quot;link-id107290e0&quot;&gt;3 - voiD&lt;/a&gt;, &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1378&quot; id=&quot;link-id1071a950&quot;&gt;4 - Paradigmata&lt;/a&gt;). &lt;i&gt;Caveat: This is not meant to be complete or impartial press coverage of the event but rather some quick comments on issues of personal/OpenLink interest. The fact that I do not mention something does not mean that it is unimportant.&lt;/i&gt;
&lt;/p&gt;
&lt;h2&gt;The voiD Graph&lt;/h2&gt;
&lt;p&gt;
  &lt;a href=&quot;http://community.linkeddata.org/dataspace/organization/lod#this&quot; id=&quot;link-id0x1a87f110&quot;&gt;Linked Open Data&lt;/a&gt; was well represented, with Chris Bizer, Tom Heath, ourselves and many others. The great advance for &lt;a href=&quot;http://community.linkeddata.org/dataspace/organization/lod#this&quot; id=&quot;link-id108f3c48&quot;&gt;LOD&lt;/a&gt; this time around is &lt;a href=&quot;http://community.linkeddata.org/MediaWiki/index.php?MetaLOD#Kick-off_meeting_at_ESWC08&quot; id=&quot;link-id10df9830&quot;&gt;voiD, the Vocabulary of Interlinked Datasets&lt;/a&gt;, a means to describe what in fact is inside the &lt;a href=&quot;http://community.linkeddata.org/dataspace/organization/lod#this&quot; id=&quot;link-id0x1a089980&quot;&gt;LOD&lt;/a&gt; cloud, how to join it with what and so forth. Big time important if there is to be a &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1377&quot; id=&quot;link-iddf74578&quot;&gt;web of federatable data sources&lt;/a&gt;, feeding directly into what we have been saying for a while about SPARQL end-point self-description and discovery. There is reasonable hope of having something by the date of &lt;a href=&quot;http://www.linkeddataplanet.com/&quot; id=&quot;link-id10dd0848&quot;&gt;Linked Data Planet&lt;/a&gt; in a couple of weeks.&lt;/p&gt;
&lt;h2&gt;Federating&lt;/h2&gt;
&lt;p&gt;Bastian Quilitz gave a talk about his &lt;a href=&quot;http://darq.sourceforge.net/&quot; id=&quot;link-id108746e8&quot;&gt;DARQ&lt;/a&gt;, a federated version of Jena&amp;#39;s ARQ.&lt;/p&gt;
&lt;p&gt;Something like &lt;a href=&quot;http://darq.sourceforge.net/&quot; id=&quot;link-id0x1a2d9860&quot;&gt;DARQ&lt;/a&gt;&amp;#39;s optimization statistics should make their way into the &lt;a href=&quot;http://www.w3.org/TR/rdf-sparql-protocol/&quot; id=&quot;link-id10992348&quot;&gt;SPARQL protocol&lt;/a&gt; as well as the voiD data set description.&lt;/p&gt;
&lt;p&gt;We really need federation but more on this in &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1376&quot; id=&quot;link-id1059d688&quot;&gt;a separate post&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
&lt;a href=&quot;http://xsparql.deri.ie/&quot; id=&quot;link-id10314308&quot;&gt;XSPARQL&lt;/a&gt;
&lt;/h2&gt;
&lt;p&gt;Axel Polleres et al had a paper about &lt;a href=&quot;http://xsparql.deri.ie/&quot; id=&quot;link-id0x1ad77490&quot;&gt;XSPARQL&lt;/a&gt;, a merge of &lt;a href=&quot;http://dbpedia.org/resource/XQuery&quot; id=&quot;link-id10b98e90&quot;&gt;XQuery&lt;/a&gt; and SPARQL. While visiting DERI a couple of weeks back and again at the conference, we talked about OpenLink implementing the spec. It is evident that the engines must be in the same process and not communicate via the &lt;a href=&quot;http://www.w3.org/TR/rdf-sparql-protocol/&quot; id=&quot;link-id0x17e75190&quot;&gt;SPARQL protocol&lt;/a&gt; for this to be practical. We could do this. We&amp;#39;ll have to see when.&lt;/p&gt;
&lt;p&gt;Politically, using &lt;a href=&quot;http://dbpedia.org/resource/XQuery&quot; id=&quot;link-id0x18a9bf10&quot;&gt;XQuery&lt;/a&gt; to give expressions and XML synthesis to SPARQL would be fitting. These things are needed anyhow, as surely as aggregation and sub-queries but the latter would not so readily come from XQuery. Some rapprochement between RDF and XML folks is desirable anyhow.&lt;/p&gt;
&lt;h2&gt;Panel: Will the Sem Web Rise to the Challenge of the Social Web?&lt;/h2&gt;
&lt;p&gt;The social web panel presented the question of whether the sem web was ready for prime time with data portability.&lt;/p&gt;
&lt;p&gt;The main thrust was expressed in Harry Halpin&amp;#39;s rousing closing words: &amp;quot;Men will fight in a battle and lose a battle for a cause they believe in. Even if the battle is lost, the cause may come back and prevail, this time changed and under a different name. Thus, there may well come to be something like our &lt;a href=&quot;http://dbpedia.org/resource/Semantic_Web&quot; id=&quot;link-id122f4da0&quot;&gt;semantic web&lt;/a&gt;, but it may not be the one we have worked all these years to build if we do not rise to the occasion before us right now.&amp;quot;&lt;/p&gt;
&lt;p&gt;So, how to do this? Dan Brickley asked the audience how many supported, or were aware of, the latest Web 2.0 things, such as &lt;a href=&quot;http://dbpedia.org/page/OAuth&quot; id=&quot;link-idf300bc0&quot;&gt;OAuth&lt;/a&gt; and &lt;a href=&quot;http://dbpedia.org/page/OpenID&quot; id=&quot;link-id10ce7a40&quot;&gt;OpenID&lt;/a&gt;. A few were. The general idea was that research (after all, this was a research event) should be more integrated and open to the world at large, not living at the &amp;quot;outdated pace&amp;quot; of a 3 year funding cycle. Stefan Decker of DERI acquiesced in principle. Of course there is impedance mismatch between specialization and interfacing with everything.&lt;/p&gt;
&lt;p&gt;I said that triples and vocabularies existed, that OpenLink had &lt;a href=&quot;http://dbpedia.org/resource/OpenLink_Data_Spaces&quot; id=&quot;link-id1210dbf8&quot;&gt;ODS&lt;/a&gt; (&lt;a href=&quot;http://dbpedia.org/resource/OpenLink_Data_Spaces&quot; id=&quot;link-id11076be8&quot;&gt;OpenLink Data Spaces&lt;/a&gt;, &lt;a href=&quot;http://community.linkeddata.org/&quot; id=&quot;link-id10d46710&quot;&gt;Community LinkedData&lt;/a&gt;) for managing one&amp;#39;s data-web presence, but that scale would be the next thing. Rather large scale even, with 100 gigatriples (Gtriples) reached before one even noticed. It takes a lot of PCs to host this, maybe $400K worth at today&amp;#39;s prices, without replication. Count 16G ram and a few cores per Gtriple so that one is not waiting for disk all the time.&lt;/p&gt;
&lt;p&gt;The tricks that Web 2.0 silos do with app-specific data structures and app-specific partitioning do not really work for RDF without compromising the whole point of smooth schema evolution and tolerance of ragged data.&lt;/p&gt;
&lt;p&gt;So, simple vocabularies, minimal inference, minimal blank nodes. Besides, note that the inference will have to be done at run time, not forward-chained at load time, if only because users will not agree on what sameAs and other declarations they want for their queries. Not to mention spam or malicious sameAs declarations!&lt;/p&gt;
&lt;p&gt;As always, there was the question of business models for the open data web and for semantic technologies in general. As we see it, &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id108b7688&quot;&gt;information&lt;/a&gt; overload is the factor driving the demand. Better contextuality will justify semantic technologies. Due to the large volumes and complex processing, a data-as-service model will arise. The data may be open, but its query infrastructure, cleaning, and keeping up-to-date, can be monetized as services.&lt;/p&gt;
&lt;h2&gt;Identity and Reference&lt;/h2&gt;
&lt;p&gt;For the identity and reference workshop, the ultimate question is metaphysical and has no single universal answer, even though people, ever since the dawn of time and earlier, have occupied themselves with the issue. Consequently, I started with the Genesis quote where Adam called things by &lt;i&gt;nominibus suis&lt;/i&gt;, off-hand implying that things would have some intrinsic ontologically-due names. This would be among the older references to the question, at least in widely known sources.&lt;/p&gt;
&lt;p&gt;For present purposes, the consensus seemed to be that what would be considered the same as something else depended entirely on the application. What was similar enough to warrant a sameAs for cooking purposes might not warrant a sameAs for chemistry. In fact, complete and exact sameness for URIs would be very rare. So, instead of making generic weak similarity assertions like similarTo or seeAlso, one would choose a set of strong sameAs assertions and have these in effect for query answering if they were appropriate to the granularity demanded by the application.&lt;/p&gt;
&lt;p&gt;Therefore sameAs is our permanent companion, and there will in time be malicious and spam sameAs. So, nothing much should be materialized on the basis of sameAs assertions in an &lt;a href=&quot;http://dbpedia.org/resource/Open_world_assumption&quot; id=&quot;link-id10c4dfd0&quot;&gt;open world&lt;/a&gt;. For an app-specific warehouse, sameAs can be resolved at load time.&lt;/p&gt;
&lt;p&gt;There was naturally some apparent tension between the Occam camp of &lt;a href=&quot;http://dbpedia.org/resource/Entity&quot; id=&quot;link-id105fd240&quot;&gt;entity&lt;/a&gt; name services and the LOD camp. I would say that the issue is more a perceived polarity than a real one. People will, inevitably, continue giving things names regardless of any centralized authority. Just look at natural language. But having a dictionary that is commonly accepted for established domains of discourse is immensely helpful.&lt;/p&gt;
&lt;h2&gt;CYC and NLP&lt;/h2&gt;
&lt;p&gt;The semantic search workshop was interesting, especially CYC&amp;#39;s presentation. CYC is, as it were, the grand old man of &lt;a href=&quot;http://dbpedia.org/resource/Knowledge&quot; id=&quot;link-id10568158&quot;&gt;knowledge&lt;/a&gt; representation. Over the long term, I would like to have support for the CYC inference language inside a database query processor. This would mostly be for repurposing the huge &lt;a href=&quot;http://dbpedia.org/resource/Knowledge&quot; id=&quot;link-id0x1acff9d0&quot;&gt;knowledge&lt;/a&gt; base for helping in search type queries. If it is for transactions or financial reporting, then queries will be &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id130a0a80&quot;&gt;SQL&lt;/a&gt; and make little or no use of any sort of inference. If it is for summarization or finding things, the opposite holds. For scaling, the issue is just making correct cardinality guesses for query planning, which is harder when inference is involved. We&amp;#39;ll see.&lt;/p&gt;
&lt;p&gt;I will also have a closer look at natural language one of these days, quite inevitably, since &lt;a href=&quot;http://zitgist.com/about/&quot; id=&quot;link-id10795828&quot;&gt;Zitgist&lt;/a&gt; (for example) is into &lt;a href=&quot;http://dbpedia.org/resource/Entity&quot; id=&quot;link-id0x18a12918&quot;&gt;entity&lt;/a&gt; disambiguation.&lt;/p&gt;
&lt;h2&gt;Scale&lt;/h2&gt;
&lt;p&gt;Garlic gave a talk about their Data Patrol and QDOS. We agree that storing the data for these as triples instead of 1000 or so constantly changing relational tables could well make the difference between next-to-unmanageable and efficiently adaptive.&lt;/p&gt;
&lt;p&gt;Garlic probably has the largest triple collection in constant online use to date. We will soon join them with our hosting of the whole LOD cloud and &lt;a href=&quot;http://sindice.org/&quot; id=&quot;link-id0x17f18a38&quot;&gt;Sindice&lt;/a&gt;/&lt;a href=&quot;http://zitgist.com/about/&quot; id=&quot;link-id0x184e9e90&quot;&gt;Zitgist&lt;/a&gt; as triples.&lt;/p&gt;
&lt;h2&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;There is a mood to deliver applications. Consequently, scale remains a central, even the principal topic. So for now we make bigger centrally-managed databases. At the next turn around the corner we will have to turn to federation. The point here is that a planetary-scale, centrally-managed, online system can be made when the workload is uniform and anticipatable, but if it is free-form queries and complex analysis, we have a problem. So we move in the direction of federating and charging based on usage whenever the workload is more complex than making simple lookups now and then.&lt;/p&gt;
&lt;p&gt;For the &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id1026ac28&quot;&gt;Virtuoso&lt;/a&gt; roadmap, this changes little. Next we make data sets available on Amazon EC2, as widely promised at ESWC. With big scale also comes rescaling and repartitioning, so this gets additional weight, as does further parallelizing of single user workloads. As it happens, the same medicine helps for both. At &lt;a href=&quot;http://www.linkeddataplanet.com/&quot; id=&quot;link-id0x17ff5c20&quot;&gt;Linked Data Planet&lt;/a&gt;, we will make more announcements.&lt;/p&gt;
&lt;/div&gt;</description></item><item><title>On Sem Web Search</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-04-29#1349</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1349#comments</comments><pubDate>Tue, 29 Apr 2008 14:37:21 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-10-02T11:37:12.000008-04:00</n0:modified><description>&lt;div&gt;
&lt;div style=&quot;display:none;&quot;&gt;On Sem Web Search&lt;/div&gt;
&lt;p&gt;
&lt;i&gt;&amp;quot;I give the search keywords and you give me a &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x1a603f18&quot;&gt;SPARQL&lt;/a&gt; end-point and a query that will get the &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x1bda5c00&quot;&gt;data&lt;/a&gt;.&amp;quot;&lt;/i&gt;
&lt;/p&gt;
&lt;p&gt;Thus did one SPARQL user describe the task of a semantic/data web search engine.&lt;/p&gt;
&lt;p&gt;In &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1336&quot; id=&quot;link-idff98750&quot;&gt;a previous post&lt;/a&gt;, I suggested that if the data web were the size of the document web, we&amp;#39;d be looking at two orders of magnitude more search complexity. It just might be so.&lt;/p&gt;
&lt;p&gt;In the conversation, I pointed out that a search engine might have a copy of everything and even a capability to do SPARQL and full text on it all, yet still the users would be better off doing the queries against the SPARQL end-points of the data publishers. It is a bit like the fact that not all web browsing runs off Google&amp;#39;s cache. With the data web, the point is even more pronounced, as serving a hit from Google&amp;#39;s cache is a small operation but a complex query might be a very large one.&lt;/p&gt;
&lt;p&gt;Yet, the data web is about ad-hoc joining between data sets of different origins. Thus a search engine of the data web ought to be capable of joining also, even if large queries ought to be run against individual publishers&amp;#39; end-points or the user&amp;#39;s own data warehouse.&lt;/p&gt;
&lt;p&gt;For ranking, the general consensus was that no single hit-ranking would be good for the data web. Thus word frequency-based hit-scores are OK for text hits, but what to use beyond that is not obvious. I would think that some link analysis could apply, but this will take some more experimentation.&lt;/p&gt;
&lt;p&gt;For search summaries, if we have splitting of data sets into small fragments &lt;i&gt;à la&lt;/i&gt; &lt;a href=&quot;http://sindice.com/&quot; id=&quot;link-id0x1d2b7288&quot;&gt;Sindice&lt;/a&gt;, search summaries are pretty much the same as with just text search. If we store triples, then we can give text style summaries of text hits in literals and Fresnel lens views of the structured data around the literal. For showing a page of hits, the lenses must abbreviate heavily but this is still feasible. The engine would know about the most common ontologies and summarize instance data accordingly.&lt;/p&gt;
&lt;p&gt;Chris Bizer pointed out that trust and provenance are critical, especially if an answer is arrived at by joining multiple data sets. The trust of the conclusion is no greater than that of the weakest participating document. Different users will have different trusted sources.&lt;/p&gt;
&lt;p&gt;A mature data web search engine would combine a provenance/trust specification, a search condition consisting of SPARQL or full text or both, and a specification for hit rank. Again, most searches would use defaults, but these three components should in principle be orthogonally specifiable.&lt;/p&gt;
&lt;p&gt;Many places may host the same data set either for download or SPARQL access. The &lt;a href=&quot;http://dbpedia.org/resource/Uniform_Resource_Identifier&quot; id=&quot;link-id0x1b2317d0&quot;&gt;URI&lt;/a&gt; of the data set is not its &lt;a href=&quot;http://dbpedia.org/resource/Uniform_Resource_Locator&quot; id=&quot;link-id0x1c55dd68&quot;&gt;URL&lt;/a&gt;. Different places may further host multiple data sets on one end-point. Thus the search engine ought to return all end-points where the set is to be found. The end-points themselves ought to be able to say what data sets they contain, under what graph IRIs. Since there is no consensus about end-point self description, this too would be left to the search engine. In practice, this could be accomplished by extending Sindice&amp;#39;s semantic site map specification. A possible query would be to find an end-point containing a set of named data sets. If none were found, the search engine itself could run a query joining all the sets since it at least would hold them all.&lt;/p&gt;
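&lt;p&gt;As a rough sketch of the end-point lookup described above, assume the search engine keeps a simple table from end-point URL to the set of data set URIs hosted there. The table layout, the &lt;code&gt;example.org&lt;/code&gt; addresses, the &lt;code&gt;urn:example:&lt;/code&gt; identifiers and the function name are all hypothetical. The query and its fallback could then look like this in Python:&lt;/p&gt;
&lt;pre&gt;
```python
# Hypothetical catalog: end-point URL mapped to the data set URIs it hosts.
# Data sets are identified by URI, never by the URL of a particular copy.
CATALOG = {
    "http://a.example.org/sparql": {"urn:example:wordnet", "urn:example:uniprot"},
    "http://b.example.org/sparql": {"urn:example:wordnet"},
}

# Stand-in for the search engine itself, which holds a copy of everything.
SEARCH_ENGINE = "http://search.example.org/sparql"


def endpoints_hosting(required, catalog=CATALOG):
    """Return the end-points that host every data set in `required`.

    If no single publisher end-point holds them all, fall back to the
    search engine, which can run the join over its own copies.
    """
    required = set(required)
    hits = sorted(url for url, hosted in catalog.items()
                  if required.issubset(hosted))
    return hits or [SEARCH_ENGINE]
```
&lt;/pre&gt;
&lt;p&gt;Asking for a data set that several places host returns all of them, which matches the idea that the engine should list every end-point where the set is to be found.&lt;/p&gt;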
&lt;p&gt;Since many places will host sets like Wordnet or Uniprot, indexing these once for each copy hardly makes sense. Thus a site should identify its data by the data set&amp;#39;s URI and not the copy&amp;#39;s URL.&lt;/p&gt;
&lt;p&gt;It came up in the discussion that search engines should share a ping format so that a single message format would be enough to notify any engine about data being updated. This is already partly the case with Sindice and &lt;a href=&quot;http://www.pingthesemanticweb.com/&quot; id=&quot;link-id0xa405ebd0&quot;&gt;PTSW&lt;/a&gt; (&lt;a href=&quot;http://www.pingthesemanticweb.com/&quot; id=&quot;link-id0x1c051a00&quot;&gt;PingTheSemanticWeb&lt;/a&gt;) sharing a ping format. &lt;/p&gt;
&lt;p&gt;Further, since it is no trouble to publish a copy of the 45G Uniprot file but a fair amount of work to index it, search engines should be smart about processing requests to index things, since these can amount to a denial of service attack. &lt;/p&gt;
&lt;p&gt;Probably very large data sets should be indexed only in the form supplied by their publisher, and others hosting copies would just state that they hold a copy. If the claim to the copy proved false, users could complain and the search engine administrator would remove the listing. It seems that some manual curating cannot be avoided here. &lt;/p&gt;
&lt;h2&gt;On Data Web Search Business Model&lt;/h2&gt;
&lt;p&gt;It seems there can be an overlap between the data web search and the data web hosting businesses. For example, Talis rents space for hosting &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x1a60c7e0&quot;&gt;RDF&lt;/a&gt; data with SPARQL access. A search engine should offer basic indexing of everything for free, but could charge either data publishers or end users for running SPARQL queries across data sets. These do not have the nicely anticipatable and fairly uniform resource consumption of text lookups. In this manner, a search provider could cost-justify the capacity for allowing arbitrary queries. &lt;/p&gt;
&lt;p&gt;The value of the data web consists of unexpected joining. Such joining takes place most efficiently if the sources are at least in some proximity, for example in the same data center. Thus the search provider could monetize functioning as the database provider for mesh-ups. In the document web, publishing pages is very simple and there is no great benefit from co-locating search and pages, rather the opposite. For the data web, the hosting with SPARQL and all is more complex and resembles providing search. Thus providing search can combine with providing SPARQL hosting, once we accept in principle that search should have arbitrary inter-document joining, even if it is at an extra premium.&lt;/p&gt;
&lt;p&gt;The present search business model is advertising. If the data web is to be accessed by automated agents such as mesh-up code, display of ads is not self-evident. This is quite separate from the fact that semantics can lead to better ad targeting.&lt;/p&gt;
&lt;p&gt;One model would be to do text lookups for free from a regular web page but show ads, just a la Google search ads. Using the service via web services for text or SPARQL would have a cost paid by the searching or publishing party and would not be financed by advertising.&lt;/p&gt;
&lt;p&gt;In the case of data used in value-add data products (mesh-ups) that have financial value to their users, the original publisher of the data could even be paid for keeping the data up-to-date. This would hold for any time-sensitive feeds like news or financial feeds. Thus the hosting/search provider would be a broker of data-use fees and the data producer would be in the position of an AdSense inventory owner, i.e., a web site which shows AdSense ads. Organizing this under a hub providing back-office functions similar to an ad network could make sense even if the actual processing were divided among many sites.&lt;/p&gt;
&lt;p&gt;Kingsley has repeatedly formulated the core value proposition of the &lt;a href=&quot;http://dbpedia.org/resource/Semantic_Web&quot; id=&quot;link-id0x3728a2f8&quot;&gt;semantic web&lt;/a&gt; in terms of dealing with &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id0x1bbcbeb8&quot;&gt;information&lt;/a&gt; overload: There is the real-time enterprise and the real-time individual and both are beasts of perception. Their image is won and lost in the &lt;a href=&quot;http://dbpedia.org/resource/Internet&quot; id=&quot;link-id0x1843b020&quot;&gt;Internet&lt;/a&gt; online conversation space. We know that allegations, even if later proven false, will stick if left unchallenged. The function of semantics on the web is to allow one to track and manage where one stands. In fact, Garlik has made a business of just this, but now from a privacy and security angle. The &lt;a href=&quot;http://www.garlik.com/&quot; id=&quot;link-id0x1aa76ab0&quot;&gt;Garlik DataPatrol&lt;/a&gt; harvests data from diverse sources and allows assessing vulnerability to identity theft, for example.&lt;/p&gt;
&lt;p&gt;If one is in the business of collating all the structured data in the world, as a data web search engine is, then providing custom alerts for both security and public image management is quite natural. This can be a very valuable service if it works well.&lt;/p&gt;
&lt;p&gt;At OpenLink, we will now experiment with the Sindice/&lt;a href=&quot;http://zitgist.com/about/&quot; id=&quot;link-id0x18800228&quot;&gt;Zitgist&lt;/a&gt;/PingTheSemanticWeb content. This is a regular part of the productization of &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x1adf39c8&quot;&gt;Virtuoso&lt;/a&gt;&amp;#39;s cluster edition. We expect to release some results in the next 4 weeks.&lt;/p&gt;
&lt;/div&gt;</description></item><item><title>RDBMS to RDF Mapping Workshop, and Benchmarks</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2007-11-21#1271</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1271#comments</comments><pubDate>Wed, 21 Nov 2007 13:07:03 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-04-25T16:29:53-04:00</n0:modified><description>&lt;div&gt;
&lt;div style=&quot;display:none;&quot;&gt;RDBMS to RDF Mapping Workshop, and Benchmarks&lt;/div&gt;
&lt;p&gt;I was recently in Boston for the &lt;a href=&quot;http://www.w3.org/2007/03/RdfRDB/&quot; id=&quot;link-id10f990b0&quot;&gt;Mapping Relational Data to RDF workshop&lt;/a&gt; of the W3C.&lt;/p&gt; 
&lt;p&gt;The common feeling was that mapping everything to &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x1c343278&quot;&gt;RDF&lt;/a&gt; and querying it in terms of a generic domain ontology, mapped on demand into whatever line of business systems, would be very good if it only could be done. However, since this is not so easily done, the next best is to extract the &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0xb6f01d0&quot;&gt;data&lt;/a&gt; and then warehouse it as RDF.&lt;/p&gt; 
&lt;p&gt;The obstacles perceived were of the following types:&lt;/p&gt; 
&lt;ul&gt;
 &lt;li&gt;
  &lt;p&gt;Lack of quality in the data. The different line of business systems do not in and of themselves hold enough semantics. If the meaning of data columns in relational tables were really known and explicit, these could be meaningfully used for joining across systems. But this is more complex than just mapping the metal &lt;i&gt;lead&lt;/i&gt; to the chemical symbol &lt;i&gt;Pb&lt;/i&gt; and back.&lt;/p&gt;
 &lt;/li&gt;
&lt;li&gt;
  &lt;p&gt;Lack of performance in RDF storage. Data sets even in the tens-of-millions of triples do not run very well in some stores. Well, we had the Banff life sciences demo with 450M triples in a small server box running &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x1ca1c488&quot;&gt;Virtuoso&lt;/a&gt;, so this is not universal, plus of course we are coming up with a whole different order of magnitude, as often discussed on this &lt;a href=&quot;http://dbpedia.org/resource/Blog&quot; id=&quot;link-id0xb4dc850&quot;&gt;blog&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
  &lt;p&gt;Lack of functionality in mapping and possibly lack of pushing through enough of the query processing to the underlying data stores.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Personally, I am quite aware of what to do with regard to performance of mapping and storage, and see these as eminently solvable issues. After all, we have a great investment of talent in databases in general and it can be well deployed towards RDF, as we have been doing these past couple of years. So we talk about the promise of a 360-degree view of &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id0x1ae64448&quot;&gt;information&lt;/a&gt;, with RDF being the top layer. Everybody agrees that this is a nice concept. But this is a nice concept especially when it can do the things that are the most common baseline expectation of any regular DBMS, i.e., aggregation, grouping, sub-queries, VIEWs. Now, I would not go sell a DBMS that has no &lt;code&gt;COUNT&lt;/code&gt; operator to a data warehousing shop.&lt;/p&gt; 
&lt;p&gt;The fact that OpenLink and &lt;a href=&quot;http://dbpedia.org/resource/Oracle_Database&quot; id=&quot;link-id0x1aa10fd8&quot;&gt;Oracle&lt;/a&gt; allow RDF inside &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0xa26d330&quot;&gt;SQL&lt;/a&gt;, and OpenLink even adds native aggregates and grouping to &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x1d81d990&quot;&gt;SPARQL&lt;/a&gt;, fixes the problem with regard to specific products, but leaves the standardization issue open. Of course, any vendor will solve these questions one way or another because a database with no aggregation is a non-starter.&lt;/p&gt; 
&lt;p&gt;I talked to Lee Feigenbaum, chair of the W3C DAWG, about the question of aggregates and general BI capabilities in SPARQL. He told me that, prior to his time with the DAWG, these were left out because they conflicted with the &lt;a href=&quot;http://dbpedia.org/resource/Open_world_assumption&quot; id=&quot;link-id0x1be3ab98&quot;&gt;open-world&lt;/a&gt; assumption around RDF: You cannot count a set because by definition you do not know that you have all the members, the world being open and all that.&lt;/p&gt; 
&lt;p&gt;Say what? Talk about the road to hell being paved with good intentions. Now, this is in no way Lee&amp;#39;s or the present day DAWG&amp;#39;s fault; as a member myself, I can attest to the good work and would under no circumstances wish any delays or revisions to SPARQL at this point. I am just pointing out a matter that all implementations should address, as a sort of precondition of entry into the real world IS space. If this can be done interoperably, so much the better.&lt;/p&gt; 
&lt;p&gt;Now, out of the deliberations at the Boston workshop arose at least two ideas for follow-up activity.&lt;/p&gt; 
&lt;p&gt;The first was an incubator group for RDF store and mapping benchmarking. This is very appropriate in order to dispel the bad name RDF storage and querying performance has been saddled with. As a first step in this direction, I will outline a &lt;a href=&quot;http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1269&quot; id=&quot;link-id10306200&quot;&gt;social web oriented benchmark&lt;/a&gt; on this blog.&lt;/p&gt; 
&lt;p&gt;The second activity was an &lt;a href=&quot;http://www.w3.org/2005/Incubator/rdb2rdf/&quot; id=&quot;link-id10150a58&quot;&gt;incubator group for preparing standardization of mapping methodologies from relational schemas to RDF&lt;/a&gt;. We will be active on this as well.&lt;/p&gt; 
&lt;p&gt;The two offshoots appear logically separate but are not necessarily so in practice. A benchmark is after all something that is supposed to promote a technology to a user base. The user base seems to wish to put all online systems and data warehouses under a common top level RDF model and then query away, introducing no further replication of data or performance cost or ETL latencies.&lt;/p&gt; 
&lt;p&gt;Updating would also be nice but even query only would be very good. Personally, I&amp;#39;d say the RDF strength is all on the query side. Transactions are taken care of well enough by what there already is; RDF stands out in integration and on the ad-hoc and discovery side of the matter. Given this, we expect the value to be consumed in a heterogeneous, multi-database, federated environment. Thus a benchmark should measure this aspect of the use-case. With the right mapping and queries, we could probably demonstrate the added cost of RDF to be very low, as long as we could push all queries that can be answered by a single source to the responsible DBMS. For distributed joins, we are back at the question of optimizing distributed queries but this is a familiar one and RDF is not the principal cost factor.&lt;/p&gt; 
&lt;p&gt;The subject does become quite complex at this point. We would have to take supposedly representative synthetic OLTP and BI data sets (like the ones in TPC-D, TPC-E, and &lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id0xb576e78&quot;&gt;TPC-H&lt;/a&gt;), and invent queries across them that would both make sense and be implementable in SPARQL extended with aggregates and sub-queries. Reliance on SPARQL extensions is simply unavoidable. Setting up the test systems would be non-trivial, even though there is a lot of industry experience in these matters on the database side.&lt;/p&gt; 
&lt;p&gt;So, while this is probably the benchmark most relevant to the target audience, we may have to start with a simpler one. I will next &lt;a href=&quot;http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1269&quot; id=&quot;link-id10fa7a50&quot;&gt;outline something to the effect&lt;/a&gt;.&lt;/p&gt; &lt;/div&gt;</description></item><item><title>Ideas on RDF Store Benchmarking</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2006-11-21#1086</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1086#comments</comments><pubDate>Tue, 21 Nov 2006 14:22:53 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-04-16T16:53:43-04:00</n0:modified><description>&lt;div&gt;
&lt;div style=&quot;display:none;&quot;&gt;Ideas on RDF Store Benchmarking&lt;/div&gt;
&lt;p&gt;This post presents some ideas and use cases for &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0xd4ebd48&quot;&gt;RDF&lt;/a&gt; store benchmarking.&lt;/p&gt;
&lt;h4&gt;Use Cases&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Basic triple storage and retrieval. The LUBM benchmark captures many aspects of this.&lt;/li&gt;
&lt;li&gt;Recursive rule application. The simpler cases of this are things like transitive closure.&lt;/li&gt;
&lt;li&gt;Mapping of relational &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0xdf3ec80&quot;&gt;data&lt;/a&gt; to RDF. Since relational benchmarks are well established, as in the TPC benchmarks, the schemas and test data generation can come from there. The problem is that the D/H/R benchmarks consist of aggregates and grouping exclusively but &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0xdba2bc0&quot;&gt;SPARQL&lt;/a&gt; does not have these. &lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Benchmarking Triple Stores&lt;/h4&gt;
&lt;p&gt;An RDF benchmark suite should meet the following criteria:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Have a single scale factor.&lt;/li&gt;
&lt;li&gt;Produce a single metric, queries per unit of time, for example. The metric should be concisely expressible, for example 10 qpsR at 100M, options 1, 2, 3. Due to the heterogeneous nature of the systems under test, the result&amp;#39;s short form likely needs to specify the metric, scale and options included in the test.&lt;/li&gt;
&lt;li&gt;Have optional parts, such as different degrees of inferencing and maybe language extensions such as full text, as this is a likely component of any social software.&lt;/li&gt;
&lt;li&gt;Have a specification for a full disclosure report, TPC style, even though we can skip the auditing part in the interest of making it easy for vendors to publish results and be listed.&lt;/li&gt;
&lt;li&gt;Have a subject domain where real data are readily available and which is broadly understood by the community. For example, SIOC data about on-line communities seems appropriate. Typical degree of connectedness, number of triples per person, etc. can be measured from real files.&lt;/li&gt;
&lt;li&gt;Have a diverse enough workload. This should include initial bulk load of data, some adding of triples during the run and continuous query load.&lt;/li&gt;&lt;/ul&gt;
&lt;p&gt;The query load should illustrate the following types of operations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Basic lookups, such as would be made for filling in a person&amp;#39;s home page in a social networks app. List data of user plus names and emails of friends. Relatively short joins, unions, and optionals.&lt;/li&gt;
&lt;li&gt;Graph operations like shortest path from individual to individual in a social network.&lt;/li&gt;
&lt;li&gt;Selecting data with drill down, as in faceted browsing. For example, start with articles having &lt;a href=&quot;http://dbpedia.org/resource/Tag&quot; id=&quot;link-id0x18a6cb78&quot;&gt;tag&lt;/a&gt; t, see distinct tags of articles with tag t, select another tag t2 to see the distinct tags of articles with both t and t2 and so forth.&lt;/li&gt;
&lt;li&gt;Retrieving all closely related nodes, as in composing a SIOC snapshot over a person&amp;#39;s posts in different communities, the recent activity report for a forum, etc. These will be &lt;code&gt;CONSTRUCT&lt;/code&gt; or &lt;code&gt;DESCRIBE&lt;/code&gt; queries. The coverage of &lt;code&gt;DESCRIBE&lt;/code&gt; is unclear, hence &lt;code&gt;CONSTRUCT&lt;/code&gt; may be better.&lt;/li&gt;
&lt;/ul&gt;
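&lt;p&gt;The drill-down operation in the list above can be sketched in a few lines of Python. This is only an in-memory illustration of the logic, assuming articles are held as a dictionary from article id to a set of tags; the sample data and the function name are made up, and a real benchmark would express the same thing as a SPARQL query against the store.&lt;/p&gt;
&lt;pre&gt;
```python
# Hypothetical sample data: article id mapped to its set of tags.
ARTICLES = {
    "a1": {"rdf", "sparql"},
    "a2": {"rdf", "sql"},
    "a3": {"rdf", "sparql", "sql"},
    "a4": {"lisp"},
}


def co_occurring_tags(articles, selected):
    """Distinct tags of the articles that carry every tag in `selected`.

    The selected tags themselves are left out, so the result is the list
    of tags the user could pick next to narrow the selection further.
    """
    selected = set(selected)
    found = set()
    for tags in articles.values():
        if selected.issubset(tags):
            found.update(tags)
    return found.difference(selected)


# One drill-down step after another: first tag t, then t and t2.
step1 = co_occurring_tags(ARTICLES, ["rdf"])
step2 = co_occurring_tags(ARTICLES, ["rdf", "sparql"])
```
&lt;/pre&gt;
&lt;p&gt;Each step shrinks the set of matching articles, which is why consecutive steps can differ widely in cost.&lt;/p&gt;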
&lt;p&gt;If we take an application like LinkedIn as a model, we can get a reasonable estimate of the relative frequency of different queries. For the queries per second metric, we can define the mix similarly to &lt;a href=&quot;http://dbpedia.org/resource/TPC-C&quot; id=&quot;link-id0xd238678&quot;&gt;TPC C&lt;/a&gt;. We count executions of the main query and divide by running time. Within this time, for every 10 executions of the main query there are varying numbers of executions of secondary queries, typically more complex ones.&lt;/p&gt;
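&lt;p&gt;To make the arithmetic concrete, here is a small Python sketch of the metric as described: executions of the main query divided by running time, with secondary queries in a fixed ratio per 10 main executions. The function names and the example mix are assumptions for illustration only; the actual mix would come from the agreed specification.&lt;/p&gt;
&lt;pre&gt;
```python
def queries_per_second(main_executions, elapsed_seconds):
    """Metric: executions of the main query divided by running time."""
    if elapsed_seconds == 0:
        raise ValueError("elapsed_seconds must be non-zero")
    return main_executions / elapsed_seconds


def expected_secondary_counts(main_executions, per_ten_main):
    """Secondary-query executions implied by a per-10-main-queries mix."""
    return {name: main_executions * count // 10
            for name, count in per_ten_main.items()}


def short_form(qps, scale, options):
    """Concise result line in the style '10 qpsR at 100M, options 1, 2, 3'."""
    listed = ", ".join(str(option) for option in options)
    return "{:g} qpsR at {}, options {}".format(qps, scale, listed)


# Hypothetical run: 36000 main queries in a one-hour measured interval,
# with an invented mix of secondary queries per 10 main queries.
qps = queries_per_second(36000, 3600)
mix = expected_secondary_counts(36000, {"shortest_path": 1, "drill_down": 3})
line = short_form(qps, "100M", [1, 2, 3])
```
&lt;/pre&gt;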
&lt;h4&gt;Full Disclosure Report&lt;/h4&gt;
&lt;p&gt;The report contains basic TPC-like items such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Metric qps/scale/options&lt;/li&gt;
&lt;li&gt;Software used, DBMS, RDF toolkit if separate&lt;/li&gt;
&lt;li&gt;Hardware.  Number, clock and type of CPUs per machine, number of machines in cluster, RAM per machine, disks per machine, manufacturer, price of hardware/software&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These can go into a summary spreadsheet that is just like the TPC ones.&lt;/p&gt;
&lt;p&gt;Additionally, the full report should include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Configuration files for DBMS, web server, other components.&lt;/li&gt;
&lt;li&gt;Parameters for test driver, i.e., number of clicks, how many concurrent clicks. The tester determines the degree of parallelism that gets the best throughput and should indicate this in the report. Making a graph of throughput as function of concurrent clients is a lot of work and maybe not necessary here.&lt;/li&gt;
&lt;li&gt;Duration in real time. Since for any large database with a few G of working set the warm up time is easily 30 minutes, the warm up time should be mentioned but not included in the metric. The measured interval should not be less than 1h in duration and should reflect a &amp;quot;steady state,&amp;quot; as defined in the TPC rules.&lt;/li&gt;
&lt;li&gt;Source code of server side application logic. This can be inference rules, stored procedures, dynamic web pages or any other server side software-like thing that exists or is modified for the purpose of the test.&lt;/li&gt;
&lt;li&gt;Specification of test driver. If there is a commonly used test driver, its type, parameters and version. If the test driver is custom, reference to its source code.&lt;/li&gt;
&lt;li&gt;Database sizes. For a preallocated database of n G, how much was free after the initial load, and how much after the test run? How many bytes per triple?&lt;/li&gt;
&lt;li&gt;CPU/IO. This may not always be readily measurable but is interesting still. Maybe a realistic spec is listing the sum of CPU minutes across all server machines and server processes. For IO, maybe the system totals from iostat before and after the full run, including load and warm-up. If the DBMS and RDF toolkits are separate, it is interesting to know the division of CPU time between them. &lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Test Drivers&lt;/h4&gt;
&lt;p&gt;OpenLink has a multithreaded C program that simulates n web users multiplexed over m threads. For example, 10000 users with 100 threads, each user with its own state, so that they carry out their respective usage patterns independently, getting served as soon as the server is available, still having no more than m requests going at any time. The usage pattern is something like go check the mail, browse the catalogue, add to shopping cart etc. This can be modified to browse a social network database and produce the desired query mix. This generates &lt;a href=&quot;http://dbpedia.org/resource/Hypertext_Transfer_Protocol&quot; id=&quot;link-id0xca840f8&quot;&gt;HTTP&lt;/a&gt; requests, hence would work against a SPARQL end point or any set of dynamic web pages.&lt;/p&gt;
&lt;p&gt;The program produces a running report of the clicks per second rate and statistics at the end, listing the min/avg/max times per operation.&lt;/p&gt;
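&lt;p&gt;The general shape of such a driver can be sketched in Python. This is not the OpenLink C program; it is a simplified illustration that bounds concurrency with a pool of m worker threads and reports min/avg/max per operation. The &lt;code&gt;fake_request&lt;/code&gt; function is a hypothetical stand-in for an HTTP request, and each simulated user runs its whole pattern on one worker rather than being interleaved step by step.&lt;/p&gt;
&lt;pre&gt;
```python
import time
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Usage pattern taken from the description above.
PATTERN = ["check_mail", "browse_catalogue", "add_to_cart"]


def fake_request(user_id, operation):
    """Hypothetical stand-in for one HTTP request; returns elapsed seconds."""
    started = time.perf_counter()
    time.sleep(0.001)  # pretend the server is doing some work
    return time.perf_counter() - started


def run_user(user_id, request=fake_request):
    """Run one simulated user through the pattern, keeping its own state."""
    state = {"user": user_id, "step": 0}
    timings = []
    for operation in PATTERN:
        timings.append((operation, request(user_id, operation)))
        state["step"] += 1
    return timings


def run_driver(n_users, m_threads, request=fake_request):
    """Simulate n users with at most m requests in flight at any time.

    Returns a dict mapping operation name to (min, avg, max) seconds.
    """
    per_operation = defaultdict(list)
    with ThreadPoolExecutor(max_workers=m_threads) as pool:
        results = pool.map(lambda u: run_user(u, request), range(n_users))
        for timings in results:
            for operation, seconds in timings:
                per_operation[operation].append(seconds)
    return {operation: (min(t), sum(t) / len(t), max(t))
            for operation, t in per_operation.items()}
```
&lt;/pre&gt;
&lt;p&gt;Passing a different &lt;code&gt;request&lt;/code&gt; function is how one would point the sketch at a SPARQL end point or a set of dynamic web pages.&lt;/p&gt;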
&lt;p&gt;This can be packaged as a separate open source download once the test spec is agreed upon.&lt;/p&gt;
&lt;p&gt;For generating test data, a modification of the LUBM generator is probably the most convenient choice.&lt;/p&gt;
&lt;h4&gt;Benchmarking Relational to RDF Mapping&lt;/h4&gt;
&lt;p&gt;This area is somewhat more complex than triple storage.&lt;/p&gt;
&lt;p&gt;At least the following factors enter into the evaluation:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Degree of SPARQL compliance. For example, can one have a variable as predicate? Are there limits on optionals and unions?&lt;/li&gt;
&lt;li&gt;Are the data being queried split over multiple &lt;a href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id0xcbb8688&quot;&gt;RDBMS&lt;/a&gt; and joined between them?&lt;/li&gt;
&lt;li&gt;Type of use case. Is this about navigational lookups or about statistics? OLTP or OLAP? It would be the former, as SPARQL does not really have aggregation. Still, many of the interesting queries are about comparing large data sets.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The rationale for mapping relational data to RDF is often data integration. Even in simple cases like the &lt;a href=&quot;http://dbpedia.org/resource/OpenLink_Data_Spaces&quot; id=&quot;link-id0xc75bee0&quot;&gt;OpenLink Data Spaces&lt;/a&gt; applications, a single SPARQL query will often result in a union of queries over distinct relational schemas, each somewhat similar but different in its details.&lt;/p&gt;
&lt;p&gt;A test for mapping should represent this aspect. Of course, translating a column into a predicate is easy and useful, especially when copying data. Still, the full power of mapping seems to involve a single query over disparate sources with disparate schemas.&lt;/p&gt;
&lt;p&gt;A real world case is OpenLink&amp;#39;s ongoing work for mapping &lt;a href=&quot;http://dbpedia.org/resource/WordPress&quot; id=&quot;link-id0x17f05bf0&quot;&gt;WordPress&lt;/a&gt;, &lt;a href=&quot;http://dbpedia.org/resource/MediaWiki&quot; id=&quot;link-id0xc949000&quot;&gt;Mediawiki&lt;/a&gt;, phpBB, &lt;a href=&quot;http://dbpedia.org/resource/Drupal&quot; id=&quot;link-id0xdb35e18&quot;&gt;Drupal&lt;/a&gt;, and possibly other popular web applications into SIOC.&lt;/p&gt;
&lt;p&gt;Using this as a benchmark might make sense because the source schemas are widely known, there is a lot of real world data in these systems, and the test driver might even be the same as with the above proposed triple store benchmark. The query mix might have to be somewhat tailored.&lt;/p&gt;
&lt;p&gt;Another &amp;quot;enterprise style&amp;quot; scenario might be to take the TPC-C and TPC-D databases (after all, both have products, customers and orders) and map them into a common ontology. Then there could be queries sometimes running on only one, sometimes joining both.&lt;/p&gt;
&lt;p&gt;Considering the times and the audience, the WordPress/Mediawiki scenario might be culturally more interesting and more fun to demo.&lt;/p&gt;
&lt;p&gt;The test has two aspects: Throughput and coverage. I think these should be measured separately.&lt;/p&gt;
&lt;p&gt;The throughput can be measured with queries that are generally sensible, such as &amp;quot;get articles by an author that I know with tags t1 and t2.&amp;quot;&lt;/p&gt;
&lt;p&gt;Then there are various pathological queries that work especially poorly with mapping. For example, if the types of subjects are not given, if the predicate is known at run time only, or if the graph is not given, we get a union of everything joined with another union of everything. Many of the joins between the terms of the different unions are identically empty, but the software may not know this.&lt;/p&gt;
&lt;p&gt;In a real world case, I would simply forbid such queries. In the benchmarking case, these may be of some interest. If the mapping is clever enough, it may survive cases like &amp;quot;list all predicates and objects of everything called gizmo where the predicate is in the product ontology&amp;quot;.&lt;/p&gt;
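&lt;p&gt;A toy Python illustration of why such queries blow up, under stated assumptions: each relational column is mapped to one predicate, each table to one class of subjects, and two triple patterns are joined on their subject. The table and column names are invented. With the predicate unknown, each pattern expands into a union over every mapped column, and a mapping that knows rows of different tables never share a subject can discard the cross-table pairs up front.&lt;/p&gt;
&lt;pre&gt;
```python
from itertools import product

# Hypothetical mapping: (table, column) pairs, one predicate per column.
MAPPING = [
    ("wp_posts", "title"), ("wp_posts", "author"),
    ("phpbb_posts", "subject"), ("phpbb_posts", "poster"),
]


def expand(predicate):
    """Union branches for one triple pattern; None means predicate unknown."""
    return [m for m in MAPPING if predicate in (None, m[1])]


def join_branches(left, right, prune=True):
    """Pairs of union branches to evaluate for a subject-to-subject join.

    A row of one table is never a row of another, so pairs that cross
    tables are identically empty and can be dropped when prune is set.
    """
    pairs = list(product(left, right))
    if prune:
        return [(l, r) for l, r in pairs if l[0] == r[0]]
    return pairs


naive = join_branches(expand(None), expand(None), prune=False)
pruned = join_branches(expand(None), expand(None))
```
&lt;/pre&gt;
&lt;p&gt;The gap between the naive and pruned counts grows with the square of the number of mapped columns, which is one reason to measure these cases separately from the sensible queries.&lt;/p&gt;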
&lt;p&gt;It may be good to divide the test into a set of straightforward mappings and special cases and measure them separately. The former will be queries that a reasonably written application would do for producing user reports.&lt;/p&gt;
&lt;/div&gt;</description></item><item><title>Virtuoso and ODS Update</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2006-08-10#1025</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1025#comments</comments><pubDate>Thu, 10 Aug 2006 11:55:26 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-04-16T16:53:34.000008-04:00</n0:modified><description>&lt;div&gt;
&lt;div style=&quot;display:none;&quot;&gt;Virtuoso and ODS Update&lt;/div&gt;
&lt;p&gt;We have released an update of &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x1b0d5100&quot;&gt;Virtuoso&lt;/a&gt; Open Source Edition and the &lt;a href=&quot;http://dbpedia.org/resource/OpenLink_Data_Spaces&quot; id=&quot;link-id0x1770ad30&quot;&gt;OpenLink Data Spaces&lt;/a&gt; suite.&lt;/p&gt;
&lt;p&gt;This marks the coming of age of our &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x1a1c6800&quot;&gt;RDF&lt;/a&gt; and &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x1779b790&quot;&gt;SPARQL&lt;/a&gt; efforts. We have the new &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x170db778&quot;&gt;SQL&lt;/a&gt; cost model with SPARQL awareness, and we have applications which present much of their &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x18ab4600&quot;&gt;data&lt;/a&gt; as SIOC, FOAF, ATOM, OWL and other formats.&lt;/p&gt;
&lt;p&gt;We continue refining these technologies. Our next roadmap item is mapping relational data into RDF and offering SPARQL access to relational data without data duplication. Expect a white paper about this soon.&lt;/p&gt;
&lt;/div&gt;</description></item><item><title>Object Relational Rediscovered?</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2006-07-13#1003</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1003#comments</comments><pubDate>Thu, 13 Jul 2006 12:33:32 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-04-16T16:13:26-04:00</n0:modified><description>&lt;div&gt;
&lt;div style=&quot;display:none;&quot;&gt;Object Relational Rediscovered?&lt;/div&gt;
&lt;p&gt;I have recently read some of Microsoft&amp;#39;s &lt;a href=&quot;http://dbpedia.org/resource/ADO.NET&quot; id=&quot;link-id0x173cea20&quot;&gt;ADO&lt;/a&gt; .NET 3 papers. I am reminded of the distant past when I designed Kubl, which later became OpenLink &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x18bdfe68&quot;&gt;Virtuoso&lt;/a&gt;. So I will reminisce and speculate a little.&lt;/p&gt;
&lt;p&gt;So now is the time when polymorphic queries, and the mixing of relational-style joins with object-style navigation, become politically acceptable and even recommended. There is finally a workable solution to having a foreign key in the database and a pointer or set of pointers in the client application. Not to mention change tracking, which makes it possible to update in-memory &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0xd6f0ae0&quot;&gt;data&lt;/a&gt; structures and commit a delta against the database without explicit update statements.&lt;/p&gt;
&lt;p&gt;All these questions already existed in the mid-90s and earlier. Since I was coming from OO and LISP into the database world, I even felt these questions to be important. The solution in the earliest Kubl was to have inheritance between tables (what became the &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0xddcdac0&quot;&gt;SQL&lt;/a&gt; 2K &lt;code&gt;UNDER&lt;/code&gt; clause) and a virtual column called &lt;code&gt;_ROW&lt;/code&gt; that would select a serialization of the primary key entry. Then there was the function &lt;code&gt;row_key()&lt;/code&gt;, which, when applied to a &lt;code&gt;_ROW&lt;/code&gt; virtual column, would return a database-wide unique identifier of the row, containing the key info and the key part values, plus an indication of which subtable of the table was at hand. There was also a function for dereferencing a &lt;code&gt;row_key&lt;/code&gt; to get the &lt;code&gt;_ROW&lt;/code&gt;. One could store &lt;code&gt;row_keys&lt;/code&gt; in columns and dereference these in queries. Within SQL, one could use the &lt;code&gt;row_column&lt;/code&gt; function to extract individual column values from a &lt;code&gt;row_key&lt;/code&gt; or &lt;code&gt;_ROW&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;This was all fine server side. But we also had a client for Franz Inc.&amp;#39;s Allegro Common Lisp that talked to Kubl&amp;#39;s &lt;a href=&quot;http://dbpedia.org/resource/Open_Database_Connectivity&quot; id=&quot;link-id0xde2c348&quot;&gt;ODBC&lt;/a&gt; listener. This client had basic statements, prepared statements, result sets, parameters and array parameters, a little like &lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id0x156409f8&quot;&gt;JDBC&lt;/a&gt; does now. But the extra was that we could do a mapping between a Lisp struct or object and a database key, so the &lt;code&gt;_ROW&lt;/code&gt; would automatically materialize into the Lisp struct or class instance. The mapping between these materializations and the &lt;code&gt;row_keys&lt;/code&gt; identifying them in the database was kept in a thread environment called object space. Updates could be relational-style &lt;code&gt;UPDATEs&lt;/code&gt; or consist of putting a &lt;code&gt;_ROW&lt;/code&gt; serialization in database format back into the Kubl store with a single SQL function.&lt;/p&gt;
&lt;p&gt;This was different from just storing object serializations into LOB columns, as is often done, insofar as the object classes and data members were really database tables and columns, thus native to the DBMS, not just opaque data to be processed client-side only.&lt;/p&gt;
&lt;p&gt;So it was then possible to program a little like what is shown in the ADO .NET 3 demos today, some ten years later.&lt;/p&gt;
&lt;p&gt;Some of these functions still exist in Virtuoso, albeit in a deprecated state, and there is no client that can use these to any advantage. Indeed, we dropped this line of work when Kubl became Virtuoso, mostly because there was no standard and no client applications that would use such features. Instead, we concentrated on virtual &lt;a href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id0x175a7b10&quot;&gt;RDBMS&lt;/a&gt;, transparently accessing any third party data via ODBC.&lt;/p&gt;
&lt;p&gt;Now however, as objects, both native SQL and Java and .NET, have become mainstream citizens of relational databases in general, Virtuoso and otherwise, and as Microsoft has legitimized accessing whole objects and not only scalar columns in result sets as part of ADO .NET 3, these things might be worth a second look.&lt;/p&gt;
&lt;/div&gt;</description></item><item><title>Introducing Virtuoso Open Source Edition</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2006-04-11#950</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=950#comments</comments><pubDate>Tue, 11 Apr 2006 16:33:07 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-04-16T16:13:32-04:00</n0:modified><description>&lt;div&gt;
&lt;div style=&quot;display:none;&quot;&gt;Introducing Virtuoso Open Source Edition&lt;/div&gt;
&lt;p&gt;I am Orri Erling, program manager for &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0xd7d9bc0&quot;&gt;Virtuoso&lt;/a&gt; at &lt;a href=&quot;http://www.openlinksw.com/dataspace/organization/openlink#this&quot; id=&quot;link-id0xd9951b0&quot;&gt;OpenLink Software&lt;/a&gt;. This &lt;a href=&quot;http://dbpedia.org/resource/Blog&quot; id=&quot;link-id0x1775bac0&quot;&gt;blog&lt;/a&gt; is about any and all aspects of technology that have to do with Virtuoso.&lt;/p&gt;
&lt;p&gt;The launch of &lt;a href=&quot;http://virtuoso.openlinksw.com/wiki/main/Main&quot; id=&quot;link-id10b0c208&quot;&gt;Virtuoso Open Source Edition (VOS)&lt;/a&gt; marks a new period in our participation in the database world. We will henceforth be much more active, publish much more material, have a faster release cycle and actively reach out to the various areas of the open source community.&lt;/p&gt;
&lt;p&gt;We have years&amp;#39; worth of demos, white papers, articles, a suite of Virtuoso-based applications, and much more that we will be unveiling over the following months.&lt;/p&gt;
&lt;p&gt;We will track different aspects of Virtuoso work on this and related blogs. In the medium term, we will talk about the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
  &lt;b&gt;&lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x171d1158&quot;&gt;RDF&lt;/a&gt;, &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x1a5dbd50&quot;&gt;SPARQL&lt;/a&gt; and &lt;a href=&quot;http://dbpedia.org/resource/Semantic_Web&quot; id=&quot;link-id0x17106870&quot;&gt;semantic web&lt;/a&gt; work&lt;/b&gt; - The initial VOS release has SPARQL support and this will continue to be refined and optimized. We will introduce SPARQL benchmark suites and the like as these become ready.&lt;/li&gt;
&lt;li&gt;
  &lt;b&gt;Relational database&lt;/b&gt; - Virtuoso&amp;#39;s extensible &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0xd3a91b0&quot;&gt;SQL&lt;/a&gt; and relational storage engine is the platform on which all the rest stands. Thus this continues to be improved, ranging from low level database engine work to SQL optimizations to various developer convenience features. A database-only configuration of Virtuoso is another possibility.&lt;/li&gt;
&lt;li&gt;
  &lt;b&gt;DAV and web services&lt;/b&gt; - Web services are the main entry point for all of Virtuoso&amp;#39;s features. These may eventually become more significant than the traditional SQL client interfaces, of which Virtuoso supports several.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There is a whole suite of next generation file server features to be unveiled. These include items such as automatic metadata extraction and logical views on content based on its metadata, permissions etc.&lt;/p&gt;
&lt;p&gt;In the immediate future, we will:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Keep enhancing the VOS wiki and edit the existing base of unpublished material to be ready for publication on this platform.&lt;/li&gt;
&lt;li&gt;Keep adding to technical notes and FAQs on compiling and running on different platforms and using the different runtime hosting options of Virtuoso.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The VOS development CVS will be updated at high frequency, in some areas even weekly. Stable snapshots will be made available 3 or 4 times a year.&lt;/p&gt;
&lt;p&gt;We will have a very exciting spring, with radically more participation in the database and open source worlds than ever. Look for frequent updates on this blog.&lt;/p&gt;
&lt;/div&gt;</description></item>
</channel>
</rss>
