<?xml version="1.0" encoding="UTF-8" ?>
<!--RDF based XML document generated By OpenLink Virtuoso-->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
 <rss:channel xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/">
  <rss:title>OpenLink Virtuoso (Product Blog)</rss:title>
  <rss:link>http://virtuoso.openlinksw.com/blog/vdb/blog/</rss:link>
  <rss:description>A great place to track Virtuoso&#39;s rapid evolution.</rss:description>
  <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Virtuoso Data Space Bot &lt;kidehen@openlinksw.com&gt;</dc:creator>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-10-11T00:19:18Z</dc:date>
  <dc:rights xmlns:dc="http://purl.org/dc/elements/1.1/">OpenLink Software 1998-2006</dc:rights>
  <dc:language xmlns:dc="http://purl.org/dc/elements/1.1/">en-us</dc:language>
  <rss:items>
   <rdf:Seq>
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1451" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1450" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1446" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1436" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1435" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1432" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1423" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1419" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1410" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1403" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1401" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1393" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1383" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1382" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1381" />
   </rdf:Seq>
  </rss:items>
 </rss:channel>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1451">
  <rss:title>Virtuoso Cluster Paper Update</rss:title>
  <rss:link>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1451</rss:link>
  <wfw:comment xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://virtuoso.openlinksw.com/mt-tb/Http/comments?id=1451</wfw:comment>
  <wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://virtuoso.openlinksw.com/blog/vdb/blog/gems/rsscomment.xml?:id=1451</wfw:commentRss>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-10-02T10:02:33Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">An updated version of the paper about Virtuoso Cluster is available at 2008webscale_rdf.pdf</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>An updated version of the paper about <a href="http://virtuoso.openlinksw.com" id="link-id0xc0abc50">Virtuoso</a> Cluster is available at <a href="http://www.openlinksw.com/weblog/oerling/2008webscale_rdf.pdf" id="link-id16459248">2008webscale_rdf.pdf</a>
</p>]]></content:encoded>
  <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Virtuoso Data Space Bot &lt;kidehen@openlinksw.com&gt;</dc:creator>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1450">
  <rss:title>Virtuoso Update, Billion Triples and Outlook</rss:title>
  <rss:link>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1450</rss:link>
  <wfw:comment xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://virtuoso.openlinksw.com/mt-tb/Http/comments?id=1450</wfw:comment>
  <wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://virtuoso.openlinksw.com/blog/vdb/blog/gems/rsscomment.xml?:id=1450</wfw:commentRss>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-10-02T10:02:32Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Virtuoso Update, Billion Triples and Outlook I will say a few things about what we have been doing and where we can go. Firstly, we have a fairly scalable platform with Virtuoso 6 Cluster. It was most recently tested with the workload discussed in the previous Billion Triples post. There is an updated version of the paper about this. This will be presented at the web scale workshop of ISWC 2008 in Karlsruhe. Right now, we are polishing some things in Virtuoso 6 -- some optimizations for smarter balancing of interconnect traffic over multiple network interfaces, and some more SQL optimizations specific to RDF. The must-have basics, like parallel running of sub-queries and aggregates, and all-around unrolling of loops of every kind into large partitioned batches, are all there and proven to work. We spent a lot of time around the Berlin SPARQL Benchmark story, so we got to the more advanced stuff like the Billion Triples Challenge rather late. We did along the way also run BSBM with an Oracle back-end, with Virtuoso mapping SPARQL to SQL. This merits its own analysis in the near future. This will be the basic how-to of mapping OLTP systems to RDF. Depending on the case, one can use this for lookups in real-time or ETL. RDF will deliver value in complex situations. An example of a complex relational mapping use case came from Ordnance Survey, presented at the RDB2RDF XG. Examples of complex warehouses include the Neurocommons database, the Billion Triples Challenge, and the Garlik DataPatrol. In comparison, the Berlin workload is really simple and one where RDF is not at its best, as amply discussed on the Linked Data forum. BSBM&#39;s primary value is as a demonstrator for the basic mapping tasks that will be repeated over and over for pretty much any online system when presence on the data web becomes as indispensable as presence on the HTML web. 
I will now talk about the complex warehouse/web-harvesting side. I will come to the mapping in another post. Now, all the things shown in the Billion Triples post can be done with a relational system specially built for each purpose. Since we are a general purpose RDBMS, we use this capability where it makes sense. For example, storing statistics about which tags or interests occur with which other tags or interests as RDF blank nodes makes no sense. We do not even make the experiment; we know ahead of time that the result is at least an order of magnitude in favor of the relational row-oriented solution in both space and time. Whenever there is a data structure specially made for answering one specific question, like joint occurrence of tags, RDB and mapping is the way to go. With Virtuoso, this can fully-well coexist with physical triples, and can still be accessed in SPARQL and mixed with triples. This is territory that we have not extensively covered yet, but we will be giving some examples about this later. The real value of RDF is in agility. When there is no time to design and load a new warehouse for every new question, RDF is unparalleled. Also SPARQL, once it has the necessary extensions of aggregating and sub-queries, is nicer than SQL, especially when we have sub-classes and sub-properties, transitivity, and &quot;same as&quot; enabled. These things have some run time cost and if there is a report one is hitting absolutely all the time, then chances are that resolving terms and identity at load-time and using materialized views in SQL is the reasonable thing. If one is inventing a new report every time, then RDF has a lot more convenience and flexibility. We are just beginning to explore what we can do with data sets such as the online conversation space, linked data, and the open ontologies of UMBEL and OpenCyc. It is safe to say that we can run with real world scale without loss of query expressivity. 
There is an incremental cost for performance but this is not prohibitive. Serving the whole billion triples set from memory would cost about $32K in hardware. $8K will do if one can wait for disk part of the time. One can use these numbers as a basis for costing larger systems. For online search applications, one will note that running the indexes pretty much from memory is necessary for flat response time. For back office analytics this is not necessarily as critical. It all depends on the use case. We expect to be able to combine geography, social proximity, subject matter, and named entities, with hierarchical taxonomies and traditional full text, and to present this through a simple user interface. We expect to do this with online response times if we have a limited set of starting points and do not navigate more than 2 or 3 steps from each starting point. An example would be to have a full text pattern and news group, and get the cloud of interests from the authors of matching posts. Another would be to make a faceted view of the properties of the 1000 people most closely connected to one person. Queries like finding the fastest online responders to questions about romance across the global board-scape, or finding the person who initiates the most long running conversations about crime, take a bit longer but are entirely possible. The genius of RDF is to be able to do these things within a general purpose database, ad hoc, in a single query language, mostly without materializing intermediate results. Any of these things could be done with arbitrary efficiency in a custom built system. But what is special now is that the cost of access to this type of information and far beyond drops dramatically as we can do these things in a far less labor intensive way, with a general purpose system, with no redesigning and reloading of warehouses at every turn. The query becomes a commodity. Still, one must know what to ask. 
In this respect, the self-describing nature of RDF is unmatched. A query like list the top 10 attributes with the most distinct values for all persons cannot be done in SQL. SQL simply does not allow the columns to be variable. Further, we can accept queries as text, the way people are used to supplying them, and use structure for drill-down or result-relevance, and also recognize named entities and subject matter concepts in query text. Very simple NLP will go a long way towards keeping SPARQL out of the user experience. The other way of keeping query complexity hidden is to publish hand-written SPARQL as parameter-fed canned reports. Between now and ISWC 2008, the last week of October, we will put out demos showing some of these things. Stay tuned.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<div style="display:none;">Virtuoso Update, Billion Triples and Outlook</div>
<p>I will say a few things about what we have been doing and where we can go.</p>

<p>Firstly, we have a fairly scalable platform with <a href="http://virtuoso.openlinksw.com" id="link-id0x1aa82dc0">Virtuoso</a> 6 Cluster. It was most recently tested with the workload discussed in the previous <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1445" id="link-id1638a5b8">Billion Triples post</a>.</p>

<p>There is an updated version of <a href="http://www.openlinksw.com/weblog/oerling/2008webscale_rdf.pdf" id="link-id16280a68">the paper about this</a>. This will be presented at the web scale workshop of ISWC 2008 in Karlsruhe.</p>

<p>Right now, we are polishing some things in Virtuoso 6 -- some optimizations for smarter balancing of interconnect traffic over multiple network interfaces, and some more <a href="http://dbpedia.org/resource/SQL" id="link-id0x1abd3f38">SQL</a> optimizations specific to <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1adbe410">RDF</a>. The must-have basics, like parallel running of sub-queries and aggregates, and all-around unrolling of loops of every kind into large partitioned batches, are all there and proven to work.</p>

<p>We spent a lot of time around the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x1aaa0e78">Berlin SPARQL Benchmark</a> story, so we got to the more advanced stuff like the <a href="http://challenge.semanticweb.org/" id="link-id0x1a860a50">Billion Triples Challenge</a> rather late. We did along the way also run <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x1a27f2a8">BSBM</a> with an <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id0x1ad5c918">Oracle</a> back-end, with Virtuoso mapping <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1cf0e4a0">SPARQL</a> to SQL. This merits its own analysis in the near future. This will be the basic how-to of mapping OLTP systems to RDF. Depending on the case, one can use this for lookups in real-time or ETL.</p>

<p>RDF will deliver value in complex situations. An example of a complex relational mapping use case came from Ordnance Survey, presented at the <a href="http://www.w3.org/2005/Incubator/rdb2rdf/" id="link-id0x1ab96bb0">RDB2RDF XG</a>. Examples of complex warehouses include the <a href="http://neurocommons.org/page/Main_Page" id="link-id0x1adb2db0">Neurocommons</a> database, the Billion Triples Challenge, and the <a href="http://www.garlik.com/" id="link-id0x1925c7b0">Garlik DataPatrol</a>.</p>

<p>In comparison, the Berlin workload is really simple and one where RDF is not at its best, as amply discussed on the <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x1c6d1480">Linked Data</a> forum. BSBM&#39;s primary value is as a demonstrator for the basic mapping tasks that will be repeated over and over for pretty much any online system when presence on the <a href="http://dbpedia.org/resource/Data" id="link-id0x1a937400">data</a> web becomes as indispensable as presence on the HTML web.</p>

<p>I will now talk about the complex warehouse/web-harvesting side. I will come to the mapping in another post.</p>

<p>Now, all the things shown in the <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1445" id="link-id14de1d18">Billion Triples post</a> can be done with a relational system specially built for each purpose. Since we are a general purpose <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x1a457c70">RDBMS</a>, we use this capability where it makes sense. For example, storing statistics about which tags or interests occur with which other tags or interests as RDF blank nodes makes no sense. We do not even make the experiment; we know ahead of time that the result is at least an order of magnitude in favor of the relational row-oriented solution in both space and time.</p>

<p>Whenever there is a data structure specially made for answering one specific question, like joint occurrence of tags, RDB and mapping is the way to go. With Virtuoso, this can fully-well coexist with physical triples, and can still be accessed in SPARQL and mixed with triples. This is territory that we have not extensively covered yet, but we will be giving some examples about this later.</p>

<p>The real value of RDF is in agility. When there is no time to design and load a new warehouse for every new question, RDF is unparalleled. Also SPARQL, once it has the necessary extensions of aggregating and sub-queries, is nicer than SQL, especially when we have sub-classes and sub-properties, transitivity, and &quot;same as&quot; enabled. These things have some run time cost and if there is a report one is hitting absolutely all the time, then chances are that resolving terms and identity at load-time and using materialized views in SQL is the reasonable thing. If one is inventing a new report every time, then RDF has a lot more convenience and flexibility.</p>

<p>We are just beginning to explore what we can do with data sets such as the online conversation space, linked data, and the open ontologies of <a href="http://umbel.org/about/" id="link-id0x1aa5ea18">UMBEL</a> and <a href="http://dbpedia.org/resource/Cyc" id="link-id0x1a631a20">OpenCyc</a>. It is safe to say that we can run with real world scale without loss of query expressivity. There is an incremental cost for performance but this is not prohibitive. Serving the whole billion triples set from memory would cost about $32K in hardware. $8K will do if one can wait for disk part of the time. One can use these numbers as a basis for costing larger systems. For online search applications, one will note that running the indexes pretty much from memory is necessary for flat response time. For back office analytics this is not necessarily as critical. It all depends on the use case.</p>

<p>We expect to be able to combine geography, social proximity, subject matter, and <a href="http://dbpedia.org/resource/Named_entity_recognition" id="link-id0x1aebdcc8">named entities</a>, with hierarchical taxonomies and traditional full text, and to present this through a simple user interface.</p>

<p>We expect to do this with online response times if we have a limited set of starting points and do not navigate more than 2 or 3 steps from each starting point. An example would be to have a full text pattern and news group, and get the cloud of interests from the authors of matching posts. Another would be to make a faceted view of the properties of the 1000 people most closely connected to one person.</p>
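
<p>The first of these examples might look roughly like the following. This is a sketch in SPARQL 1.1 aggregate syntax for use against the SPARQL end point, not a query from the demo. It assumes that the dc: and foaf: prefixes are predeclared on the server, that dc:creator points at the same resource that carries foaf:interest, and it uses the Virtuoso-specific bif:contains full text predicate. The search phrase is a placeholder, and the restriction to one news group is left out because it depends on how forums are modeled in the data.</p>

```sparql
# Sketch only: assumes the end point predeclares the dc: and foaf: prefixes
# and that dc:creator points at the resource carrying foaf:interest.
select ?i (count (distinct ?auth) as ?cnt)
where
  {
    ?d ?p ?o .
    filter (bif:contains (?o, "'semantic web'")) .   # Virtuoso full text match on a placeholder phrase
    ?d dc:creator ?auth .                            # authors of the matching posts
    ?auth foaf:interest ?i .                         # one navigation step from each author
  }
group by ?i
order by desc (?cnt)
limit 40
```

<p>Each interest is weighted by the number of distinct matching authors who declare it, which is one reasonable way to size the entries of a cloud.</p>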

<p>Queries like finding the fastest online responders to questions about romance across the global board-scape, or finding the person who initiates the most long running conversations about crime, take a bit longer but are entirely possible.</p>

<p>The genius of RDF is to be able to do these things within a general purpose database, ad hoc, in a single query language, mostly without materializing intermediate results. Any of these things could be done with arbitrary efficiency in a custom built system. But what is special now is that the cost of access to this type of <a href="http://dbpedia.org/resource/Information" id="link-id0x1ab88490">information</a> and far beyond drops dramatically as we can do these things in a far less labor intensive way, with a general purpose system, with no redesigning and reloading of warehouses at every turn. The query becomes a commodity.</p>

<p>Still, one must know what to ask. In this respect, the self-describing nature of RDF is unmatched. A query like <i>list the top 10 attributes with the most distinct values for all persons</i> cannot be done in SQL. SQL simply does not allow the columns to be variable.</p>
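
<p>As a rough illustration, the query in italics might be written as follows. This is a sketch in SPARQL 1.1 aggregate syntax for use against the SPARQL end point, not a query taken from the demo. It assumes that persons are typed as foaf:Person and that the foaf: prefix is predeclared on the server.</p>

```sparql
# Sketch only: assumes the end point predeclares the foaf: prefix.
select ?p (count (distinct ?o) as ?cnt)
where
  {
    ?s a foaf:Person .   # assumption: persons carry this type
    ?s ?p ?o .           # ?p ranges over every attribute; no column list is needed
  }
group by ?p
order by desc (?cnt)
limit 10
```

<p>Because the predicate is itself a variable, the same query keeps working unchanged when new attributes appear in the data.</p>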

<p>Further, we can accept queries as text, the way people are used to supplying them, and use structure for drill-down or result-relevance, and also recognize named entities and subject matter concepts in query text. Very simple NLP will go a long way towards keeping SPARQL out of the user experience.</p>

<p>The other way of keeping query complexity hidden is to publish hand-written SPARQL as parameter-fed canned reports.</p>

<p>Between now and ISWC 2008, the last week of October, we will put out demos showing some of these things. Stay tuned.</p>
</div>]]></content:encoded>
  <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Virtuoso Data Space Bot &lt;kidehen@openlinksw.com&gt;</dc:creator>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1446">
  <rss:title>OpenLink Software&#39;s Virtuoso Submission to the Billion Triples Challenge</rss:title>
  <rss:link>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1446</rss:link>
  <wfw:comment xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://virtuoso.openlinksw.com/mt-tb/Http/comments?id=1446</wfw:comment>
  <wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://virtuoso.openlinksw.com/blog/vdb/blog/gems/rsscomment.xml?:id=1446</wfw:commentRss>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-09-30T16:24:34Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Introduction We use Virtuoso 6 Cluster Edition to demonstrate the following: text and structured information based lookups; analytics queries; analysis of co-occurrence of features like interests and tags; dealing with identity of multiple IRIs (owl:sameAs). The demo is based on a set of canned SPARQL queries that can be invoked using the OpenLink Data Explorer (ODE) Firefox extension. The demo queries can also be run directly against the SPARQL end point. The demo is being worked on at the time of submission and may be shown online by appointment. Automatic annotation of the data based on named entity extraction is being worked on at the time of this submission. By the time of ISWC 2008 the set of sample queries will be enhanced with queries based on extracted named entities and their relationships in the UMBEL and OpenCyc ontologies. Also examples involving owl:sameAs are being added, likewise with similarity metrics and search hit scores. The Data The database consists of the billion triples data sets and some additions like UMBEL. Also the Freebase extract is newer than the challenge original. The triple count is 1115 million. In the case of web harvested resources, the data is loaded in one graph per resource. In the case of larger data sets like DBpedia or the US census, all triples of the provenance share a data set specific graph. All string literals are additionally indexed in a full text index. No stop words are used. Most queries do not specify a graph. Thus they are evaluated against the union of all the graphs in the database. The indexing scheme is SPOG, GPOS, POGS, OPGS. All indices ending in S are bitmap indices. The Queries The demo uses Virtuoso SPARQL extensions in most queries. These extensions consist on one hand of well known SQL features like aggregation with grouping and existence and value subqueries and on the other of RDF specific features. 
The latter include run time RDFS and OWL inferencing support and backward chaining subclasses and transitivity. Simple Lookups sparql select ?s ?p (bif:search_excerpt (bif:vector (&#39;semantic&#39;, &#39;web&#39;), ?o)) where { ?s ?p ?o . filter (bif:contains (?o, &quot;&#39;semantic web&#39;&quot;)) } limit 10 ; This looks up triples with semantic web in the object and makes a search hit summary of the literal, highlighting the search terms. sparql select ?tp count(*) where { ?s ?p2 ?o2 . ?o2 a ?tp . ?s foaf:nick ?o . filter (bif:contains (?o, &quot;plaid_skirt&quot;)) } group by ?tp order by desc 2 limit 40 ; This looks at what sorts of things are referenced by the properties of the foaf handle plaid_skirt. What are these things called? sparql select ?lbl count(*) where { ?s ?p2 ?o2 . ?o2 rdfs:label ?lbl . ?s foaf:nick ?o . filter (bif:contains (?o, &quot;plaid_skirt&quot;)) } group by ?lbl order by desc 2 ; Many of these things do not have an rdfs:label. Let us use a more general concept of label which groups dc:title, foaf:name and other name-like properties together. The subproperties are resolved at run time; there is no materialization. sparql define input:inference &#39;b3s&#39; select ?lbl count(*) where { ?s ?p2 ?o2 . ?o2 b3s:label ?lbl . ?s foaf:nick ?o . filter (bif:contains (?o, &quot;plaid_skirt&quot;)) } group by ?lbl order by desc 2 ; We can list sources by the topics they contain. Below we look for graphs that mention terrorist bombing. sparql select ?g count(*) where { graph ?g { ?s ?p ?o . filter (bif:contains (?o, &quot;&#39;terrorist bombing&#39;&quot;)) } } group by ?g order by desc 2 ; Now some web 2.0 tagging of search results. The tag cloud of &quot;computer&quot; sparql select ?lbl count (*) where { ?s ?p ?o . ?o bif:contains &quot;computer&quot; . ?s sioc:topic ?tg . optional { ?tg rdfs:label ?lbl } } group by ?lbl order by desc 2 limit 40 ; This query will find the posters who talk the most about sex. 
sparql select ?auth count (*) where { ?d dc:creator ?auth . ?d ?p ?o filter (bif:contains (?o, &quot;sex&quot;)) } group by ?auth order by desc 2 ; Analytics We look for people who are joined by having relatively uncommon interests but do not know each other. sparql select ?i ?cnt ?n1 ?n2 ?p1 ?p2 where { { select ?i count (*) as ?cnt where { ?p foaf:interest ?i } group by ?i } filter ( ?cnt &gt; 1 &amp;&amp; ?cnt &lt; 10) . ?p1 foaf:interest ?i . ?p2 foaf:interest ?i . filter (?p1 != ?p2 &amp;&amp; !bif:exists ((select (1) where {?p1 foaf:knows ?p2 })) &amp;&amp; !bif:exists ((select (1) where {?p2 foaf:knows ?p1 }))) . ?p1 foaf:nick ?n1 . ?p2 foaf:nick ?n2 . } order by ?cnt limit 50 ; The query takes a fairly long time, most of it spent counting occurrences of each interest across 25M interest triples. It then takes people that share the interest and checks that neither claims to know the other. It then sorts the results rarest interest first. The query can be written more efficiently but is here just to show that database-wide scans of the population are possible ad hoc. Now we go to SQL to make a tag co-occurrence matrix. This can be used for showing a Technorati-style related tags line at the bottom of a search result page. This showcases the use of SQL together with SPARQL. The half-matrix of tags t1, t2 with the co-occurrence count at the intersection is much more efficiently done in SQL, especially since it gets updated as the data changes. This is an example of materialized intermediate results based on warehoused RDF. 
create table tag_count (tcn_tag iri_id_8, tcn_count int, primary key (tcn_tag)); alter index tag_count on tag_count partition (tcn_tag int (0hexffff00)); create table tag_coincidence (tc_t1 iri_id_8, tc_t2 iri_id_8, tc_count int, tc_t1_count int, tc_t2_count int, primary key (tc_t1, tc_t2)); alter index tag_coincidence on tag_coincidence partition (tc_t1 int (0hexffff00)); create index tc2 on tag_coincidence (tc_t2, tc_t1) partition (tc_t2 int (0hexffff00)); How many times is each topic mentioned? insert into tag_count select * from (sparql define output:valmode &quot;LONG&quot; select ?t count (*) as ?cnt where { ?s sioc:topic ?t } group by ?t) xx option (quietcast); Take all t1, t2 where t1 and t2 are tags of the same subject, and store only the permutation where the internal id of t1 is less than that of t2. insert into tag_coincidence (tc_t1, tc_t2, tc_count) select &quot;t1&quot;, &quot;t2&quot;, cnt from (select &quot;t1&quot;, &quot;t2&quot;, count (*) as cnt from (sparql define output:valmode &quot;LONG&quot; select ?t1 ?t2 where { ?s sioc:topic ?t1 . ?s sioc:topic ?t2 }) tags where &quot;t1&quot; &lt; &quot;t2&quot; group by &quot;t1&quot;, &quot;t2&quot;) xx where isiri_id (&quot;t1&quot;) and isiri_id (&quot;t2&quot;) option (quietcast); Now put the individual occurrence counts into the same table with the co-occurrence. This denormalization makes the related tags lookup faster. update tag_coincidence set tc_t1_count = (select tcn_count from tag_count where tcn_tag = tc_t1), tc_t2_count = (select tcn_count from tag_count where tcn_tag = tc_t2); Now each tag_coincidence row has the joint occurrence count and individual occurrence counts. A single select will return a Technorati-style related tags listing. 
To show the URIs of the tags: select top 10 id_to_iri (tc_t1), id_to_iri (tc_t2), tc_count from tag_coincidence order by tc_count desc; Social Networks We look at what interests people have sparql select ?o ?cnt where { { select ?o count (*) as ?cnt where { ?s foaf:interest ?o } group by ?o } filter (?cnt &gt; 100) } order by desc 2 limit 100 ; Now the same for the Harry Potter fans sparql select ?i2 count (*) where { ?p foaf:interest &lt;http://www.livejournal.com/interests.bml?int=harry+potter&gt; . ?p foaf:interest ?i2 } group by ?i2 order by desc 2 limit 20 ; We see whether knows relations are symmetrical. We return the top n people that others claim to know without being reciprocally known. sparql select ?celeb, count (*) where { ?claimant foaf:knows ?celeb . filter (!bif:exists ((select (1) where { ?celeb foaf:knows ?claimant }))) } group by ?celeb order by desc 2 limit 10 ; We look for a well connected person to start from. sparql select ?p count (*) where { ?p foaf:knows ?k } group by ?p order by desc 2 limit 50 ; We look for the most connected of the many online identities of Stefan Decker. sparql select ?sd count (distinct ?xx) where { ?sd a foaf:Person . ?sd ?name ?ns . filter (bif:contains (?ns, &quot;&#39;Stefan Decker&#39;&quot;)) . ?sd foaf:knows ?xx } group by ?sd order by desc 2 ; We count the transitive closure of Stefan Decker&#39;s connections sparql select count (*) where { { select * where { ?s foaf:knows ?o } } option (transitive, t_distinct, t_in(?s), t_out(?o)) . filter (?s = &lt;mailto:stefan.decker@deri.org&gt;) } ; Now we do the same while following owl:sameAs links. sparql define input:same-as &quot;yes&quot; select count (*) where { { select * where { ?s foaf:knows ?o } } option (transitive, t_distinct, t_in(?s), t_out(?o)) . filter (?s = &lt;mailto:stefan.decker@deri.org&gt;) } ; Demo System The system runs on Virtuoso 6 Cluster Edition. The database is partitioned into 12 partitions, each served by a distinct server process. 
The system demonstrated hosts these 12 servers on 2 machines, each with 2 x Xeon 5345 and 16GB memory and 4 SATA disks. For scaling, the processes and corresponding partitions can be spread over a larger number of machines. If each ran on its own server with 16GB RAM, the whole data set could be served from memory. This is desirable for search engine or fast analytics applications. Most of the demonstrated queries run in memory on second invocation. The timing difference between first and second run is easily an order of magnitude.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<h2>Introduction</h2> 

<p>We use <a href="http://virtuoso.openlinksw.com" id="link-id0xb03e418">Virtuoso</a> 6 Cluster Edition to demonstrate the following:</p>
<ul>
<li>Text and structured <a href="http://dbpedia.org/resource/Information" id="link-id0xbd9dae8">information</a> based lookups</li>
<li>Analytics queries</li>
<li>Analysis of co-occurrence of features like interests and tags.</li>
<li>Dealing with identity of multiple IRIs (<a href="http://dbpedia.org/resource/Web_Ontology_Language" id="link-id0xb383dd8">owl</a>:sameAs)</li>
</ul>

<p>The demo is based on a set of canned <a href="http://dbpedia.org/resource/SPARQL" id="link-id0xbda6298">SPARQL</a> queries that can be invoked using the <a href="http://ode.openlinksw.com/" id="link-id0xbb292f0">OpenLink Data Explorer</a> (<a href="http://ode.openlinksw.com/" id="link-id0xc263528">ODE</a>) Firefox extension.</p>
<p>The demo queries can also be run directly against the SPARQL end point.</p>

<p>The demo is being worked on at the time of submission and may be shown online by appointment.</p>

<p>Automatic annotation of the <a href="http://dbpedia.org/resource/Data" id="link-id0xa173378">data</a> based on <a href="http://dbpedia.org/resource/Named_entity_recognition" id="link-id0xbdda558">named entity extraction</a> is
being worked on at the time of this submission.  By the time of ISWC
2008 the set of sample queries will be enhanced with queries based on
extracted <a href="http://dbpedia.org/resource/Named_entity_recognition" id="link-id0xa66fbe0">named entities</a> and their relationships in the <a href="http://umbel.org/about/" id="link-id0xa06e2c8">UMBEL</a> and OpenCyc
ontologies.
</p>

<p>Examples involving owl:sameAs are also being added, along with similarity metrics and search hit scores.</p>

<h2>The Data</h2>

<p>The database consists of the billion triples data sets and some additions like UMBEL. The Freebase extract is also newer than the challenge original.</p>
<p>The triple count is 1115 million.</p>
<p>In the case of web harvested resources, the data is loaded in one graph per resource.</p>
<p>In the case of larger data sets like <a href="http://dbpedia.org/resource/DBpedia" id="link-id0xc2bf770">DBpedia</a> or the US census, all triples from that source share one data-set-specific graph.</p>
<p>All string literals are additionally indexed in a full text index.  No stop words are used.</p>

<p>Most queries do not specify a graph.  Thus they are evaluated against the union of all the graphs in the database.
The indexing scheme is SPOG, GPOS, POGS, OPGS.  All indices ending in S are bitmap indices.
</p>

<h2>The Queries </h2>


<p>The demo uses Virtuoso SPARQL extensions in most queries.  These
extensions consist, on the one hand, of well-known <a href="http://dbpedia.org/resource/SQL" id="link-id0xaf8cb40">SQL</a> features such as
aggregation with grouping and existence and value subqueries and, on
the other, of <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xafdceb8">RDF</a>-specific features.
The latter include run-time RDFS and OWL inferencing support, with backward
chaining for subclasses and transitivity.  
</p>


<h3>Simple Lookups</h3> 

<pre>sparql 
select ?s ?p (bif:search_excerpt (bif:vector (&#39;semantic&#39;, &#39;web&#39;), ?o)) 
where 
  {
    ?s ?p ?o . 
    filter (bif:contains (?o, &quot;&#39;semantic web&#39;&quot;)) 
  } 
limit 10
;
</pre>

<p>This looks up triples with &quot;semantic web&quot; in the object and produces a search hit summary of the literal, 
highlighting the search terms.
</p>

<pre>sparql 
select ?tp count(*) 
where 
  { 
    ?s ?p2 ?o2 . 
    ?o2 a ?tp . 
    ?s foaf:nick ?o . 
    filter (bif:contains (?o, &quot;plaid_skirt&quot;)) 
  } 
group by ?tp
order by desc 2
limit 40
;
</pre>

<p>This looks at what sorts of things are referenced by the properties of the foaf handle plaid_skirt.</p>
<p>What are these things called?</p>

<pre>sparql 
select ?lbl count(*) 
where 
  { 
    ?s ?p2 ?o2 . 
    ?o2 rdfs:label ?lbl . 
    ?s foaf:nick ?o . 
    filter (bif:contains (?o, &quot;plaid_skirt&quot;)) 
  } 
group by ?lbl
order by desc 2
;
</pre>

<p>Many of these things do not have an rdfs:label.  Let us use a more general concept of label 
which groups dc:title, foaf:name and other name-like properties together.  The subproperties are 
resolved at run time; there is no materialization.
</p>

<pre>sparql 
define input:inference &#39;b3s&#39;
select ?lbl count(*) 
where 
  { 
    ?s ?p2 ?o2 . 
    ?o2 b3s:label ?lbl . 
    ?s foaf:nick ?o . 
    filter (bif:contains (?o, &quot;plaid_skirt&quot;)) 
  } 
group by ?lbl
order by desc 2
;
</pre>

<p>We can list sources by the topics they contain.  
Below we look for graphs that mention terrorist bombing.
</p>

<pre>sparql 
select ?g count(*) 
where 
  { 
    graph ?g 
      {
        ?s ?p ?o . 
        filter (bif:contains (?o, &quot;&#39;terrorist bombing&#39;&quot;)) 
      }
  } 
group by ?g 
order by desc 2
;
</pre>

<p>Now some Web 2.0-style tagging of search results.  The <a href="http://dbpedia.org/resource/Tag" id="link-id0xa8b89f8">tag</a> cloud of &quot;computer&quot;:</p>

<pre>sparql 
select ?lbl count (*) 
where 
  { 
    ?s ?p ?o . 
    ?o bif:contains &quot;computer&quot; . 
    ?s sioc:topic ?tg .
    optional 
      {
        ?tg rdfs:label ?lbl
      }
  }
group by ?lbl 
order by desc 2 
limit 40
;
</pre>

<p>This query will find the posters who talk the most about sex.</p>

<pre>sparql 
select ?auth count (*) 
where 
  { 
    ?d dc:creator ?auth .
    ?d ?p ?o
    filter (bif:contains (?o, &quot;sex&quot;)) 
  } 
group by ?auth
order by desc 2
;
</pre>

<h3>Analytics </h3>

<p>We look for people who are joined by having relatively uncommon interests but do not know each other.</p>

<pre>sparql select ?i ?cnt ?n1 ?n2 ?p1 ?p2 
where 
  {
    {
      select ?i count (*) as ?cnt 
      where 
        { ?p foaf:interest ?i } 
      group by ?i
    }
    filter ( ?cnt &gt; 1 &amp;&amp; ?cnt &lt; 10) .
    ?p1 foaf:interest ?i .
    ?p2 foaf:interest ?i .
    filter  (?p1 != ?p2 &amp;&amp; 
             !bif:exists ((select (1) where {?p1 foaf:knows ?p2 })) &amp;&amp; 
             !bif:exists ((select (1) where {?p2 foaf:knows ?p1 }))) .
    ?p1 foaf:nick ?n1 .
    ?p2 foaf:nick ?n2 .
  } 
order by ?cnt 
limit 50
;
</pre>

<p>The query takes a fairly long time, most of it spent counting the interested people per interest over 25M interest triples.  
It then takes pairs of people that share such an interest and checks that neither claims to know the other.  
It then sorts the results with the rarest interest first.  The query can be written more efficiently but is 
here just to show that database-wide scans of the population are possible ad hoc.
</p>

<p>Now we go to SQL to make a tag co-occurrence matrix. This can be used for showing a Technorati-style
related tags line at the bottom of a search result page.  This showcases the use of SQL together 
with SPARQL.  The half-matrix of tags t1, t2 with the co-occurrence count at the intersection is 
much more efficiently done in SQL, especially since it gets updated as the data changes.  
This is an example of materialized intermediate results based on warehoused RDF.
</p>

<pre>create table 
tag_count (tcn_tag iri_id_8, 
           tcn_count int, 
           primary key (tcn_tag));
           
alter index 
tag_count on tag_count partition (tcn_tag int (0hexffff00));

create table 
tag_coincidence (tc_t1 iri_id_8, 
                 tc_t2 iri_id_8, 
                 tc_count int, 
                 tc_t1_count int, 
                 tc_t2_count int, 
                 primary key  (tc_t1, tc_t2));

alter index 
tag_coincidence on tag_coincidence partition (tc_t1 int (0hexffff00));

create index 
tc2 on tag_coincidence (tc_t2, tc_t1) partition (tc_t2 int (0hexffff00));
</pre>

<p>How many times is each topic mentioned?</p>

<pre>
insert into tag_count 
  select * 
    from (sparql define output:valmode &quot;LONG&quot; 
                 select ?t count (*) as ?cnt 
                 where 
                   {
                     ?s sioc:topic ?t
                   } 
                 group by ?t) 
    xx option (quietcast);
</pre>

<p>Take all t1, t2 where t1 and t2 are tags of the same subject, and store only the permutation where the internal id of t1 &lt; that of t2.</p>

<pre>insert into tag_coincidence  (tc_t1, tc_t2, tc_count)
  select &quot;t1&quot;, &quot;t2&quot;, cnt 
    from 
      (select  &quot;t1&quot;, &quot;t2&quot;, count (*) as cnt 
         from 
           (sparql define output:valmode &quot;LONG&quot;
                   select ?t1 ?t2 
                     where 
                       {
                         ?s sioc:topic ?t1 . 
                         ?s sioc:topic ?t2 
                       }) tags
         where &quot;t1&quot; &lt; &quot;t2&quot; 
         group by &quot;t1&quot;, &quot;t2&quot;) xx
    where isiri_id (&quot;t1&quot;) and 
          isiri_id (&quot;t2&quot;) 
    option (quietcast); 
</pre>

<p>Now put the individual occurrence counts into the same table with the co-occurrence.  This 
denormalization makes the related tags lookup faster.
</p>


<pre>update tag_coincidence 
  set tc_t1_count = (select tcn_count from tag_count where tcn_tag = tc_t1),
      tc_t2_count = (select tcn_count from tag_count where tcn_tag = tc_t2);
</pre>

<p>Now each tag_coincidence row has the joint occurrence count and individual occurrence counts.  
A single select will return a Technorati-style related tags listing.
</p>

<p>To show the <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id0x9d4bc60">URI</a>s of the tags:
</p>

<pre>select top 10 id_to_iri (tc_t1), id_to_iri (tc_t2), tc_count 
  from tag_coincidence 
  order by tc_count desc;
</pre>
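
<p>A related-tags lookup for one given tag could be sketched along the following lines.  This is only an illustration built on the tables above: the tag IRI is a made-up placeholder, and we assume the one-argument form of iri_to_id for turning the IRI into its internal id.  Because only the permutation with the smaller id in tc_t1 is stored, the tag has to be looked for in both columns; the second branch is the kind of lookup the tc2 index is there for.
</p>

<pre>-- Sketch only: the tag IRI below is a hypothetical placeholder.
select top 10 id_to_iri (other_tag), tc_count 
  from 
    (select tc_t2 as other_tag, tc_count 
       from tag_coincidence 
       where tc_t1 = iri_to_id (&#39;http://example.org/tags/semweb&#39;) 
     union all 
     select tc_t1 as other_tag, tc_count 
       from tag_coincidence 
       where tc_t2 = iri_to_id (&#39;http://example.org/tags/semweb&#39;)) rel 
  order by tc_count desc;
</pre>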

<h3>Social Networks </h3>

<p>We look at what interests people have </p>

<pre>sparql 
select ?o ?cnt  
where 
  {
    {
      select ?o count (*) as ?cnt 
        where 
          {
            ?s foaf:interest ?o
          } 
        group by ?o
    } 
    filter (?cnt &gt; 100) 
  } 
order by desc 2 
limit 100
;
</pre>

<p>Now the same for the Harry Potter fans </p>

<pre>sparql 
select ?i2 count (*) 
where 
  { 
    ?p foaf:interest &lt;http://www.livejournal.com/interests.bml?int=harry+potter&gt; .
    ?p foaf:interest ?i2 
  } 
group by ?i2 
order by desc 2 
limit 20
;
</pre>

<p>We check whether foaf:knows relations are symmetrical.  We return the top n people that others claim to know without being reciprocally known.</p>

<pre>sparql 
select ?celeb count (*) 
where 
  { 
    ?claimant foaf:knows ?celeb . 
    filter (!bif:exists ((select (1) 
                          where 
                            {
                              ?celeb foaf:knows ?claimant 
                            }))) 
  } 
group by ?celeb 
order by desc 2 
limit 10
;
</pre>

<p>We look for a well connected person to start from.</p>

<pre>sparql 
select ?p count (*) 
where 
  {
    ?p foaf:knows ?k 
  } 
group by ?p 
order by desc 2 
limit 50
;
</pre>

<p>We look for the most connected of the many online identities of Stefan Decker.</p>

<pre>sparql 
select ?sd count (distinct ?xx) 
where 
  { 
    ?sd a foaf:Person . 
    ?sd ?name ?ns . 
    filter (bif:contains (?ns, &quot;&#39;Stefan Decker&#39;&quot;)) . 
    ?sd foaf:knows ?xx 
  } 
group by ?sd 
order by desc 2
;
</pre>

<p>We count the transitive closure of Stefan Decker&#39;s connections </p>

<pre>sparql 
select count (*) 
where 
  { 
    {
      select * 
      where 
        { 
          ?s foaf:knows ?o 
        }
    }
    option (transitive, t_distinct, t_in(?s), t_out(?o)) . 
    filter (?s = &lt;mailto:stefan.decker@deri.org&gt;)
  }
;
</pre>

<p>Now we do the same while following owl:sameAs links.</p>

<pre>sparql 
define input:same-as &quot;yes&quot;
select count (*) 
where 
  { 
    {
      select * 
      where 
        { 
          ?s foaf:knows ?o 
        }
    }
    option (transitive, t_distinct, t_in(?s), t_out(?o)) . 
    filter (?s = &lt;mailto:stefan.decker@deri.org&gt;)
  }
;
</pre>

<h2>Demo System</h2> 

<p>The system runs on Virtuoso 6 Cluster Edition.  The database is partitioned into 12 partitions, 
each served by a distinct server process. The system demonstrated hosts these 12 servers on 2 
machines, each with 2 x Xeon 5345, 16GB memory, and 4 SATA disks. For scaling, the processes 
and corresponding partitions can be spread over a larger number of machines.  If each ran on its 
own server with 16GB RAM, the whole data set could be served from memory. This is desirable for 
search engine or fast analytics applications. Most of the demonstrated queries run in memory on 
second invocation. The timing difference between first and second run is easily an order of 
magnitude.
</p>
</div>]]></content:encoded>
  <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Virtuso Data Space Bot &lt;kidehen@openlinksw.com&gt;</dc:creator>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1436">
  <rss:title>Requirements for Relational-to-RDF Mapping</rss:title>
  <rss:link>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1436</rss:link>
  <wfw:comment xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://virtuoso.openlinksw.com/mt-tb/Http/comments?id=1436</wfw:comment>
  <wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://virtuoso.openlinksw.com/blog/vdb/blog/gems/rsscomment.xml?:id=1436</wfw:commentRss>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-09-08T09:41:25Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Requirements for Relational-to-RDF Mapping Many of you will know about the W3C relational-to-RDF mapping incubator activity. The group is planning to suggest forming a working group for drawing up a specification for relational-to-RDF mapping. To this effect, I recently summarized the group discussions and some of our own experiences around the topic at &lt;http://esw.w3.org/topic/Rdb2RdfXG/ReqForMappingByOErling&gt;. I will here discuss this less formally and more in the light of our own experience. A working group goal statement must be neutral vis à vis the following points, even if any working group will unavoidably encounter these issues on the way. A blog post on the other hand can be more specific. I gave a talk to the RDB2RDF XG this spring, with these slides. The main point is that people would really like to map on-the-fly, if they only could. Making an RDF warehouse is not of value in itself, but it is true that in some cases this cannot be avoided. At first sight, one would think that a mapping specification could be neutral as regards whether one stores the mapped triples as triples or makes them on demand. There is almost no comparison between the complexity of doing non-trivial mappings on-the-fly versus mapping as ETL. Some of this complexity spills over into the requirements for a mapping language. Eliminating JOINs We expect to have a situation where one virtual triple can have many possible sources. The mapping is a union of mapped databases. Any integration scenario will have this feature. In such a situation, if we are JOINing using such triples, we end up with UNIONs of all databases that could produce the triples in question. This is generally not desired. Therefore, in the on-demand mapping case, there must be a lot of type inference logic that is not relevant in the ETL scenario. 
To make the point clearer, suppose a query like &quot;list the organizations whose representatives have published about xx.&quot; Suppose that there are three databases mapped, all of which have a table of organizations, a table of persons with affiliation to organizations, a table of publications by these persons, and finally a table of tags for the publications. Now, we want the laboratories that have published with articles with tag XX. It is a matter of common sense in this scenario that a publication will have the author and the author&#39;s affiliation in the same database. However, the RDB-to-RDF mapping does not necessarily know this, if all that it is told is that a table makes IRIs of publications by applying a certain pattern to the primary key of the publications table. To infer what needs to be inferred, the system must realize that IRIs from one mapping are disjoint from IRIs from another: A paper in database X will usually not have an author in database Y. The IDs in database Y, even if perchance equal to the IDs in X, do not mean the same thing, and there is no point joining across databases by them. This entire question is a non-issue in the ETL scenario, but is absolutely vital in the real-time mapping. This is also something that must be stated, at least implicitly, in any mapping. If a mapping translates keys of one place to IRIs with one pattern, and keys from another using another pattern, it must be inferable from the patterns whether the sets of IRIs will be disjoint. This is critical. Otherwise we will be joining everything to everything else, and there will be orders of magnitude of penalty compared to hand-crafted SQL over the same data sources. Expectations and Limitations on Queries SPARQL queries translate quite well to SQL when there is only one table that can produce a triple with a subject of a given class, when there are few columns that can map to a given predicate, and when classes and predicates are literals in the query. 
Virtuoso has some SQL extensions for dealing with breaking a wide table into a row per column. This facilitates dealing with predicates that are not known at query compile time. If the table in question is not managed by Virtuoso, Virtuoso&#39;s SQL virtualization/federation takes care of the matter. If a mapping system goes directly to third-party SQL, no such tricks can be used. The above example suggests that for supporting on-the-fly mapping without relying on owning the SQL underneath, some subsets of SPARQL may have to be defined. For example, one will probably have to require that all predicates be literals. The alternative is prohibitive run-time cost and complexity. But we must not lose the baby with the bath-water. Aside from offering global identifiers, RDF&#39;s attractions include subclasses and sub-predicates. In relational terms, these translate to UNIONs and do involve some added cost. A mapping system just has to have means of dealing with this cost, and of recognizing cases where this cost is prohibitive. Some further work is likely to be required for defining well-behaved subsets of SPARQL and mappings. ETL Ou Ne Pas ETL? Whether to warehouse or not? If one has hundreds of sources, of which some are not even relational, some ETL would seem necessary. Kashiup Vipul gave a position paper at last year&#39;s RDB-to-RDF mapping workshop in Cambridge, Massachusetts, about a system of relational mapping and on-demand RDF-izers of diverse semi-structured biomedical data, e.g., spreadsheets. The issue certainly exists, and any mapping work will likely encounter integration scenarios where one part is fairly neatly mapped from relational stores, and another part comes from a less structured repository of ETLed physical triples. Our take is that if something is a large or very large relational store, then map; else, ETL. 
With Virtuoso, we can mix mapped and local triples, but this is not a generally available feature of triple stores and standardization will likely have to wait until there are more implementations. Conclusions If you map on demand, watch out for an explosion of UNIONs when integrating sources that talk of similar things. If you integrate lots of sources, some ETL is likely unavoidable. Look for ways of dealing with part ETL, part mapping. ETLing everything is not always best or even possible. If you map a single fairly-clean RDB to RDF, mapping will work well, potentially much faster than triple storage. Higher storage density and more data per index lookup on the relational side. If you map on demand, some restrictions to SPARQL may be practically necessary. These have to do with variables in predicate position, variables in class position, etc. Individual implementations may support these, but standardization will likely have to put limits on them. This was a quick summary, by no means comprehensive, on what an eventual RDB2RDF working group would come across. This is a sort of addendum to the requirements I outlined on the ESW wiki.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<div style="display:none;">Requirements for Relational-to-RDF Mapping</div>
<p>Many of you will know about the W3C relational-to-<a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1e1be0a8">RDF</a> mapping incubator activity. The group is planning to suggest forming a working group for drawing up a specification for relational-to-RDF mapping.</p>

<p>To this effect, I recently summarized the group discussions and some of our own experiences around the topic at &lt;<a href="http://esw.w3.org/topic/Rdb2RdfXG/ReqForMappingByOErling" id="link-id146030e8">http://esw.w3.org/topic/Rdb2RdfXG/ReqForMappingByOErling</a>&gt;.</p>

<p>I will here discuss this less formally and more in the light of our own experience.  A working group goal statement must be neutral vis à vis the following points, even if any working group will unavoidably encounter these issues on the way.  A <a href="http://dbpedia.org/resource/Blog" id="link-id0x1e6b3950">blog</a> post on the other hand can be more specific.</p>

<p>I gave a talk to the <a href="http://www.w3.org/2005/Incubator/rdb2rdf/" id="link-id0xa0932c68">RDB2RDF XG</a> this spring, with these <a href="http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations/Relational2RDF.ppt" id="link-id14572540">slides</a>.</p>

<p>The main point is that people would really like to map on-the-fly, if they only could. Making an RDF warehouse is not of value in itself, but it is true that in some cases this cannot be avoided.</p>

<p>At first sight, one would think that a mapping specification could be neutral as regards whether one stores the mapped triples as triples or makes them on demand. In practice, however, non-trivial mapping on-the-fly is far more complex than mapping as ETL, and some of this complexity spills over into the requirements for a mapping language.</p>

<h2>Eliminating JOINs</h2> 

<p>We expect to have a situation where one virtual triple can have many possible sources.  The mapping is a union of mapped databases.  Any integration scenario will have this feature. In such a situation, if we are <code>JOIN</code>ing using such triples, we end up with <code>UNION</code>s of all databases that could produce the triples in question.   This is generally not desired.  Therefore, in the on-demand mapping case, there must be a lot of type inference logic that is not relevant in the ETL scenario.</p>

<p>To make the point clearer, consider a query like &quot;list the organizations whose representatives have published about <i>xx</i>.&quot;  Suppose that there are three databases mapped, all of which have a table of organizations, a table of persons with affiliation to organizations, a table of publications by these persons, and finally a table of tags for the publications. Now, we want the laboratories that have published articles with <a href="http://dbpedia.org/resource/Tag" id="link-id0xa0977bf0">tag</a> <i>XX</i>.  It is a matter of common sense in this scenario that a publication will have the author and the author&#39;s affiliation in the same database.  However, the RDB-to-RDF mapping does not necessarily know this, if all that it is told is that a table makes IRIs of publications by applying a certain pattern to the primary key of the publications table.  To infer what needs to be inferred, the system must realize that IRIs from one mapping are disjoint from IRIs from another:  A paper in database <i>X</i> will usually not have an author in database <i>Y</i>.  The IDs in database <i>Y</i>, even if perchance equal to the IDs in <i>X</i>, do not mean the same thing, and there is no point joining across databases by them.</p>

<p>This entire question is a non-issue in the ETL scenario, but is absolutely vital in the real-time mapping. This is also something that must be stated, at least implicitly, in any mapping.  If a mapping translates keys of one place to IRIs with one pattern, and keys from another using another pattern, it must be inferable from the patterns whether the sets of IRIs will be disjoint.</p>

<p>This is critical.  Otherwise we will be joining everything to everything else, and there will be orders of magnitude of penalty compared to hand-crafted <a href="http://dbpedia.org/resource/SQL" id="link-id0xa09490f8">SQL</a> over the same <a href="http://dbpedia.org/resource/Data" id="link-id0xa095efd0">data</a> sources.</p>
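
<p>As a rough illustration of the difference, consider two mapped databases with identical schemas.  The schema names <code>dbx</code> and <code>dby</code>, the tables, the columns, and the IRI patterns below are all invented for this sketch, and the SQL is generic rather than the output of any particular mapper; details such as string concatenation and casts vary by DBMS.  Without knowledge of disjointness, each triple pattern becomes a <code>UNION</code> over every source and the <code>JOIN</code> is made on the generated IRIs.  With it, the query collapses into one local <code>JOIN</code> per database.</p>

<pre>-- Naive translation: join UNIONs of all sources on generated IRI strings.
-- Cross-database combinations can never match, yet they are still evaluated.
select pub.pub_iri, pers.org_iri
  from (select concat (&#39;http://x.example.com/pub/&#39;, cast (id as varchar)) as pub_iri,
               concat (&#39;http://x.example.com/person/&#39;, cast (author_id as varchar)) as author_iri
          from dbx.publication
        union all
        select concat (&#39;http://y.example.com/pub/&#39;, cast (id as varchar)),
               concat (&#39;http://y.example.com/person/&#39;, cast (author_id as varchar))
          from dby.publication) pub,
       (select concat (&#39;http://x.example.com/person/&#39;, cast (id as varchar)) as person_iri,
               concat (&#39;http://x.example.com/org/&#39;, cast (org_id as varchar)) as org_iri
          from dbx.person
        union all
        select concat (&#39;http://y.example.com/person/&#39;, cast (id as varchar)),
               concat (&#39;http://y.example.com/org/&#39;, cast (org_id as varchar))
          from dby.person) pers
  where pub.author_iri = pers.person_iri;

-- With the IRI patterns known to be disjoint: one key-based join per database.
select concat (&#39;http://x.example.com/pub/&#39;, cast (p.id as varchar)) as pub_iri,
       concat (&#39;http://x.example.com/org/&#39;, cast (a.org_id as varchar)) as org_iri
  from dbx.publication p join dbx.person a on a.id = p.author_id
union all
select concat (&#39;http://y.example.com/pub/&#39;, cast (p.id as varchar)),
       concat (&#39;http://y.example.com/org/&#39;, cast (a.org_id as varchar))
  from dby.publication p join dby.person a on a.id = p.author_id;
</pre>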

<h2>Expectations and Limitations on Queries</h2>

<p>
  <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1e360230">SPARQL</a> queries translate quite well to SQL when there is only one table that can produce a triple with a subject of a given class, when there are few columns that can map to a given predicate, and when classes and predicates are constants rather than variables in the query.</p>

<p>
  <a href="http://virtuoso.openlinksw.com" id="link-id0x1f5edb30">Virtuoso</a> has some SQL extensions for dealing with breaking a wide table into a row per column.  This facilitates dealing with predicates that are not known at query compile time.  If the table in question is not managed by Virtuoso, Virtuoso&#39;s SQL virtualization/federation takes care of the matter.  If a mapping system goes directly to third-party SQL, no such tricks can be used.</p>

<p>The above example suggests that for supporting on-the-fly mapping without relying on owning the SQL underneath, some subsets of SPARQL may have to be defined.  For example, one will probably have to require that all predicates be constants rather than variables.  The alternative is prohibitive run-time cost and complexity.</p>

<p>But we must not throw the baby out with the bathwater. Aside from offering global identifiers, RDF&#39;s attractions include subclasses and sub-predicates.  In relational terms, these translate to <code>UNION</code>s and do involve some added cost.  A mapping system just has to have means of dealing with this cost, and of recognizing cases where this cost is prohibitive.  Some further work is likely to be required for defining well-behaved subsets of SPARQL and mappings.</p>

<h2>ETL Ou Ne Pas ETL?</h2>

<p>Whether to warehouse or not?  If one has hundreds of sources, of which some are not even relational, some ETL would seem necessary. Vipul Kashyap presented a position paper at last year&#39;s RDB-to-RDF mapping workshop in Cambridge, Massachusetts, about a system of relational mapping and on-demand RDF-izers of diverse semi-structured biomedical data, e.g., spreadsheets.  The issue certainly exists, and any mapping work will likely encounter integration scenarios where one part is fairly neatly mapped from relational stores, and another part comes from a less structured repository of ETLed physical triples.</p>

<p>Our take is that if something is a large or very large relational store, then map; else, ETL.  With Virtuoso, we can mix mapped and local triples, but this is not a generally available feature of triple stores and standardization will likely have to wait until there are more implementations.</p>

<h2>Conclusions</h2> 

<ul>
<li>If you map on demand, watch out for an explosion of <code>UNION</code>s when integrating sources that talk of similar things.</li>
<li>If you integrate lots of sources, some ETL is likely unavoidable.  Look for ways of dealing with part ETL, part mapping.  ETLing everything is not always best or even possible.</li>
<li>If you map a single fairly-clean RDB to RDF, mapping will work well, potentially much faster than triple storage.  The relational side offers higher storage density and more data per index lookup.</li>
<li>If you map on demand, some restrictions to SPARQL may be practically necessary.  These have to do with variables in predicate position, variables in class position, etc.  Individual implementations may support these, but standardization will likely have to put limits on them.</li>
</ul>

<p>This was a quick summary, by no means comprehensive, of what an eventual RDB2RDF working group would come across.  This is a sort of addendum to the requirements I outlined on the ESW wiki.</p>
</div>]]></content:encoded>
  <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Virtuso Data Space Bot &lt;kidehen@openlinksw.com&gt;</dc:creator>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1435">
  <rss:title>Transitivity and Graphs for SQL</rss:title>
  <rss:link>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1435</rss:link>
  <wfw:comment xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://virtuoso.openlinksw.com/mt-tb/Http/comments?id=1435</wfw:comment>
  <wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://virtuoso.openlinksw.com/blog/vdb/blog/gems/rsscomment.xml?:id=1435</wfw:commentRss>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-09-08T09:41:24Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Transitivity and Graphs for SQL Background I have mentioned on a couple of prior occasions that basic graph operations ought to be integrated into the SQL query language. The history of databases is by and large about moving from specialized applications toward a generic platform. The introduction of the DBMS itself is the archetypal example. It is all about extracting the common features of applications and making these the features of a platform instead. It is now time to apply this principle to graph traversal. The rationale is that graph operations are somewhat tedious to write in a parallelize-able, latency-tolerant manner. Writing them as one would for memory-based data structures is easier but totally unscalable as soon as there is any latency involved, i.e., disk reads or messages between cluster peers. The ad-hoc nature and very large volume of RDF data makes this a timely question. Up until now, the answer to this question has been to materialize any implied facts in RDF stores. If a was part of b, and b part of c, the implied fact that a is part of c would be inserted explicitly into the database as a pre-query step. This is simple and often efficient, but tends to have the downside that one makes a specialized warehouse for each new type of query. The activity becomes less ad-hoc. Also, this becomes next to impossible when the scale approaches web scale, or if some of the data is liable to be on-and-off included-into or excluded-from the set being analyzed. This is why with Virtuoso we have tended to favor inference on demand (&quot;backward chaining&quot;) and mapping of relational data into RDF without copying. The SQL world has taken steps towards dealing with recursion with the WITH - UNION construct which allows definition of recursive views. 
The idea there is to define, for example, a tree walk as a UNION of the data of the starting node plus the recursive walk of the starting node&#39;s immediate children. The main problem with this is that I do not very well see how a SQL optimizer could effectively rearrange queries involving JOINs between such recursive views. This model of recursion seems to lose SQL&#39;s non-procedural nature. One can no longer easily rearrange JOINs based on what data is given and what is to be retrieved. If the recursion is written from root to leaf, it is not obvious how to do this from leaf to root. At any rate, queries written in this way are so complex to write, let alone optimize, that I decided to take another approach. Take a question like &quot;list the parts of products of category C which have materials that are classified as toxic.&quot; Suppose that the product categories are a tree, the product parts are a tree, and the materials classification is a tree taxonomy where &quot;toxic&quot; has a multilevel substructure. Depending on the count of products and materials, the query can be evaluated as either going from products to parts to materials and then climbing up the materials tree to see if the material is toxic. Or one could do it in reverse, starting with the different toxic materials, looking up the parts containing these, going to the part tree to the product, and up the product hierarchy to see if the product is in the right category. One should be able to evaluate the identical query either way depending on what indices exist, what the cardinalities of the relations are, and so forth — regular cost based optimization. Especially with RDF, there are many problems of this type. In regular SQL, it is a long-standing cultural practice to flatten hierarchies, but this is not the case with RDF. In Virtuoso, we see SPARQL as reducing to SQL. Any RDF-oriented database-engine or query-optimization feature is accessed via SQL. 
Thus, if we address run-time-recursion in the Virtuoso query engine, this becomes, ipso facto, an SQL feature. Besides, we remember that SQL is a much more mature and expressive language than the current SPARQL recommendation. SQL and Transitivity We will here look at some simple social network queries. A later article will show how to do more general graph operations. We extend the SQL derived table construct, i.e., SELECT in another SELECT&#39;s FROM clause, with a TRANSITIVE clause. Consider the data: CREATE TABLE &quot;knows&quot; (&quot;p1&quot; INT, &quot;p2&quot; INT, PRIMARY KEY (&quot;p1&quot;, &quot;p2&quot;) ); ALTER INDEX &quot;knows&quot; ON &quot;knows&quot; PARTITION (&quot;p1&quot; INT); CREATE INDEX &quot;knows2&quot; ON &quot;knows&quot; (&quot;p2&quot;, &quot;p1&quot;) PARTITION (&quot;p2&quot; INT); We represent a social network with the many-to-many relation &quot;knows&quot;. The persons are identified by integers. INSERT INTO &quot;knows&quot; VALUES (1, 2); INSERT INTO &quot;knows&quot; VALUES (1, 3); INSERT INTO &quot;knows&quot; VALUES (2, 4); SELECT * FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT &quot;p1&quot;, &quot;p2&quot; FROM &quot;knows&quot; ) &quot;k&quot; WHERE &quot;k&quot;.&quot;p1&quot; = 1; We obtain the result: p1 p2 1 3 1 2 1 4 The operation is reversible: SELECT * FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT &quot;p1&quot;, &quot;p2&quot; FROM &quot;knows&quot; ) &quot;k&quot; WHERE &quot;k&quot;.&quot;p2&quot; = 4; p1 p2 2 4 1 4 Since now we give p2, we traverse from p2 towards p1. The result set states that 4 is known by 2 and 2 is known by 1. 
To see what would happen if x knowing y also meant y knowing x, one could write: SELECT * FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT &quot;p1&quot;, &quot;p2&quot; FROM (SELECT &quot;p1&quot;, &quot;p2&quot; FROM &quot;knows&quot; UNION ALL SELECT &quot;p2&quot;, &quot;p1&quot; FROM &quot;knows&quot; ) &quot;k2&quot; ) &quot;k&quot; WHERE &quot;k&quot;.&quot;p2&quot; = 4; p1 p2 2 4 1 4 3 4 Now, since we know that 1 and 4 are related, we can ask how they are related. SELECT * FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT &quot;p1&quot;, &quot;p2&quot;, T_STEP (1) AS &quot;via&quot;, T_STEP (&#39;step_no&#39;) AS &quot;step&quot;, T_STEP (&#39;path_id&#39;) AS &quot;path&quot; FROM &quot;knows&quot; ) &quot;k&quot; WHERE &quot;p1&quot; = 1 AND &quot;p2&quot; = 4; p1 p2 via step path 1 4 1 0 0 1 4 2 1 0 1 4 4 2 0 The two first columns are the ends of the path. The next column is the person that is a step on the path. The next one is the number of the step, counting from 0, so that the end of the path that corresponds to the end condition on the column designated as input, i.e., p1, has number 0. Since there can be multiple solutions, the last column is a sequence number allowing distinguishing multiple alternative paths from each other. For LinkedIn users, the friends ordered by distance and descending friend count query, which is at the basis of most LinkedIn search result views can be written as: SELECT p2, dist, (SELECT COUNT (*) FROM &quot;knows&quot; &quot;c&quot; WHERE &quot;c&quot;.&quot;p1&quot; = &quot;k&quot;.&quot;p2&quot; ) FROM (SELECT TRANSITIVE t_in (1) t_out (2) t_distinct &quot;p1&quot;, &quot;p2&quot;, t_step (&#39;step_no&#39;) AS &quot;dist&quot; FROM &quot;knows&quot; ) &quot;k&quot; WHERE &quot;p1&quot; = 1 ORDER BY &quot;dist&quot;, 3 DESC; p2 dist aggregate 2 1 1 3 1 0 4 2 0 How? The queries shown above work on Virtuoso v6. 
When running in cluster mode, several thousand graph traversal steps may be proceeding at the same time, meaning that all database access is parallelized and that the algorithm is internally latency-tolerant. By default, all results are produced in a deterministic order, permitting predictable slicing of result sets. Furthermore, for queries where both ends of a path are given, the optimizer may decide to attack the path from both ends simultaneously. So, supposing that every member of a social network has an average of 30 contacts, and we need to find a path between two users that are no more than 6 steps apart, we begin at both ends, expanding each up to 3 levels, and we stop when we find the first intersection. Thus, we reach 2 * 30^3 = 54,000 nodes, and not 30^6 = 729,000,000 nodes. Writing a generic database driven graph traversal framework on the application side, say in Java over JDBC, would easily be over a thousand lines. This is much more work than can be justified just for a one-off, ad-hoc query. Besides, the traversal order in such a case could not be optimized by the DBMS. Next In a future blog post I will show how this feature can be used for common graph tasks like critical path, itinerary planning, traveling salesman, the 8 queens chess problem, etc. There are lots of switches for controlling different parameters of the traversal. This is just the beginning. I will also give examples of the use of this in SPARQL.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<div style="display:none;">Transitivity and Graphs for SQL</div>
<h2>Background</h2> 

<p>I have mentioned on a couple of prior occasions that basic graph operations ought to be integrated into the <a href="http://dbpedia.org/resource/SQL" id="link-id0xa1a18c58">SQL</a> query language.</p>

<p>The history of databases is by and large about moving from specialized applications toward a generic platform. The introduction of the DBMS itself is the archetypal example.  It is all about extracting the common features of applications and making these the features of a platform instead.</p>

<p>It is now time to apply this principle to graph traversal.</p>

<p>The rationale is that graph operations are somewhat tedious to write in a parallelizable, latency-tolerant manner. Writing them as one would for memory-based <a href="http://dbpedia.org/resource/Data" id="link-id0xaf8c730">data</a> structures is easier but stops scaling as soon as there is any latency involved, i.e., disk reads or messages between cluster peers.</p>

<p>The ad-hoc nature and very large volume of <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xae41ef0">RDF</a> data make this a timely question.  Up until now, the answer to this question has been to materialize any implied facts in RDF stores.  If <i>a</i> was part of <i>b</i>, and <i>b</i> part of <i>c</i>, the implied fact that <i>a</i> is part of <i>c</i> would be inserted explicitly into the database as a pre-query step.</p>

<p>This is simple and often efficient, but tends to have the downside that one makes a specialized warehouse for each new type of query.  The activity becomes less ad-hoc.</p>

<p>Also, this becomes next to impossible when the scale approaches web scale, or if some of the data is liable to be included in, or excluded from, the set being analyzed from one query to the next.  This is why with <a href="http://virtuoso.openlinksw.com" id="link-id0xb68f9d0">Virtuoso</a> we have tended to favor inference on demand (&quot;backward chaining&quot;) and mapping of relational data into RDF without copying.</p>

<p>The SQL world has taken steps towards dealing with recursion with the <code>WITH - UNION</code> construct which allows definition of recursive views.  The idea there is to define, for example, a tree walk as a <code>UNION</code> of the data of the starting node plus the recursive walk of the starting node&#39;s immediate children.</p>
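<p>As a rough illustration of such a recursive view, the sketch below runs a standard <code>WITH RECURSIVE</code> query through Python&#39;s built-in <code>sqlite3</code> module. This is not Virtuoso syntax; the <code>part</code> table, its rows, and the <code>subtree</code> helper are hypothetical and exist only for this example.</p>

```python
import sqlite3

# Hypothetical part tree: each row links a part to its parent part.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE part (id INTEGER PRIMARY KEY, parent INTEGER, name TEXT);
    INSERT INTO part VALUES (1, NULL, 'bicycle');
    INSERT INTO part VALUES (2, 1, 'wheel');
    INSERT INTO part VALUES (3, 2, 'spoke');
    INSERT INTO part VALUES (4, 1, 'frame');
""")

# Recursive view: the starting node, UNION the walk of its children.
TREE_WALK = """
    WITH RECURSIVE subtree (id, name) AS (
        SELECT id, name FROM part WHERE id = ?
        UNION
        SELECT c.id, c.name FROM part c JOIN subtree s ON c.parent = s.id
    )
    SELECT id, name FROM subtree ORDER BY id
"""


def subtree(root_id):
    """Return (id, name) rows for root_id and all of its descendants."""
    return conn.execute(TREE_WALK, (root_id,)).fetchall()


print(subtree(2))  # the wheel and its spoke
```

<p>Note that the walk is written from root to leaf; asking for the ancestors of a part would need a second, differently written view.</p>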

<p>The main problem with this is that I do not see how an SQL optimizer could effectively rearrange queries involving <code>JOIN</code>s between such recursive views.  This model of recursion seems to lose SQL&#39;s non-procedural nature.  One can no longer easily rearrange <code>JOIN</code>s based on what data is given and what is to be retrieved.  If the recursion is written from root to leaf, it is not obvious how to do this from leaf to root.  At any rate, queries written in this way are so complex to write, let alone optimize, that I decided to take another approach.</p>

<p>Take a question like &quot;list the parts of products of category <i>C</i> which have materials that are classified as toxic.&quot;  Suppose that the product categories are a tree, the product parts are a tree, and the materials classification is a tree taxonomy where &quot;toxic&quot; has a multilevel substructure.</p>

<p>Depending on the count of products and materials, the query can be evaluated in either direction. One could go from products to parts to materials and then climb up the materials tree to see if the material is toxic. Or one could do it in reverse, starting with the different toxic materials, looking up the parts containing these, going up the part tree to the product, and up the product hierarchy to see if the product is in the right category.  One should be able to evaluate the identical query either way depending on what indices exist, what the cardinalities of the relations are, and so forth, i.e., regular cost-based optimization.</p>

<p>Especially with RDF, there are many problems of this type.  In regular SQL, it is a long-standing cultural practice to flatten hierarchies, but this is not the case with RDF.</p>

<p>In Virtuoso, we see <a href="http://dbpedia.org/resource/SPARQL" id="link-id0xb3bdcc0">SPARQL</a> as reducing to SQL.  Any RDF-oriented database-engine or query-optimization feature is accessed via SQL.  Thus, if we address run-time recursion in the Virtuoso query engine, this becomes, <i>ipso facto</i>, an SQL feature.  Besides, SQL is a much more mature and expressive language than the current SPARQL recommendation.</p>

<h2> SQL and Transitivity </h2>

<p>We will here look at some simple social network queries.  A later article will show how to do more general graph operations. We extend the SQL derived table construct, i.e., <code>SELECT</code> in another <code>SELECT</code>&#39;s <code>FROM</code> clause, with a <code>TRANSITIVE</code> clause.</p>

<p>Consider the data:</p>

<blockquote>
 <pre><code>CREATE TABLE &quot;knows&quot; 
   (&quot;p1&quot; INT, 
    &quot;p2&quot; INT, 
    PRIMARY KEY (&quot;p1&quot;, &quot;p2&quot;)
   );
ALTER INDEX &quot;knows&quot; 
   ON &quot;knows&quot; 
   PARTITION (&quot;p1&quot; INT);
CREATE INDEX &quot;knows2&quot; 
   ON &quot;knows&quot; (&quot;p2&quot;, &quot;p1&quot;) 
   PARTITION (&quot;p2&quot; INT);
</code>
 </pre></blockquote>

<p>We represent a social network with the many-to-many relation &quot;knows&quot;.  The persons are identified by integers.</p>

<blockquote>
 <pre><code>INSERT INTO &quot;knows&quot; VALUES (1, 2);
INSERT INTO &quot;knows&quot; VALUES (1, 3);
INSERT INTO &quot;knows&quot; VALUES (2, 4);</code>
 </pre>

<pre><code>SELECT * 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
               &quot;p1&quot;, 
            &quot;p2&quot; 
         FROM &quot;knows&quot;
        ) &quot;k&quot; 
   WHERE &quot;k&quot;.&quot;p1&quot; = 1;</code></pre></blockquote>

<p>We obtain the result:</p>

<blockquote>
<table width="100">
<tr>
    <th align="center" width="50">p1</th>
    <th align="center" width="50">p2</th>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">3</td>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">2</td>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">4</td>
  </tr>
</table>
</blockquote>

<p>The operation is reversible:</p>

<blockquote>
 <pre><code>SELECT * 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
               &quot;p1&quot;, 
            &quot;p2&quot; 
         FROM &quot;knows&quot;
        ) &quot;k&quot; 
   WHERE &quot;k&quot;.&quot;p2&quot; = 4;
</code>
 </pre>

<table width="100">
<tr>
    <th align="center" width="50">p1</th>
    <th align="center" width="50">p2</th>
  </tr>
<tr>
    <td align="center">2</td>
    <td align="center">4</td>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">4</td>
  </tr>
</table>
</blockquote>

<p>Since we now give <i>p2</i>, we traverse from <i>p2</i> towards <i>p1</i>. The result set states that 4 is known by 2 directly and, since 2 is known by 1, by 1 transitively.</p>
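<p>To make the two traversal directions concrete, here is a minimal in-memory sketch in Python of what the two queries compute over the three example rows. It is an assumption-laden illustration, not Virtuoso&#39;s implementation: the <code>reachable</code> function and its <code>reverse</code> flag are hypothetical, and the row order is not meant to match the server&#39;s output.</p>

```python
from collections import deque

# The three example rows of the "knows" table: (p1, p2) means p1 knows p2.
KNOWS = [(1, 2), (1, 3), (2, 4)]


def reachable(edges, start, reverse=False):
    """Return every node reachable from start, breadth first.

    With reverse=True the edges are followed from p2 towards p1.
    """
    neighbours = {}
    for p1, p2 in edges:
        src, dst = (p2, p1) if reverse else (p1, p2)
        neighbours.setdefault(src, []).append(dst)
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


print(reachable(KNOWS, 1))                # [2, 3, 4]
print(reachable(KNOWS, 4, reverse=True))  # [2, 1]
```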

<p>To see what would happen if <i>x</i> knowing <i>y</i> also meant <i>y</i> knowing <i>x</i>, one could write:</p>

<blockquote>
 <pre><code>SELECT * 
   FROM (SELECT 
            TRANSITIVE
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
               &quot;p1&quot;, 
            &quot;p2&quot; 
         FROM (SELECT 
                  &quot;p1&quot;, 
                  &quot;p2&quot; 
               FROM &quot;knows&quot; 
               UNION ALL 
                  SELECT 
                     &quot;p2&quot;, 
                     &quot;p1&quot; 
                  FROM &quot;knows&quot;
              ) &quot;k2&quot;
        ) &quot;k&quot; 
   WHERE &quot;k&quot;.&quot;p2&quot; = 4;</code>
 </pre>

<table width="100">
<tr>
    <th align="center" width="50">p1</th>
    <th align="center" width="50">p2</th>
  </tr>
<tr>
    <td align="center">2</td>
    <td align="center">4</td>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">4</td>
  </tr>
<tr>
    <td align="center">3</td>
    <td align="center">4</td>
  </tr>
</table>
</blockquote>


<p>Now, since we know that 1 and 4 are related, we can ask how they are related.</p>
<blockquote>
 <pre><code>SELECT * 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
               &quot;p1&quot;, 
            &quot;p2&quot;, 
            T_STEP (1) AS &quot;via&quot;, 
            T_STEP (&#39;step_no&#39;) AS &quot;step&quot;, 
            T_STEP (&#39;path_id&#39;) AS &quot;path&quot; 
         FROM &quot;knows&quot;
        ) &quot;k&quot; 
   WHERE &quot;p1&quot; = 1 
      AND &quot;p2&quot; = 4;</code>
 </pre>

<table width="250">
<tr>
    <th align="center" width="50">p1</th>
    <th align="center" width="50">p2</th>
    <th align="center" width="50">via</th>
    <th align="center" width="50">step</th>
    <th align="center" width="50">path</th>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">4</td>
    <td align="center">1</td>
    <td align="center">0</td>
    <td align="center">0</td>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">4</td>
    <td align="center">2</td>
    <td align="center">1</td>
    <td align="center">0</td>
  </tr>
<tr>
    <td align="center">1</td>
    <td align="center">4</td>
    <td align="center">4</td>
    <td align="center">2</td>
    <td align="center">0</td>
  </tr>
</table>
</blockquote>


<p>The first two columns are the ends of the path.  The next column is the person that is a step on the path.  The next one is the number of the step, counting from 0, so that the end of the path that corresponds to the end condition on the column designated as input, i.e., <i>p1</i>, has number 0.  Since there can be multiple solutions, the last column is a sequence number that distinguishes alternative paths from each other.</p>
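<p>The row layout can be mimicked with a small Python sketch that finds one shortest path by breadth-first search and numbers its steps from the input end. The <code>path_rows</code> function is hypothetical, and it reports a single path only, so the last column is always 0; how Virtuoso enumerates alternative paths is not modeled here.</p>

```python
from collections import deque

KNOWS = [(1, 2), (1, 3), (2, 4)]  # (p1, p2): p1 knows p2


def path_rows(edges, start, goal):
    """Return (p1, p2, via, step, path) rows for one shortest path.

    Steps are numbered from 0 at the input end (start); path is always 0
    because this sketch reports a single path only.
    """
    neighbours = {}
    for p1, p2 in edges:
        neighbours.setdefault(p1, []).append(p2)
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for nxt in neighbours.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    if goal not in parent:
        return []
    # Walk the parent pointers back from the goal, then reverse.
    path = []
    node = goal
    while node is not None:
        path.append(node)
        node = parent[node]
    path.reverse()
    return [(start, goal, via, step, 0) for step, via in enumerate(path)]


for row in path_rows(KNOWS, 1, 4):
    print(row)
```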

<p>For LinkedIn users: the query that lists friends ordered by distance and then by descending friend count, which is the basis of most LinkedIn search result views, can be written as:</p>

<blockquote>
 <pre><code>SELECT &quot;p2&quot;, 
      &quot;dist&quot;, 
      (SELECT 
          COUNT (*) 
          FROM &quot;knows&quot; &quot;c&quot; 
          WHERE &quot;c&quot;.&quot;p1&quot; = &quot;k&quot;.&quot;p2&quot;
      ) 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
               &quot;p1&quot;, 
            &quot;p2&quot;, 
            T_STEP (&#39;step_no&#39;) AS &quot;dist&quot;
         FROM &quot;knows&quot;
        ) &quot;k&quot; 
   WHERE &quot;p1&quot; = 1 
   ORDER BY &quot;dist&quot;, 3 DESC;</code>
 </pre>


<table width="150">
<tr>
    <th align="center" width="50">p2</th>
    <th align="center" width="50">dist</th>
    <th align="center" width="50">aggregate</th>
  </tr>
<tr>
    <td align="center">2</td>
    <td align="center">1</td>
    <td align="center">1</td>
  </tr>
<tr>
    <td align="center">3</td>
    <td align="center">1</td>
    <td align="center">0</td>
  </tr>
<tr>
    <td align="center">4</td>
    <td align="center">2</td>
    <td align="center">0</td>
  </tr>
</table>
</blockquote>
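<p>For comparison, the same ordering can be reproduced in memory with a short Python sketch. It assumes the three example rows and a hypothetical <code>friends_by_distance</code> function; the friend count is simply the number of outgoing <code>knows</code> edges, as in the scalar subquery above.</p>

```python
from collections import deque

KNOWS = [(1, 2), (1, 3), (2, 4)]  # (p1, p2): p1 knows p2


def friends_by_distance(edges, start):
    """Return (person, dist, friend_count) rows.

    Rows are sorted by distance, then by descending friend count.
    """
    neighbours = {}
    for p1, p2 in edges:
        neighbours.setdefault(p1, []).append(p2)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    rows = [
        (person, d, len(neighbours.get(person, [])))
        for person, d in dist.items()
        if person != start
    ]
    return sorted(rows, key=lambda row: (row[1], -row[2]))


print(friends_by_distance(KNOWS, 1))  # [(2, 1, 1), (3, 1, 0), (4, 2, 0)]
```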


<h2>How?</h2>

<p>The queries shown above work on Virtuoso v6.  When running in cluster mode, several thousand graph traversal steps may be proceeding at the same time, meaning that all database access is parallelized and that the algorithm is internally latency-tolerant.  By default, all results are produced in a deterministic order, permitting predictable slicing of result sets.</p>

<p>Furthermore, for queries where both ends of a path are given, the optimizer may decide to attack the path from both ends simultaneously. So, supposing that every member of a social network has an average of 30 contacts, and we need to find a path between two users that are no more than 6 steps apart, we begin at both ends, expanding each up to 3 levels, and we stop when we find the first intersection.  Thus, we reach on the order of 2 * 30^3 = 54,000 nodes, rather than 30^6 = 729,000,000 nodes.</p>
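<p>The arithmetic and the meet-in-the-middle idea can be sketched in Python as follows. This is only an illustration under stated assumptions: <code>frontier_estimate</code> and <code>meet_in_the_middle</code> are hypothetical names, the estimate counts only the deepest level on each side, and nothing here reflects how the Virtuoso optimizer actually schedules the two directions.</p>

```python
def frontier_estimate(fanout, steps, both_ends):
    """Estimate the nodes reached at the deepest level of the search."""
    if both_ends:
        return 2 * fanout ** (steps // 2)
    return fanout ** steps


def meet_in_the_middle(edges, a, b, max_steps):
    """Return True if a path from a to b of at most max_steps edges exists.

    Expands forwards from a and backwards from b, one level at a time in
    alternation, and stops at the first intersection of the two sides.
    """
    fwd, bwd = {}, {}
    for p1, p2 in edges:
        fwd.setdefault(p1, set()).add(p2)
        bwd.setdefault(p2, set()).add(p1)
    seen_a, seen_b = {a}, {b}
    front_a, front_b = {a}, {b}
    used = 0
    while True:
        if not seen_a.isdisjoint(seen_b):
            return True
        if used == max_steps:
            return False
        if used % 2 == 0:
            front_a = {n for f in front_a for n in fwd.get(f, ())} - seen_a
            seen_a |= front_a
        else:
            front_b = {n for f in front_b for n in bwd.get(f, ())} - seen_b
            seen_b |= front_b
        used += 1


print(frontier_estimate(30, 6, both_ends=True))   # 54000
print(frontier_estimate(30, 6, both_ends=False))  # 729000000
print(meet_in_the_middle([(1, 2), (1, 3), (2, 4)], 1, 4, 6))  # True
```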

<p>Writing a generic database-driven graph traversal framework on the application side, say in Java over <a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id0xa8a9ef8">JDBC</a>, would easily run to over a thousand lines. This is much more work than can be justified just for a one-off, ad-hoc query.  Besides, the traversal order in such a case could not be optimized by the DBMS.</p>

<h2>Next</h2> 

<p>In a future <a href="http://dbpedia.org/resource/Blog" id="link-id0xb526a40">blog</a> post I will show how this feature can be used for common graph tasks like critical path, itinerary planning, traveling salesman, the 8 queens chess problem, etc.  There are lots of switches for controlling different parameters of the traversal.  This is just the beginning.  I will also give examples of the use of this in SPARQL.</p>
</div>]]></content:encoded>
  <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Virtuso Data Space Bot &lt;kidehen@openlinksw.com&gt;</dc:creator>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1432">
  <rss:title>Epistemology of the Sponger, or How Virtuoso Drives a Web Query</rss:title>
  <rss:link>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1432</rss:link>
  <wfw:comment xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://virtuoso.openlinksw.com/mt-tb/Http/comments?id=1432</wfw:comment>
  <wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://virtuoso.openlinksw.com/blog/vdb/blog/gems/rsscomment.xml?:id=1432</wfw:commentRss>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-09-05T09:20:56Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Epistemology of the Sponger, or How Virtuoso Drives a Web Query Virtuoso has an extensive collection of RDF-izers called Sponger Cartridges. These take a web resource in one of 30+ formats (so far) and extract RDF from it. The Virtuoso Sponger is a device which evaluates a query and along the way, finds dereferenceable links, dereferences them, and iteratively re-evaluates the query, until either nothing new is found or some limit is reached. We could call this query-driven crawling. The idea is intuitive — what one looks for, determines what one finds. This does however raise certain questions pertaining to the nature and ultimate possibility of knowledge, i.e., epistemology. The process of querying could be said to go from the few to the many, just like the process of harvesting data from the web, the way any search engine does. One follows links or makes joins and thereby increases one&#39;s reach. The difference is that a query has no a priori direction. If I ask for the phone numbers of my friends and there are no phone numbers in the database, then it is valid to give an empty result without looking at my friends at all. Closed world, as it is said. Never mind that the friends would have had a &quot;see also&quot; link to a retrievable document that did have a phone number. The problem is that a query execution plan determines what possible dereferenceable material the query will encounter during its execution. What is worse, a query plan tends toward the minimal, i.e., toward minimizing the chances of encountering something dereferenceable along the way. Where query and crawl appeared to have a similarity, in fact they have two opposite goals. The user generally has no idea of the execution plan. In the general case, the user cannot have an idea of this plan. There are valid, over 40 year old reasons for leaving the query planning to the database. 
In exceptional situations the user can read or direct these, but this is really quite tedious and requires understanding that is basically never present. So, given a query, how do we find data that will match it, short of having a pre-loaded database of absolutely everything? This is certainly a desirable goal, and all in the open world, distributed spirit of the web. Let us limit ourselves to queries that have some literals in the object or subject positions. A SPARQL query is basically a graph. Its vertices are variables and literals, and its edges are triple patterns. An edge is labeled by a predicate. For now, we will consider the predicate to always be a literal. From each literal, we can draw a tree, following each edge starting at this literal and descending until we find another literal. Each tree is not always a spanning tree of the graph, but all the trees collectively span the graph. Consider the query { &lt;john&gt; knows ?x . &lt;mary&gt; knows ?x . ?x label ?l }. The starting points are the literals john and mary. The john tree has one child, ?x, which has the children mary and ?l. One could notate it as { &lt;john&gt; knows ?x . {{ &lt;mary&gt; knows ?x} UNION {?x label ?l}}} That is, the head first, and if it has more than one child, a union listing them, recursively. If one composed such queries for each literal in the original pattern and evaluated each as a breadth first walk of the tree, no query optimization tricks, and for each binding of each variable, recorded whether there was something to dereference, one would in a finite time have reached all the directly reachable data. Then one could evaluate the original query, using whatever plan was preferred. The check for dereferenceable data applied to each IRI-valued binding formed in the above evaluation, would consist of looking for &quot;see also&quot;, &quot;same as&quot;, and other such properties of the IRI. It could also consult text based search engines. 
Since the evaluation is breadth first, it generates a large number of parallel tasks and is fairly latency tolerant, i.e., it will not die if it must retrieve a few pages from remote sources. We will leave the exact rewrite rules for unions, optionals, aggregates, subqueries, and so on, as an exercise; the general idea should be clear enough. We have here shown a way of transforming SPARQL queries in such a way as to guarantee dereferencing of findable links, without requiring the end user to either explicitly specify or understand query plans. The present Sponger does not work exactly in this manner but it will be developed in this direction. Fortunately, the algorithms outlined above are nothing complicated.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<div style="display:none;">Epistemology of the Sponger, or How Virtuoso Drives a Web Query</div>
<p>
  <a href="http://virtuoso.openlinksw.com" id="link-id0x1ed6cf28">Virtuoso</a> has an extensive collection of <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1f8d1f78">RDF</a>-izers called Sponger Cartridges.  These take a web resource in one of 30+ formats (so far) and extract RDF from it.  The Virtuoso <a href="http://virtuoso.openlinksw.com/Whitepapers/html/VirtSpongerWhitePaper.html" id="link-id0x1edc90e8">Sponger</a> is a device which evaluates a query and along the way, finds dereferenceable links, dereferences them, and iteratively re-evaluates the query, until either nothing new is found or some limit is reached.</p>
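<p>The evaluate, dereference, re-evaluate loop can be pictured with a toy Python sketch. Everything in it is hypothetical: the <code>query_driven_crawl</code> function, its callbacks, and the tiny in-memory &quot;web&quot; are stand-ins for illustration and do not correspond to any Sponger API.</p>

```python
def query_driven_crawl(evaluate, dereference, store, max_rounds=10):
    """Re-evaluate a query until no new links turn up or a limit is reached.

    evaluate(store) returns (results, links); dereference(link) returns an
    iterable of triples. Both callbacks are hypothetical stand-ins.
    """
    fetched = set()
    results, links = evaluate(store)
    rounds = 0
    while rounds != max_rounds:
        new_links = [link for link in links if link not in fetched]
        if not new_links:
            break
        for link in new_links:
            fetched.add(link)
            store.update(dereference(link))
        results, links = evaluate(store)
        rounds += 1
    return results


# Toy data: one retrievable document holding a friend's phone number.
WEB = {"doc:mary": {("mary", "phone", "555-0100")}}
STORE = {("me", "knows", "mary"), ("mary", "seeAlso", "doc:mary")}


def evaluate(store):
    """Phone numbers of my friends, plus the see-also links met on the way."""
    friends = {o for s, p, o in store if s == "me" and p == "knows"}
    phones = sorted(o for s, p, o in store if s in friends and p == "phone")
    links = sorted(o for s, p, o in store if s in friends and p == "seeAlso")
    return phones, links


def dereference(link):
    """Look the link up in the toy web; unknown links yield nothing."""
    return WEB.get(link, set())


print(query_driven_crawl(evaluate, dereference, set(STORE)))  # ['555-0100']
```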

<p>We could call this <i>query-driven crawling</i>.  The idea is intuitive: what one looks for determines what one finds.</p>

<p>This does however raise certain questions pertaining to the nature and ultimate possibility of <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x1f836b68">knowledge</a>, i.e., epistemology.</p>

<p>The process of querying could be said to go from the few to the many, just like the process of harvesting <a href="http://dbpedia.org/resource/Data" id="link-id0x1edb1648">data</a> from the web, the way any search engine does.  One follows links or makes joins and thereby increases one&#39;s reach.</p>

<p>The difference is that a query has no <i>a priori</i> direction.  If I ask for the phone numbers of my friends and there are no phone numbers in the database, then it is valid to give an empty result without looking at my friends at all.  <a href="http://dbpedia.org/resource/Closed_world_assumption" id="link-id0x1edf1f30">Closed world</a>, as it is said. Never mind that the friends would have had a &quot;see also&quot; link to a retrievable document that did have a phone number.</p>

<p>The problem is that a query execution plan determines what possible dereferenceable material the query will encounter during its execution.  What is worse, a query plan tends toward the minimal, i.e., toward minimizing the chances of encountering something dereferenceable along the way.  Where query and crawl appeared to have a similarity, in fact they have two opposite goals.</p>

<p>The user generally has no idea of the execution plan.  In the general case, the user <i>cannot</i> have an idea of this plan.  There are valid reasons, over 40 years old, for leaving query planning to the database.  In exceptional situations the user can read or direct the plans, but this is really quite tedious and requires an understanding that is basically never present.</p>

<p>So, given a query, how do we find data that will match it, short of having a pre-loaded database of absolutely everything?  This is certainly a desirable goal, and very much in the <a href="http://dbpedia.org/resource/Open_world_assumption" id="link-id0x1eb46548">open world</a>, distributed spirit of the web.</p>

<p>Let us limit ourselves to queries that have some literals (here meaning constants, i.e., anything that is not a variable) in the object or subject positions. A <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1ed293f8">SPARQL</a> query is basically a graph.  Its vertices are variables and literals, and its edges are triple patterns.  An edge is labeled by a predicate.  For now, we will consider the predicate to always be a literal.  From each literal, we can draw a tree, following each edge starting at this literal and descending until we find another literal.  A single tree is not always a spanning tree of the graph, but all the trees collectively span the graph.</p>

<p>Consider the query </p>
<blockquote>
<code>{ &lt;john&gt; knows ?x . &lt;mary&gt; knows ?x . ?x label ?l }</code>
</blockquote>
<p>The starting points are the literals <code>john</code> and <code>mary</code>.  The <code>john</code> tree has one child, <code>?x</code>, which has the children <code>mary</code> and <code>?l</code>.  One could notate it as</p>
<blockquote>
<code>{ &lt;john&gt; knows ?x . {{ &lt;mary&gt; knows ?x} UNION {?x label ?l}}}</code>
</blockquote>
<p>That is, the head comes first, and if it has more than one child, a union listing them follows, recursively.</p>

<p>If one composed such queries for each literal in the original pattern, evaluated each as a breadth-first walk of the tree with no query optimization tricks, and recorded for each binding of each variable whether there was something to dereference, one would in finite time have reached all the directly reachable data. Then one could evaluate the original query, using whatever plan was preferred.</p>
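<p>The tree construction for the example query can be sketched in Python as below. This is a minimal illustration, assuming triple patterns are given as plain tuples and that any term not starting with <code>?</code> is a constant; the <code>tree_from</code> function and the nested-dict representation are hypothetical and are not how the Sponger represents queries.</p>

```python
from collections import deque

# The example query as (subject, predicate, object) patterns.
PATTERNS = [("john", "knows", "?x"), ("mary", "knows", "?x"), ("?x", "label", "?l")]


def is_variable(term):
    """Terms starting with '?' are variables; everything else is a constant."""
    return term.startswith("?")


def tree_from(root, patterns):
    """Return the tree rooted at the constant root as nested dicts.

    Edges are followed breadth first in either direction, each pattern is
    used once, and descent stops at any constant other than the root.
    """
    tree = {root: {}}
    used = set()
    queue = deque([(root, tree[root])])
    while queue:
        node, children = queue.popleft()
        for i, (s, _p, o) in enumerate(patterns):
            if i in used or node not in (s, o):
                continue
            used.add(i)
            other = o if node == s else s
            children[other] = {}
            if is_variable(other):
                queue.append((other, children[other]))
    return tree


constants = [t for s, _p, o in PATTERNS for t in (s, o) if not is_variable(t)]
for constant in constants:
    print(tree_from(constant, PATTERNS))
```

<p>For <code>john</code> this yields one child, <code>?x</code>, whose children are <code>mary</code> and <code>?l</code>, matching the description above.</p>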

<p>The check for dereferenceable data, applied to each IRI-valued binding formed in the above evaluation, would consist of looking for &quot;see also&quot;, &quot;same as&quot;, and other such properties of the IRI.  It could also consult text-based search engines.  Since the evaluation is breadth first, it generates a large number of parallel tasks and is fairly latency tolerant, i.e., it will not die if it must retrieve a few pages from remote sources.  We will leave the exact rewrite rules for unions, optionals, aggregates, subqueries, and so on, as an exercise; the general idea should be clear enough.</p>
 
<p>We have here shown a way of transforming SPARQL queries in such a way as to guarantee dereferencing of findable links, without requiring the end user to either explicitly specify or understand query plans.</p>

<p>The present Sponger does not work exactly in this manner but it will be developed in this direction.  Fortunately, the algorithms outlined above are nothing complicated.</p>
</div>]]></content:encoded>
  <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Virtuso Data Space Bot &lt;kidehen@openlinksw.com&gt;</dc:creator>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1423">
  <rss:title>A quick look at SP2B, the SPARQL Performance Benchmark</rss:title>
  <rss:link>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1423</rss:link>
  <wfw:comment xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://virtuoso.openlinksw.com/mt-tb/Http/comments?id=1423</wfw:comment>
  <wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://virtuoso.openlinksw.com/blog/vdb/blog/gems/rsscomment.xml?:id=1423</wfw:commentRss>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-08-27T16:03:40Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">A quick look at SP2B, the SPARQL Performance Benchmark I finally got around to running the SP2B SPARQL Performance Benchmark on the current Virtuoso Open Source Edition, v5.0.8. I ran it with the 5M triples scale, which is the highest scale for which the authors give numbers. I got a run time of 25 minutes for the 12 queries, giving an arithmetic mean of the query time of 125 seconds. This is better than the 800 or so seconds that the authors had measured. Also, Q6 of the set had failed for the authors, but we have since fixed this; the fix is in the v5.0.8 cut. I also tried it with a scale of 25M, but this became I/O bound and took a bit longer. I will try this with v6 and v7 cluster later, which are vastly better at anything I/O bound. The machine was a 2GHz Xeon with 8G RAM. The query text was the one from the authors, with an explicit FROM clause added; the client was the command line Interactive SQL (iSQL). If one does the test with the default index layout without specifying a graph, things will not work very well. Also, returning the million-row results of these queries over the SPARQL protocol is not practical. I will say something more about SP2B when I get to have a closer look.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<div style="display:none;">A quick look at SP2B, the SPARQL Performance Benchmark</div>
<p>I finally got around to running the <a href="http://dbis.informatik.uni-freiburg.de/index.php?project=SP2B" id="link-id17bac628">SP<sup>2</sup>B SPARQL Performance Benchmark</a> on the current <a href="http://virtuoso.openlinksw.com" id="link-id0x1dcaaa48">Virtuoso</a> Open Source Edition, v5.0.8.</p>
<p>I ran it with the 5M triples scale, which is the highest scale for which the authors give numbers.</p>
<p>I got a run time of 25 minutes for the 12 queries, giving an arithmetic mean of the query time of 125 seconds.  This is better than the 800 or so seconds that the authors had measured.  Also, Q6 of the set had failed for the authors, but we have since fixed this; the fix is in the v5.0.8 cut.</p>
<p>I also tried it with a scale of 25M, but this became I/O bound and took a bit longer.  I will try this with v6 and v7 cluster later, which are vastly better at anything I/O bound.</p>
<p>The machine was a 2GHz Xeon with 8G RAM.  The query text was the one from the authors, with an explicit <code>FROM</code> clause added; the client was the command line Interactive <a href="http://dbpedia.org/resource/SQL" id="link-id0x1be2c808">SQL</a> (iSQL).</p>
<p>If one does the test with the default index layout without specifying a graph, things will not work very well.  Also, returning the million-row results of these queries over the <a href="http://www.w3.org/TR/rdf-sparql-protocol/" id="link-id0x1d7ac018">SPARQL protocol</a> is not practical.</p>
<p>I will say something more about SP<sup>2</sup>B when I get to have a closer look.</p>
</div>]]></content:encoded>
  <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Virtuso Data Space Bot &lt;kidehen@openlinksw.com&gt;</dc:creator>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1419">
  <rss:title>Configuring Virtuoso for Benchmarking</rss:title>
  <rss:link>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1419</rss:link>
  <wfw:comment xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://virtuoso.openlinksw.com/mt-tb/Http/comments?id=1419</wfw:comment>
  <wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://virtuoso.openlinksw.com/blog/vdb/blog/gems/rsscomment.xml?:id=1419</wfw:commentRss>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-08-25T14:06:11Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Configuring Virtuoso for Benchmarking I will here summarize what should be known about running benchmarks with Virtuoso. Physical Memory For 8G RAM, in the [Parameters] stanza of virtuoso.ini, set — [Parameters] ... NumberOfBuffers = 550000 For 16G RAM, double this— [Parameters] ... NumberOfBuffers = 1100000 Transaction Isolation For most cases, certainly all RDF cases, Read Committed should be the default transaction isolation. In the [Parameters] stanza of virtuoso.ini, set — [Parameters] ... DefaultIsolation = 2 Multiuser Workload If ODBC, JDBC, or similarly connected client applications are used, there must be more ServerThreads available than there will be client connections. In the [Parameters] stanza of virtuoso.ini, set — [Parameters] ... ServerThreads = 100 With web clients (unlike ODBC, JDBC, or similar clients), it may be justified to have fewer ServerThreads than there are concurrent clients. The MaxKeepAlives should be the maximum number of expected web clients. This can be more than the ServerThreads count. In the [HTTPServer] stanza of virtuoso.ini, set — [HTTPServer] ... ServerThreads = 100 MaxKeepAlives = 1000 KeepAliveTimeout = 10 Note — The [HTTPServer] ServerThreads are taken from the total pool made available by the [Parameters] ServerThreads. Thus, the [Parameters] ServerThreads should always be at least as large as (and is best set greater than) the [HTTPServer] ServerThreads, and if using the closed-source Commercial Version, should not exceed the licensed thread count. Disk Use The basic rule is to use one stripe (file) per distinct physical device (not per file system), using no RAID. For example, one might stripe a database over 6 files (6 physical disks), with an initial size of 60000 pages (the files will grow as needed). For the above described example, in the [Database] stanza of virtuoso.ini, set — [Database] ... 
Striping = 1 MaxCheckpointRemap = 2000000 — and in the [Striping] stanza, on one line per SegmentName, set — [Striping] ... Segment1 = 60000 , /virtdev/db/virt-seg1.db = q1 , /data1/db/virt-seg1-str2.db = q2 , /data2/db/virt-seg1-str3.db = q3 , /data3/db/virt-seg1-str4.db = q4 , /data4/db/virt-seg1-str5.db = q5 , /data5/db/virt-seg1-str6.db = q6 As can be seen here, each file gets a background IO thread (the = qxxx clause). It should be noted that all files on the same physical device should have the same qxxx value. This is not directly relevant to the benchmarking scenario above, because we have only one file per device, and thus only one file per IO queue. SQL Optimization If queries have lots of joins but access little data, as with the Berlin SPARQL Benchmark, the SQL compiler must be told not to look for better plans if the best plan so far is quicker than the compilation time expended so far. Thus, in the [Parameters] stanza of virtuoso.ini, set — [Parameters] ... StopCompilerWhenXOverRunTime = 1</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<div style="display:none;">Configuring Virtuoso for Benchmarking</div>
<p>Here I summarize what you should know about running benchmarks with <a href="http://virtuoso.openlinksw.com" id="link-id0xc152cf0">Virtuoso</a>.</p>

<h2>Physical Memory</h2>

<p>For 8G RAM, in the <code>[Parameters]</code> stanza of <code>virtuoso.ini</code>, set —</p>

<blockquote>
<code>
[Parameters]<br />
...<br />
NumberOfBuffers = 550000
</code>
</blockquote> 
<p>For 16G RAM, double this—</p>

<blockquote>
<code>
[Parameters]<br />
...<br />
NumberOfBuffers = 1100000
</code>
</blockquote> 
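<p>The two settings above suggest a simple proportional rule.  The following is a minimal sketch of that rule, assuming the value keeps scaling linearly with RAM beyond the 8G and 16G points quoted here; that extrapolation and the function names are illustrative assumptions, not documented Virtuoso behavior.</p>

```python
# Sketch: scale NumberOfBuffers linearly with RAM, anchored on the post's
# figure of 550000 buffers for 8G. Linear scaling beyond the two quoted
# points (8G and 16G) is an assumption, not a documented rule.
BUFFERS_FOR_8G = 550000


def number_of_buffers(ram_gb: int) -> int:
    """Return a NumberOfBuffers value proportional to installed RAM."""
    return BUFFERS_FOR_8G * ram_gb // 8


def parameters_stanza(ram_gb: int) -> str:
    """Render the [Parameters] lines for the given RAM size."""
    return f"[Parameters]\nNumberOfBuffers = {number_of_buffers(ram_gb)}\n"


if __name__ == "__main__":
    print(parameters_stanza(16))
```

<p>Calling <code>number_of_buffers(8)</code> and <code>number_of_buffers(16)</code> reproduces the two values given above; other sizes should be treated as starting points to verify on the actual machine.</p>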

<h2>Transaction Isolation</h2>
<p>For most cases, certainly all <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xb7ba270">RDF</a> cases, <i>Read Committed</i> should be the default transaction isolation.  In the <code>[Parameters]</code> stanza of <code>virtuoso.ini</code>, set —</p> 
<blockquote>
<code>
[Parameters]<br />
...<br />
DefaultIsolation = 2 
</code>
</blockquote> 

<h2>Multiuser Workload</h2>

<p>If <a href="http://dbpedia.org/resource/Open_Database_Connectivity" id="link-id0x1a40f308">ODBC</a>, <a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id0x1e003cf8">JDBC</a>, or similarly connected client applications are used, there must be more <code>ServerThreads</code> available than there will be client connections.  In the <code>[Parameters]</code> stanza of <code>virtuoso.ini</code>, set —</p> 
<blockquote>
<code> 
[Parameters]<br />
...<br />
ServerThreads = 100
</code>
</blockquote> 

<p>With web clients (unlike ODBC, JDBC, or similar clients), it may be justified to have fewer <code>ServerThreads</code> than there are concurrent clients.  The <code>MaxKeepAlives</code> should be the maximum number of expected web clients.  This can be more than the <code>ServerThreads</code> count.  In the <code>[HTTPServer]</code> stanza of <code>virtuoso.ini</code>, set —</p> 
<blockquote>
<code> 
[HTTPServer]<br />
...<br />
ServerThreads    = 100 <br />
MaxKeepAlives    = 1000 <br />
KeepAliveTimeout = 10
</code>
</blockquote> 

<p>
<i><b>Note</b> — The <code>[HTTPServer] ServerThreads</code> are taken from the total pool made available by the <code>[Parameters] ServerThreads</code>.  Thus, the <code>[Parameters] ServerThreads</code> should always be at least as large as (and is best set greater than) the <code>[HTTPServer] ServerThreads</code>, and if using the closed-source Commercial Version, should not exceed the licensed thread count.</i>
</p> 
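<p>The thread-count rules above can be checked mechanically before a benchmark run.  This is a hedged sketch using Python's standard <code>configparser</code> on a small sample that mirrors the settings shown here; the helper name <code>check_threads</code> and the sample text are assumptions for illustration, and a real <code>virtuoso.ini</code> may need extra handling.</p>

```python
import configparser

# Sample mirroring the settings in this post; not a complete virtuoso.ini.
SAMPLE_INI = """
[Parameters]
ServerThreads = 100

[HTTPServer]
ServerThreads = 100
MaxKeepAlives = 1000
KeepAliveTimeout = 10
"""


def check_threads(ini_text: str, odbc_clients: int) -> list:
    """Return a list of warnings; an empty list means the rules hold."""
    cfg = configparser.ConfigParser()
    cfg.read_string(ini_text)
    pool = cfg.getint("Parameters", "ServerThreads")
    http = cfg.getint("HTTPServer", "ServerThreads")
    warnings = []
    # HTTP threads are drawn from the [Parameters] pool, so they must fit in it.
    if http > pool:
        warnings.append("[HTTPServer] ServerThreads exceeds [Parameters] ServerThreads")
    # ODBC/JDBC style clients need more threads than there are connections.
    if odbc_clients >= pool:
        warnings.append("ServerThreads should exceed the number of ODBC/JDBC connections")
    return warnings


if __name__ == "__main__":
    print(check_threads(SAMPLE_INI, odbc_clients=50))
```

<p>The check covers only the two rules stated in this post; the licensed thread limit of the commercial version is not modeled.</p>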

<h2>Disk Use</h2>

<p>The basic rule is to use one stripe (file) per distinct physical device (not per file system), using no RAID.  For example, one might stripe a database over 6 files (6 physical disks), with an initial size of 60000 pages (the files will grow as needed).  </p>

<p>For the above described example, in the <code>[Database]</code> stanza of <code>virtuoso.ini</code>, set —</p> 
<blockquote>
<code>
[Database]<br />
...<br />
Striping = 1<br />
MaxCheckpointRemap = 2000000
</code>
</blockquote> 

<p>— and in the <code>[Striping]</code> stanza, on one line per <code>SegmentName</code>, set —</p> 
<blockquote>
<code>
[Striping]<br />
...<br />
Segment1 = 60000 , /virtdev/db/virt-seg1.db = q1 , /data1/db/virt-seg1-str2.db = q2 , /data2/db/virt-seg1-str3.db = q3 , /data3/db/virt-seg1-str4.db = q4 , /data4/db/virt-seg1-str5.db = q5 , /data5/db/virt-seg1-str6.db = q6</code>
</blockquote> 

<p>As shown here, each file is assigned a background IO thread by its <code>= q<i>xxx</i></code> clause.  All files on the same physical device should have the same <code>q<i>xxx</i></code> value.  This does not matter in the example above, because it has only one file per device, and thus only one file per IO queue.</p>
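<p>The rule that files on one device share a <code>q<i>xxx</i></code> value can be sketched as a small generator for the <code>[Striping]</code> line.  The function name and the device labels (<code>sda</code>, <code>sdb</code>) are hypothetical; the labels are used only to group files and never appear in the output.</p>

```python
def striping_line(segment: str, pages: int, files: list) -> str:
    """Build one [Striping] entry from (path, device) pairs.

    Files that share a device get the same q-number, so they share one
    background I/O queue. Device names are only used for grouping.
    """
    queue_of = {}
    parts = [str(pages)]
    for path, device in files:
        # The first file seen on a device allocates the next queue number.
        queue = queue_of.setdefault(device, len(queue_of) + 1)
        parts.append(f"{path} = q{queue}")
    return f"{segment} = " + " , ".join(parts)


if __name__ == "__main__":
    print(striping_line("Segment1", 60000, [
        ("/virtdev/db/virt-seg1.db", "sda"),
        ("/data1/db/virt-seg1-str2.db", "sdb"),
    ]))
```

<p>With one file per device, as in the example above, every file simply gets its own queue number.</p>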

<h2>
<a href="http://dbpedia.org/resource/SQL" id="link-id0xc8b97c0">SQL</a> Optimization</h2>

<p>If queries have lots of joins but access little <a href="http://dbpedia.org/resource/Data" id="link-id0x193b2fa8">data</a>, as with the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x1b283ca0">Berlin SPARQL Benchmark</a>, the SQL compiler must be told not to look for better plans if the best plan so far is quicker than the compilation time expended so far.  Thus, in the <code>[Parameters]</code> stanza of <code>virtuoso.ini</code>, set —</p> 
<blockquote>
<code>
[Parameters]<br />
...<br />
StopCompilerWhenXOverRunTime = 1
</code>
</blockquote> 
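<p>The cut-off rule behind this setting can be illustrated with a toy model.  This is not Virtuoso's compiler; it is a sketch that assumes each candidate plan has an estimated run time and a cost to consider it, and all names and numbers are hypothetical.</p>

```python
def pick_plan(candidates):
    """Pick a plan, stopping once compile time outweighs the best run time.

    candidates: iterable of (name, est_run_ms, compile_ms) in the order a
    hypothetical optimizer would consider them.
    Returns (best_name, compile_spent_ms).
    """
    best_name, best_run_ms = None, float("inf")
    spent_ms = 0.0
    for name, est_run_ms, compile_ms in candidates:
        spent_ms += compile_ms
        if best_run_ms > est_run_ms:
            best_name, best_run_ms = name, est_run_ms
        # Cut-off: further search cannot pay for itself once compiling
        # has already cost more than running the best plan would.
        if spent_ms > best_run_ms:
            break
    return best_name, spent_ms


if __name__ == "__main__":
    plans = [("A", 10.0, 4.0), ("B", 6.0, 4.0), ("C", 1.0, 4.0)]
    print(pick_plan(plans))
```

<p>In this example the search stops after plan B, because 8 ms of compilation already exceeds B's 6 ms estimated run time; plan C is never considered.  That trade-off suits short queries like those in BSBM, where a slightly worse plan costs less than continued optimization.</p>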
</div>]]></content:encoded>
  <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Virtuso Data Space Bot &lt;kidehen@openlinksw.com&gt;</dc:creator>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1410">
  <rss:title>BSBM With Triples and Mapped Relational Data</rss:title>
  <rss:link>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1410</rss:link>
  <wfw:comment xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://virtuoso.openlinksw.com/mt-tb/Http/comments?id=1410</wfw:comment>
  <wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://virtuoso.openlinksw.com/blog/vdb/blog/gems/rsscomment.xml?:id=1410</wfw:commentRss>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-08-06T19:41:50Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">BSBM With Triples and Mapped Relational Data The special contribution of the Berlin SPARQL Benchmark (BSBM) to the RDF world is to raise the question of doing OLTP with RDF. Of course, here we immediately hit the question of comparisons with relational databases. To this effect, BSBM also specifies a relational schema and can generate the data as either triples or SQL inserts. The benchmark effectively simulates the case of exposing an existing RDBMS as RDF. OpenLink Software calls this RDF Views. Oracle is beginning to call this semantic covers. The RDB2RDF XG, a W3C incubator group, has been active in this area since Spring, 2008. But why an OLTP workload with RDF to begin with? We believe this is relevant because RDF promises to be the interoperability factor between potentially all of traditional IS. If data is online for human consumption, it may be online via a SPARQL end-point as well. The economic justification will come from discoverability and from applications integrating multi-source structured data. Online shopping is a fine use case. Warehousing all the world&#39;s publishable data as RDF is not our first preference, nor would it be the publisher&#39;s. Considerations of duplicate infrastructure and maintenance are reason enough. Consequently, we need to show that mapping can outperform an RDF warehouse, which is what we&#39;ll do here. What We Got First, we found that making the query plan took much too long in proportion to the run time. With BSBM this is an issue because the queries have lots of joins but access relatively little data. So we made a faster compiler and along the way retouched the cost model a bit. But the really interesting part with BSBM is mapping relational data to RDF. For us, BSBM is a great way of showing that mapping can outperform even the best triple store. A relational row store is as good as unbeatable with the query mix. 
And when there is a clear mapping, there is no reason the SPARQL could not be directly translated. If Chris Bizer et al launched the mapping ship, we will be the ones to pilot it to harbor! We filled two Virtuoso instances with a BSBM200000 data set, for 100M triples. One was filled with physical triples; the other was filled with the equivalent relational data plus mapping to triples. Performance figures are given in &quot;query mixes per hour&quot;. (An update or follow-on to this post will provide elapsed times for each test run.) With the unmodified benchmark we got: Physical Triples:     1297 qmph Mapped Triples:     3144 qmph In both cases, most of the time was spent on Q6, which looks for products with one of three words in the label. We altered Q6 to use text index for the mapping, and altered the databases accordingly. (There is no such thing as an e-commerce site without a text index, so we are amply justified in making this change.) The following were measured on the second run of a 100 query mix series, single test driver, warm cache. Physical Triples:     5746 qmph Mapped Triples:     7525 qmph We then ran the same with 4 concurrent instances of the test driver. The qmph here is 400 / the longest run time. Physical Triples:     19459 qmph Mapped Triples:     24531 qmph The system used was 64-bit Linux, 2GHz dual-Xeon 5130 (8 cores) with 8G RAM. The concurrent throughputs are a little under 4 times the single thread throughput, which is normal for SMP due to memory contention. The numbers do not evidence significant overhead from thread synchronization. The query compilation represents about 1/3 of total server side CPU. In an actual online application of this type, queries would be parameterized, so the throughputs would be accordingly higher. We used the StopCompilerWhenXOverRunTime = 1 option here to cut needless compiler overhead, the queries being straightforward enough. 
We also see that the advantage of mapping can be further increased by more compiler optimizations, so we expect in the end mapping will lead RDF warehousing by a factor of 4 or so. Suggestions for BSBM Reporting Rules. The benchmark spec should specify a form for disclosure of test run data, TPC style. This includes things like configuration parameters and exact text of queries. There should be accepted variants of query text, as with the TPC. Multiuser operation. The test driver should get a stream number as parameter, so that each client makes a different query sequence. Also, disk performance in this type of benchmark can only be reasonably assessed with a naturally parallel multiuser workload. Add business intelligence. SPARQL has aggregates now, at least with Jena and Virtuoso, so let&#39;s use these. The BSBM business intelligence metric should be a separate metric off the same data. Adding synthetic sales figures would make more interesting queries possible. For example, producing recommendations like &quot;customers who bought this also bought xxx.&quot; For the SPARQL community, BSBM sends the message that one ought to support parameterized queries and stored procedures. This would be a SPARQL protocol extension; the SPARUL syntax should also have a way of calling a procedure. Something like select proc (??, ??) would be enough, where ?? is a parameter marker, like ? in ODBC/JDBC. Add transactions.Especially if we are contrasting mapping vs. storing triples, having an update flow is relevant. In practice, this could be done by having the test driver send web service requests for order entry and the SUT could implement these as updates to the triples or a mapped relational store. This could use stored procedures or logic in an app server. Comments on Query Mix The time of most queries is less than linear to the scale factor. Q6 is an exception if it is not implemented using a text index. 
Without the text index, Q6 will inevitably come to dominate query time as the scale is increased, and thus will make the benchmark less relevant at larger scales. Next We include the sources of our RDF view definitions and other material for running BSBM with our forthcoming Virtuoso Open Source 5.0.8 release. This also includes all the query optimization work done for BSBM. This will be available in the coming days.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<div style="display:none;">BSBM With Triples and Mapped Relational Data</div>
<p>The special contribution of the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id10039db0">Berlin SPARQL Benchmark</a> (<a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id106b2538">BSBM</a>) to the <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id101a75f8">RDF</a> world is to raise the question of doing OLTP with <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0xae54170">RDF</a>.</p>

<p>Of course, here we immediately hit the question of comparisons with relational databases.  To this effect, <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x1e847b08">BSBM</a> also specifies a relational schema and can generate the <a href="http://dbpedia.org/resource/Data" id="link-id1206c378">data</a> as either triples or <a href="http://dbpedia.org/resource/SQL" id="link-id1667f040">SQL</a> inserts.</p>

<p>The benchmark effectively simulates the case of exposing an existing <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id10a93518">RDBMS</a> as RDF.  <a href="http://www.openlinksw.com/dataspace/organization/openlink#this" id="link-id13e46d80">OpenLink Software</a> calls this <i>RDF Views</i>.  <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id12027578">Oracle</a> is beginning to call this <i>semantic covers</i>.  The <a href="http://www.w3.org/2005/Incubator/rdb2rdf/" id="link-id161dc678">RDB2RDF XG</a>, a W3C incubator group, has been active in this area since Spring, 2008.</p>

<h3>But why an OLTP workload with RDF to begin with?</h3>

<p>We believe this is relevant because RDF promises to be the interoperability factor between potentially all traditional information systems.  If <a href="http://dbpedia.org/resource/Data" id="link-id0x1e7119d8">data</a> is online for human consumption, it may be online via a <a href="http://dbpedia.org/resource/SPARQL" id="link-id106a8908">SPARQL</a> end-point as well.  The economic justification will come from discoverability and from applications integrating multi-source structured data.  Online shopping is a fine use case.</p>

<p>Warehousing all the world&#39;s publishable data as RDF is not our first preference, nor would it be the publisher&#39;s.  Considerations of duplicate infrastructure and maintenance are reason enough.  Consequently, we need to show that mapping can outperform an RDF warehouse, which is what we&#39;ll do here.</p>

<h3>What We Got </h3>

<p>First, we found that <a href="http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1400" id="link-id150ea748">making the query plan took much too long</a> in proportion to the run time.  With BSBM this is an issue because the queries have lots of joins but access relatively little data.  So we made a faster compiler and along the way retouched the cost model a bit.</p>

<p>But the really interesting part with BSBM is mapping relational data to RDF.  For us, BSBM is a great way of showing that mapping can outperform even the best triple store.  A relational row store is all but unbeatable on this query mix.  And when there is a clear mapping, there is no reason the <a href="http://dbpedia.org/resource/SPARQL" id="link-id0xae5aff0">SPARQL</a> could not be directly translated.</p>

<p>If Chris Bizer et al launched the mapping ship, we will be the ones to pilot it to harbor!</p>

<p>We filled two <a href="http://virtuoso.openlinksw.com" id="link-id12dbdc70">Virtuoso</a> instances with a BSBM200000 data set, for 100M triples.  One was filled with physical triples; the other was filled with the equivalent relational data plus mapping to triples.  Performance figures are given in &quot;query mixes per hour&quot;.  (An update or follow-on to this post will provide elapsed times for each test run.)</p>

<p>With the unmodified benchmark we got:</p>
<blockquote>
<table>
<tr>
   <td><i>Physical Triples:</i>
   </td>
    <td>   </td>
    <td>1297 qmph</td>
  </tr>
<tr>
   <td><i>Mapped Triples:</i>
   </td>
    <td>   </td>
   <td><b>3144 qmph</b>
   </td>
  </tr>
</table>
</blockquote>
<p>In both cases, most of the time was spent on Q6, which looks for products with one of three words in the label.  We altered Q6 to use a text index for the mapping, and altered the databases accordingly. (There is no such thing as an e-commerce site without a text index, so we are amply justified in making this change.)</p>

<p>The following were measured on the second run of a series of 100 query mixes, with a single test driver and a warm cache.</p>
<blockquote>
<table>
<tr>
   <td><i>Physical Triples:</i>
   </td>
    <td>   </td>
    <td> 5746 qmph</td>
  </tr>
<tr>
   <td><i>Mapped Triples:</i>
   </td>
    <td>   </td>
   <td> <b>7525 qmph</b>
   </td>
  </tr>
</table>
</blockquote>
<p>We then ran the same with 4 concurrent instances of the test driver. The qmph here is the total of 400 query mixes (4 drivers times 100 mixes each) divided by the longest driver run time, in hours.</p>
<blockquote>
<table>
<tr>
   <td><i>Physical Triples:</i>
   </td>
    <td>   </td>
    <td> 19459 qmph</td>
  </tr>
<tr>
   <td><i>Mapped Triples:</i>
   </td>
    <td>   </td>
   <td> <b>24531 qmph</b>
   </td>
  </tr>
</table>
</blockquote>
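<p>The multi-driver qmph calculation can be written out as a short sketch.  The function names are mine, and the 60-second run time in the example is a made-up input chosen for round numbers, not a measurement from these tests.</p>

```python
def qmph(drivers: int, mixes_per_driver: int, longest_run_seconds: float) -> float:
    """Query mixes per hour: all mixes divided by the longest driver run time."""
    total_mixes = drivers * mixes_per_driver
    return total_mixes * 3600.0 / longest_run_seconds


def implied_seconds(total_mixes: int, qmph_value: float) -> float:
    """Inverse: the longest run time that a reported qmph figure implies."""
    return total_mixes * 3600.0 / qmph_value


if __name__ == "__main__":
    # Hypothetical: 4 drivers, 100 mixes each, slowest driver took 60 seconds.
    print(qmph(4, 100, 60.0))
    # Run time implied by the 19459 qmph figure reported for physical triples.
    print(implied_seconds(400, 19459))
```

<p>Using the longest run time rather than the average means the slowest driver determines the reported throughput.</p>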

<p>The system used was 64-bit Linux, 2GHz dual-Xeon 5130 (8 cores) with 8G RAM.  The concurrent throughputs are roughly 3.3 to 3.4 times the single-driver throughput, short of the ideal 4 times, which is normal for SMP due to memory contention.  The numbers show no significant overhead from thread synchronization.</p>

<p>Query compilation accounts for about 1/3 of total server-side CPU. In an actual online application of this type, queries would be parameterized, so the throughputs would be correspondingly higher.  We used the <code>StopCompilerWhenXOverRunTime = 1</code> option here to cut needless compiler overhead, the queries being straightforward enough.</p>

<p>We also see that the advantage of mapping can be further increased by more compiler optimizations, so we expect in the end mapping will lead RDF warehousing by a factor of 4 or so.</p>

<h3>Suggestions for BSBM</h3>

<ul>
 <li>
  <p>
    <b>Reporting Rules.</b> The benchmark spec should specify a form for disclosure of test run data, TPC style.  This includes things like configuration parameters and exact text of queries.  There should be accepted variants of query text, as with the TPC.</p>
 </li>

<li>
  <p>
    <b>Multiuser operation.</b>  The test driver should get a stream number as a parameter, so that each client makes a different query sequence. Also, disk performance in this type of benchmark can only be reasonably assessed with a naturally parallel multiuser workload.</p>
</li>

<li>
  <p>
    <b>Add business intelligence.</b>  SPARQL has aggregates now, at least with <a href="http://jena.sourceforge.net/" id="link-id11a25ac0">Jena</a> and <a href="http://virtuoso.openlinksw.com" id="link-id0xb003180">Virtuoso</a>, so let&#39;s use these.  The BSBM business intelligence metric should be a separate metric computed from the same data.  Adding synthetic sales figures would make more interesting queries possible, such as producing recommendations like &quot;customers who bought this also bought xxx.&quot;</p>
</li>

<li>
  <p>
    <b>For the SPARQL community</b>, BSBM sends the message that one ought to support parameterized queries and stored procedures.  This would be a <a href="http://www.w3.org/TR/rdf-sparql-protocol/" id="link-id109e2448">SPARQL protocol</a> extension; the SPARUL syntax should also have a way of calling a procedure.  Something like <code>select proc (??, ??)</code> would be enough, where <code>??</code> is a parameter marker, like <code>?</code> in <a href="http://dbpedia.org/resource/Open_Database_Connectivity" id="link-id13febf48">ODBC</a>/<a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id120416a8">JDBC</a>.</p>
</li>

<li>
  <p>
    <b>Add transactions.</b>  Especially if we are contrasting mapping vs. storing triples, having an update flow is relevant.  In practice, this could be done by having the test driver send web service requests for order entry; the SUT could implement these as updates to the triples or to a mapped relational store.  This could use stored procedures or logic in an app server.</p>
</li>
</ul>

<h3>Comments on Query Mix</h3>

<p>The run time of most queries grows less than linearly with the scale factor.  Q6 is an exception if it is not implemented using a text index.  Without the text index, Q6 will inevitably come to dominate query time as the scale is increased, and thus will make the benchmark less relevant at larger scales.</p>

<h3>Next</h3>

<p>We will include the sources of our RDF view definitions and other material for running BSBM with our forthcoming Virtuoso Open Source 5.0.8 release.  The release will also include all the query optimization work done for BSBM.  It will be available in the coming days.</p>
</div>]]></content:encoded>
  <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Virtuso Data Space Bot &lt;kidehen@openlinksw.com&gt;</dc:creator>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1403">
  <rss:title>Exploiting the RDF-based Linked Data Web using .NET via LINQ</rss:title>
  <rss:link>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1403</rss:link>
  <wfw:comment xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://virtuoso.openlinksw.com/mt-tb/Http/comments?id=1403</wfw:comment>
  <wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://virtuoso.openlinksw.com/blog/vdb/blog/gems/rsscomment.xml?:id=1403</wfw:commentRss>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-08-01T17:58:19Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Exploiting the RDF-based Linked Data Web using .NET via LINQ Recently OpenLink has been investigating LinqToRdf, an exciting project from Andrew Matthews which aims to bring the Semantic Web to .NET. Because of their language bindings and heritage, existing RDF APIs such as Sesame, Jena and Redland predominantly favour non-Windows clients. Conversely Microsoft&#39;s ADO.NET Data Services provides a Redmond vision of exposing data on the Web but has no support for RDF. LinqToRdf is, as far as we&#39;re aware, the first serious effort to fill this gap and provide a bridge between Windows applications and the Semantic Web. OpenLink has produced a whitepaper Exploiting the RDF-based Linked Data Web using .NET via LINQ which provides a brief overview of LinqToRdf and an example of its use to retrieve data from the MusicBrainz music metadatabase via an OpenLink Virtuoso Quad Store. The document also illustrates the use of the Virtuoso Sponger, an &quot;RDFizer&quot; forming part of the RDF toolset provided with OpenLink Virtuoso Universal Server, to convert the raw MusicBrainz data to RDF on-the-fly. A further aim of the whitepaper is to draw attention to Andrew&#39;s excellent effort and hopefully tempt members of the Semantic Web community to contribute. Andrew was kind enough to incorporate some changes into LinqToRdf in response to OpenLink&#39;s testing. These have been included with major improvements of his own in a new release - LinqToRdf v0.8. Carl Blakeley</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<div>
<div style="display:none;">Exploiting the RDF-based Linked Data Web using .NET via LINQ</div>
    Recently OpenLink has been investigating <a href="http://code.google.com/p/linqtordf/" id="link-id0x20d8a248">LinqToRdf</a>, an exciting project from <a href="http://aabs.wordpress.com" id="link-id0x21f48218">Andrew Matthews</a>, which aims to bring the Semantic Web to .NET. Because of their language bindings and heritage, existing RDF APIs such as Sesame, Jena and Redland predominantly favour non-Windows clients. Conversely, Microsoft&#39;s ADO.NET Data Services provides a Redmond vision of exposing data on the Web but has no support for RDF. LinqToRdf is, as far as we&#39;re aware, the first serious effort to fill this gap and provide a bridge between Windows applications and the Semantic Web.<br /> <br />OpenLink has produced a whitepaper, <a href="http://virtuoso.openlinksw.com/Whitepapers/html/linqtordf/linqtordf1.htm" id="link-id0x21f47348">Exploiting the RDF-based Linked Data Web using .NET via LINQ</a>, which provides a brief overview of LinqToRdf and an example of its use to retrieve data from the <a href="http://musicbrainz.org" id="link-id0x21f49a88">MusicBrainz</a> music metadatabase via an OpenLink Virtuoso Quad Store. The document also illustrates the use of the <a href="http://virtuoso.openlinksw.com/Whitepapers/pdf/sponger_whitepaper_10102007.pdf" id="link-id0x21f92758">Virtuoso Sponger</a>, an &quot;RDFizer&quot; forming part of the RDF toolset provided with OpenLink Virtuoso Universal Server, to convert the raw MusicBrainz data to RDF on the fly. A further aim of the whitepaper is to draw attention to Andrew&#39;s excellent effort and hopefully tempt members of the Semantic Web community to contribute.<br /> <br />Andrew was kind enough to incorporate some changes into LinqToRdf in response to OpenLink&#39;s testing. These have been included with major improvements of his own in a new release, <a href="http://aabs.wordpress.com/2008/08/01/announcing-linqtordf-v08/" id="link-id0x21f465b8">LinqToRdf v0.8</a>.<br /> <br />Carl Blakeley<br />      
</div>]]></content:encoded>
  <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Virtuso Data Space Bot &lt;kidehen@openlinksw.com&gt;</dc:creator>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1401">
  <rss:title>Virtuoso Optimizations for the Berlin SPARQL Benchmark </rss:title>
  <rss:link>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1401</rss:link>
  <wfw:comment xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://virtuoso.openlinksw.com/mt-tb/Http/comments?id=1401</wfw:comment>
  <wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://virtuoso.openlinksw.com/blog/vdb/blog/gems/rsscomment.xml?:id=1401</wfw:commentRss>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2008-07-30T18:52:11Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Virtuoso Optimizations for the Berlin SPARQL Benchmark We had a look at Chris Bizer&#39;s initial results with the Berlin SPARQL Benchmark (BSBM) on Virtuoso. The first results were rather bad, as nearly all of the run time was spent optimizing the SPARQL statements and under 10% actually running them. So I spent a couple of days on the SPARQL/SQL compiler, to the effect of making it do a better guess of initial execution plan and streamlining some operations. In fact, many of the queries in BSBM are not particularly sensitive to execution plan, as they access a very small portion of the database. So to close the matter, I put in a flag that makes the SQL compiler give up on devising new plans if the time of the best plan so far is less than the time spent compiling so far. With these changes, available now as a diff on top of 5.0.7, we run quite well, several times better than initially. With the compiler time cut-off in place (ini parameter StopCompilerWhenXOverRunTime = 1), we get the following times, output from the BSBM test driver: Starting test... 
0: 1031.22 ms, total: 1151 ms 1: 982.89 ms, total: 1040 ms 2: 923.27 ms, total: 968 ms 3: 898.37 ms, total: 932 ms 4: 855.70 ms, total: 865 ms Scale factor: 10000 Number of query mix runs: 5 times min/max Query mix runtime: 0.8557 s / 1.0312 s Total runtime: 4.691 seconds QMpH: 3836.77 query mixes per hour CQET: 0.93829 seconds average runtime of query mix CQET (geom.): 0.93625 seconds geometric mean runtime of query mix Metrics for Query 1: Count: 5 times executed in whole run AQET: 0.012212 seconds (arithmetic mean) AQET(geom.): 0.009934 seconds (geometric mean) QPS: 81.89 Queries per second minQET/maxQET: 0.00684000s / 0.03115700s Average result count: 7.0 min/max result count: 3 / 10 Metrics for Query 2: Count: 35 times executed in whole run AQET: 0.030490 seconds (arithmetic mean) AQET(geom.): 0.029776 seconds (geometric mean) QPS: 32.80 Queries per second minQET/maxQET: 0.02467300s / 0.06753000s Average result count: 22.5 min/max result count: 15 / 30 Metrics for Query 3: Count: 5 times executed in whole run AQET: 0.006947 seconds (arithmetic mean) AQET(geom.): 0.006905 seconds (geometric mean) QPS: 143.95 Queries per second minQET/maxQET: 0.00580000s / 0.00795100s Average result count: 4.0 min/max result count: 0 / 10 Metrics for Query 4: Count: 5 times executed in whole run AQET: 0.008858 seconds (arithmetic mean) AQET(geom.): 0.008829 seconds (geometric mean) QPS: 112.89 Queries per second minQET/maxQET: 0.00804400s / 0.01019500s Average result count: 3.4 min/max result count: 0 / 10 Metrics for Query 5: Count: 5 times executed in whole run AQET: 0.087542 seconds (arithmetic mean) AQET(geom.): 0.087327 seconds (geometric mean) QPS: 11.42 Queries per second minQET/maxQET: 0.08165600s / 0.09889200s Average result count: 5.0 min/max result count: 5 / 5 Metrics for Query 6: Count: 5 times executed in whole run AQET: 0.131222 seconds (arithmetic mean) AQET(geom.): 0.131216 seconds (geometric mean) QPS: 7.62 Queries per second minQET/maxQET: 0.12924200s / 
0.13298200s Average result count: 3.6 min/max result count: 3 / 5 Metrics for Query 7: Count: 20 times executed in whole run AQET: 0.043601 seconds (arithmetic mean) AQET(geom.): 0.040890 seconds (geometric mean) QPS: 22.94 Queries per second minQET/maxQET: 0.01984400s / 0.06012600s Average result count: 26.4 min/max result count: 5 / 96 Metrics for Query 8: Count: 10 times executed in whole run AQET: 0.018168 seconds (arithmetic mean) AQET(geom.): 0.016205 seconds (geometric mean) QPS: 55.04 Queries per second minQET/maxQET: 0.01097600s / 0.05066900s Average result count: 12.8 min/max result count: 6 / 20 Metrics for Query 9: Count: 20 times executed in whole run AQET: 0.043813 seconds (arithmetic mean) AQET(geom.): 0.043807 seconds (geometric mean) QPS: 22.82 Queries per second minQET/maxQET: 0.04274900s / 0.04504100s Average result count: 0.0 min/max result count: 0 / 0 Metrics for