<?xml version="1.0" encoding="UTF-8" ?>
<!--ATOM based XML document generated By OpenLink Virtuoso-->
<atom:feed xmlns:atom="http://www.w3.org/2005/Atom">
<atom:id>http://virtuoso.openlinksw.com/blog/vdb/blog/</atom:id>
<atom:title>OpenLink Virtuoso (Product Blog)</atom:title>
<atom:link href="http://virtuoso.openlinksw.com/blog/vdb/blog/" type="text/html" rel="alternate" />
<atom:link href="http://virtuoso.openlinksw.com/weblogs/virtuoso/gems/atom.xml" type="application/atom+xml" rel="self" />
<atom:subtitle>A great place to track Virtuoso&#39;s rapid evolution.</atom:subtitle>
 <atom:author>
  <atom:name>Virtuoso Data Space Bot</atom:name>
  <atom:email>kidehen@openlinksw.com</atom:email>
  </atom:author>
<atom:updated>2008-10-11T00:22:14Z</atom:updated>
<atom:generator>Virtuoso Universal Server 05.08.3034</atom:generator>
<atom:rights>OpenLink Software 1998-2006</atom:rights>
<atom:category term="virtual database" />
<atom:category term="enterprise information integration" />
<atom:category term="odbc" />
<atom:category term="jdbc" />
<atom:category term="sql" />
<atom:category term="soa" />
<atom:category term="web services" />
<atom:category term="soap" />
<atom:category term="federated" />
<atom:category term="database" />
<atom:category term="rdf" />
<atom:category term="sparql" />
<atom:category term="eii" />
<atom:category term="bpel" />
<atom:category term="bpm" />
<atom:category term="webdav" />
<atom:category term="http" />
<atom:category term="xml" />
<atom:category term="xquery" />
<atom:category term="xslt" />
<atom:logo>http://virtuoso.openlinksw.com/weblog/public/images/vbloglogo.gif</atom:logo>
 <atom:entry>
  <atom:title>Virtuoso Cluster Paper Update</atom:title>
  <atom:id>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1451</atom:id>
  <atom:link href="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1451" type="text/html" rel="alternate" />
  <atom:published>2008-10-02T10:02:33Z</atom:published>
  <atom:content type="html">&lt;p&gt;An updated version of the paper about &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0xc0abc50&quot;&gt;Virtuoso&lt;/a&gt; Cluster is available at &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/2008webscale_rdf.pdf&quot; id=&quot;link-id16459248&quot;&gt;2008webscale_rdf.pdf&lt;/a&gt; &lt;/p&gt;</atom:content>
  <atom:author>
    <atom:name>Virtuoso Data Space Bot</atom:name>
    <atom:email>kidehen@openlinksw.com</atom:email>
   </atom:author>
  <atom:category term="database" />
  <atom:category term="databases" />
  <atom:category term="cluster" />
  <atom:category term="virtuoso" />
  <atom:updated>2008-10-03T04:38:06.000-04:00</atom:updated>
 </atom:entry>
 <atom:entry>
  <atom:title>Virtuoso Update, Billion Triples and Outlook</atom:title>
  <atom:id>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1450</atom:id>
  <atom:link href="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1450" type="text/html" rel="alternate" />
  <atom:published>2008-10-02T10:02:32Z</atom:published>
  <atom:content type="html">&lt;div&gt; &lt;div style=&quot;display:none;&quot;&gt;Virtuoso Update, Billion Triples and Outlook&lt;/div&gt; &lt;p&gt;I will say a few things about what we have been doing and where we can go.&lt;/p&gt; &lt;p&gt;Firstly, we have a fairly scalable platform with &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x1aa82dc0&quot;&gt;Virtuoso&lt;/a&gt; 6 Cluster. It was most recently tested with the workload discussed in the previous &lt;a href=&quot;http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1445&quot; id=&quot;link-id1638a5b8&quot;&gt;Billion Triples post&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;There is an updated version of &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/2008webscale_rdf.pdf&quot; id=&quot;link-id16280a68&quot;&gt;the paper about this&lt;/a&gt;. This will be presented at the web scale workshop of ISWC 2008 in Karlsruhe.&lt;/p&gt; &lt;p&gt;Right now, we are polishing some things in Virtuoso 6 -- some optimizations for smarter balancing of interconnect traffic over multiple network interfaces, and some more &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x1abd3f38&quot;&gt;SQL&lt;/a&gt; optimizations specific to &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x1adbe410&quot;&gt;RDF&lt;/a&gt;. The must-have basics, like parallel running of sub-queries and aggregates, and all-around unrolling of loops of every kind into large partitioned batches, are all there and proven to work.&lt;/p&gt; &lt;p&gt;We spent a lot of time around the &lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id0x1aaa0e78&quot;&gt;Berlin SPARQL Benchmark&lt;/a&gt; story, so we got to the more advanced stuff like the &lt;a href=&quot;http://challenge.semanticweb.org/&quot; id=&quot;link-id0x1a860a50&quot;&gt;Billion Triples Challenge&lt;/a&gt; rather late. 
We did along the way also run &lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id0x1a27f2a8&quot;&gt;BSBM&lt;/a&gt; with an &lt;a href=&quot;http://dbpedia.org/resource/Oracle_Database&quot; id=&quot;link-id0x1ad5c918&quot;&gt;Oracle&lt;/a&gt; back-end, with Virtuoso mapping &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x1cf0e4a0&quot;&gt;SPARQL&lt;/a&gt; to SQL. This merits its own analysis in the near future. This will be the basic how-to of mapping OLTP systems to RDF. Depending on the case, one can use this for lookups in real-time or ETL.&lt;/p&gt; &lt;p&gt;RDF will deliver value in complex situations. An example of a complex relational mapping use case came from Ordnance Survey, presented at the &lt;a href=&quot;http://www.w3.org/2005/Incubator/rdb2rdf/&quot; id=&quot;link-id0x1ab96bb0&quot;&gt;RDB2RDF XG&lt;/a&gt;. Examples of complex warehouses include the &lt;a href=&quot;http://neurocommons.org/page/Main_Page&quot; id=&quot;link-id0x1adb2db0&quot;&gt;Neurocommons&lt;/a&gt; database, the Billion Triples Challenge, and the &lt;a href=&quot;http://www.garlik.com/&quot; id=&quot;link-id0x1925c7b0&quot;&gt;Garlik DataPatrol&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;In comparison, the Berlin workload is really simple and one where RDF is not at its best, as amply discussed on the &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id0x1c6d1480&quot;&gt;Linked Data&lt;/a&gt; forum. BSBM&amp;#39;s primary value is as a demonstrator for the basic mapping tasks that will be repeated over and over for pretty much any online system when presence on the &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x1a937400&quot;&gt;data&lt;/a&gt; web becomes as indispensable as presence on the HTML web.&lt;/p&gt; &lt;p&gt;I will now talk about the complex warehouse/web-harvesting side. 
I will come to the mapping in another post.&lt;/p&gt; &lt;p&gt;Now, all the things shown in the &lt;a href=&quot;http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1445&quot; id=&quot;link-id14de1d18&quot;&gt;Billion Triples post&lt;/a&gt; can be done with a relational system specially built for each purpose. Since we are a general purpose &lt;a href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id0x1a457c70&quot;&gt;RDBMS&lt;/a&gt;, we use this capability where it makes sense. For example, storing statistics about which tags or interests occur with which other tags or interests as RDF blank nodes makes no sense. We do not even run the experiment; we know ahead of time that the result is at least an order of magnitude in favor of the relational row-oriented solution in both space and time.&lt;/p&gt; &lt;p&gt;Whenever there is a data structure specially made for answering one specific question, like joint occurrence of tags, RDB with mapping is the way to go. With Virtuoso, this can coexist perfectly well with physical triples, and can still be accessed in SPARQL and mixed with triples. This is territory that we have not extensively covered yet, but we will be giving some examples of this later.&lt;/p&gt; &lt;p&gt;The real value of RDF is in agility. When there is no time to design and load a new warehouse for every new question, RDF is unparalleled. Also SPARQL, once it has the necessary extensions for aggregation and sub-queries, is nicer than SQL, especially when we have sub-classes and sub-properties, transitivity, and &amp;quot;same as&amp;quot; enabled. These things have some run-time cost, and if there is a report one is hitting absolutely all the time, then chances are that resolving terms and identity at load-time and using materialized views in SQL is the reasonable thing. 
If one is inventing a new report every time, then RDF has a lot more convenience and flexibility.&lt;/p&gt; &lt;p&gt;We are just beginning to explore what we can do with data sets such as the online conversation space, linked data, and the open ontologies of &lt;a href=&quot;http://umbel.org/about/&quot; id=&quot;link-id0x1aa5ea18&quot;&gt;UMBEL&lt;/a&gt; and &lt;a href=&quot;http://dbpedia.org/resource/Cyc&quot; id=&quot;link-id0x1a631a20&quot;&gt;OpenCyc&lt;/a&gt;. It is safe to say that we can run with real world scale without loss of query expressivity. There is an incremental cost for performance but this is not prohibitive. Serving the whole billion triples set from memory would cost about $32K in hardware. $8K will do if one can wait for disk part of the time. One can use these numbers as a basis for costing larger systems. For online search applications, one will note that running the indexes pretty much from memory is necessary for flat response time. For back office analytics this is not necessarily as critical. It all depends on the use case.&lt;/p&gt; &lt;p&gt;We expect to be able to combine geography, social proximity, subject matter, and &lt;a href=&quot;http://dbpedia.org/resource/Named_entity_recognition&quot; id=&quot;link-id0x1aebdcc8&quot;&gt;named entities&lt;/a&gt;, with hierarchical taxonomies and traditional full text, and to present this through a simple user interface.&lt;/p&gt; &lt;p&gt;We expect to do this with online response times if we have a limited set of starting points and do not navigate more than 2 or 3 steps from each starting point. An example would be to have a full text pattern and news group, and get the cloud of interests from the authors of matching posts. 
Another would be to make a faceted view of the properties of the 1000 people most closely connected to one person.&lt;/p&gt; &lt;p&gt;Queries like finding the fastest online responders to questions about romance across the global board-scape, or finding the person who initiates the most long running conversations about crime, take a bit longer but are entirely possible.&lt;/p&gt; &lt;p&gt;The genius of RDF is to be able to do these things within a general purpose database, ad hoc, in a single query language, mostly without materializing intermediate results. Any of these things could be done with arbitrary efficiency in a custom built system. But what is special now is that the cost of access to this type of &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id0x1ab88490&quot;&gt;information&lt;/a&gt; and far beyond drops dramatically as we can do these things in a far less labor intensive way, with a general purpose system, with no redesigning and reloading of warehouses at every turn. The query becomes a commodity.&lt;/p&gt; &lt;p&gt;Still, one must know what to ask. In this respect, the self-describing nature of RDF is unmatched. A query like &lt;i&gt;list the top 10 attributes with the most distinct values for all persons&lt;/i&gt; cannot be done in SQL. SQL simply does not allow the columns to be variable.&lt;/p&gt; &lt;p&gt;Further, we can accept queries as text, the way people are used to supplying them, and use structure for drill-down or result-relevance, and also recognize named entities and subject matter concepts in query text. Very simple NLP will go a long way towards keeping SPARQL out of the user experience.&lt;/p&gt; &lt;p&gt;The other way of keeping query complexity hidden is to publish hand-written SPARQL as parameter-fed canned reports.&lt;/p&gt; &lt;p&gt;Between now and ISWC 2008, the last week of October, we will put out demos showing some of these things. Stay tuned.&lt;/p&gt; &lt;/div&gt;</atom:content>
  <atom:author>
    <atom:name>Virtuoso Data Space Bot</atom:name>
    <atom:email>kidehen@openlinksw.com</atom:email>
   </atom:author>
  <atom:category term="database" />
  <atom:category term="databases" />
  <atom:category term="cluster" />
  <atom:category term="benchmarking" />
  <atom:category term="scalability" />
  <atom:category term="rdf" />
  <atom:category term="oracle" />
  <atom:category term="semanticweb" />
  <atom:category term="web30" />
  <atom:category term="sparql" />
  <atom:category term="howto" />
  <atom:category term="virtuoso" />
  <atom:category term="dataspace" />
  <atom:updated>2008-10-02T12:47:07.4000-04:00</atom:updated>
 </atom:entry>
 <atom:entry>
  <atom:title>OpenLink Software&#39;s Virtuoso Submission to the Billion Triples Challenge</atom:title>
  <atom:id>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1446</atom:id>
  <atom:link href="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1446" type="text/html" rel="alternate" />
  <atom:published>2008-09-30T16:24:34Z</atom:published>
  <atom:content type="html">&lt;div&gt; &lt;h2&gt;Introduction&lt;/h2&gt; &lt;p&gt;We use &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0xb03e418&quot;&gt;Virtuoso&lt;/a&gt; 6 Cluster Edition to demonstrate the following:&lt;/p&gt; &lt;ul&gt; &lt;li&gt;Text and structured &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id0xbd9dae8&quot;&gt;information&lt;/a&gt; based lookups&lt;/li&gt; &lt;li&gt;Analytics queries&lt;/li&gt; &lt;li&gt;Analysis of co-occurrence of features like interests and tags.&lt;/li&gt; &lt;li&gt;Dealing with identity of multiple IRI&amp;#39;s (&lt;a href=&quot;http://dbpedia.org/resource/Web_Ontology_Language&quot; id=&quot;link-id0xb383dd8&quot;&gt;owl&lt;/a&gt;:sameAs)&lt;/li&gt; &lt;/ul&gt; &lt;p&gt;The demo is based on a set of canned &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0xbda6298&quot;&gt;SPARQL&lt;/a&gt; queries that can be invoked using the &lt;a href=&quot;http://ode.openlinksw.com/&quot; id=&quot;link-id0xbb292f0&quot;&gt;OpenLink Data Explorer&lt;/a&gt; (&lt;a href=&quot;http://ode.openlinksw.com/&quot; id=&quot;link-id0xc263528&quot;&gt;ODE&lt;/a&gt;) Firefox extension.&lt;/p&gt; &lt;p&gt;The demo queries can also be run directly against the SPARQL end point.&lt;/p&gt; &lt;p&gt;The demo is being worked on at the time of submission and may be shown online by appointment.&lt;/p&gt; &lt;p&gt;Automatic annotation of the &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0xa173378&quot;&gt;data&lt;/a&gt; based on &lt;a href=&quot;http://dbpedia.org/resource/Named_entity_recognition&quot; id=&quot;link-id0xbdda558&quot;&gt;named entity extraction&lt;/a&gt; is being worked on at the time of this submission. 
By the time of ISWC 2008 the set of sample queries will be enhanced with queries based on extracted &lt;a href=&quot;http://dbpedia.org/resource/Named_entity_recognition&quot; id=&quot;link-id0xa66fbe0&quot;&gt;named entities&lt;/a&gt; and their relationships in the &lt;a href=&quot;http://umbel.org/about/&quot; id=&quot;link-id0xa06e2c8&quot;&gt;UMBEL&lt;/a&gt; and OpenCyc ontologies. &lt;/p&gt; &lt;p&gt;Examples involving owl:sameAs are also being added, as are examples with similarity metrics and search hit scores.&lt;/p&gt; &lt;h2&gt;The Data&lt;/h2&gt; &lt;p&gt;The database consists of the billion triples data sets and some additions like UMBEL. The Freebase extract is also newer than the challenge original.&lt;/p&gt; &lt;p&gt;The triple count is 1115 million.&lt;/p&gt; &lt;p&gt;In the case of web harvested resources, the data is loaded in one graph per resource.&lt;/p&gt; &lt;p&gt;In the case of larger data sets like &lt;a href=&quot;http://dbpedia.org/resource/DBpedia&quot; id=&quot;link-id0xc2bf770&quot;&gt;DBpedia&lt;/a&gt; or the US census, all triples from the same source share a data-set-specific graph.&lt;/p&gt; &lt;p&gt;All string literals are additionally indexed in a full text index. No stop words are used.&lt;/p&gt; &lt;p&gt;Most queries do not specify a graph. Thus they are evaluated against the union of all the graphs in the database. The indexing scheme is SPOG, GPOS, POGS, OPGS. All indices ending in S are bitmap indices. &lt;/p&gt; &lt;h2&gt;The Queries &lt;/h2&gt; &lt;p&gt;The demo uses Virtuoso SPARQL extensions in most queries. These extensions consist, on the one hand, of well-known &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0xaf8cb40&quot;&gt;SQL&lt;/a&gt; features like aggregation with grouping and existence and value subqueries and, on the other, of &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0xafdceb8&quot;&gt;RDF&lt;/a&gt;-specific features. 
The latter include run-time RDFS and OWL inferencing support, backward chaining of subclasses, and transitivity. &lt;/p&gt; &lt;h3&gt;Simple Lookups&lt;/h3&gt; &lt;pre&gt;sparql select ?s ?p (bif:search_excerpt (bif:vector (&amp;#39;semantic&amp;#39;, &amp;#39;web&amp;#39;), ?o)) where { ?s ?p ?o . filter (bif:contains (?o, &amp;quot;&amp;#39;semantic web&amp;#39;&amp;quot;)) } limit 10 ; &lt;/pre&gt; &lt;p&gt;This looks up triples with &amp;quot;semantic web&amp;quot; in the object and makes a search hit summary of the literal, highlighting the search terms. &lt;/p&gt; &lt;pre&gt;sparql select ?tp count(*) where { ?s ?p2 ?o2 . ?o2 a ?tp . ?s foaf:nick ?o . filter (bif:contains (?o, &amp;quot;plaid_skirt&amp;quot;)) } group by ?tp order by desc 2 limit 40 ; &lt;/pre&gt; &lt;p&gt;This looks at what sorts of things are referenced by the properties of the foaf handle plaid_skirt.&lt;/p&gt; &lt;p&gt;What are these things called?&lt;/p&gt; &lt;pre&gt;sparql select ?lbl count(*) where { ?s ?p2 ?o2 . ?o2 rdfs:label ?lbl . ?s foaf:nick ?o . filter (bif:contains (?o, &amp;quot;plaid_skirt&amp;quot;)) } group by ?lbl order by desc 2 ; &lt;/pre&gt; &lt;p&gt;Many of these things do not have an rdfs:label. Let us use a more general concept of label which groups dc:title, foaf:name and other name-like properties together. The subproperties are resolved at run time; there is no materialization. &lt;/p&gt; &lt;pre&gt;sparql define input:inference &amp;#39;b3s&amp;#39; select ?lbl count(*) where { ?s ?p2 ?o2 . ?o2 b3s:label ?lbl . ?s foaf:nick ?o . filter (bif:contains (?o, &amp;quot;plaid_skirt&amp;quot;)) } group by ?lbl order by desc 2 ; &lt;/pre&gt; &lt;p&gt;We can list sources by the topics they contain. Below we look for graphs that mention terrorist bombing. &lt;/p&gt; &lt;pre&gt;sparql select ?g count(*) where { graph ?g { ?s ?p ?o . 
filter (bif:contains (?o, &amp;quot;&amp;#39;terrorist bombing&amp;#39;&amp;quot;)) } } group by ?g order by desc 2 ; &lt;/pre&gt; &lt;p&gt;Now some web 2.0 tagging of search results. The &lt;a href=&quot;http://dbpedia.org/resource/Tag&quot; id=&quot;link-id0xa8b89f8&quot;&gt;tag&lt;/a&gt; cloud of &amp;quot;computer&amp;quot;:&lt;/p&gt; &lt;pre&gt;sparql select ?lbl count (*) where { ?s ?p ?o . ?o bif:contains &amp;quot;computer&amp;quot; . ?s sioc:topic ?tg . optional { ?tg rdfs:label ?lbl } } group by ?lbl order by desc 2 limit 40 ; &lt;/pre&gt; &lt;p&gt;This query will find the posters who talk the most about sex.&lt;/p&gt; &lt;pre&gt;sparql select ?auth count (*) where { ?d dc:creator ?auth . ?d ?p ?o filter (bif:contains (?o, &amp;quot;sex&amp;quot;)) } group by ?auth order by desc 2 ; &lt;/pre&gt; &lt;h3&gt;Analytics &lt;/h3&gt; &lt;p&gt;We look for people who are joined by having relatively uncommon interests but do not know each other.&lt;/p&gt; &lt;pre&gt;sparql select ?i ?cnt ?n1 ?n2 ?p1 ?p2 where { { select ?i count (*) as ?cnt where { ?p foaf:interest ?i } group by ?i } filter ( ?cnt &amp;gt; 1 &amp;amp;&amp;amp; ?cnt &amp;lt; 10) . ?p1 foaf:interest ?i . ?p2 foaf:interest ?i . filter (?p1 != ?p2 &amp;amp;&amp;amp; !bif:exists ((select (1) where {?p1 foaf:knows ?p2 })) &amp;amp;&amp;amp; !bif:exists ((select (1) where {?p2 foaf:knows ?p1 }))) . ?p1 foaf:nick ?n1 . ?p2 foaf:nick ?n2 . } order by ?cnt limit 50 ; &lt;/pre&gt; &lt;p&gt;The query takes a fairly long time, mostly spent counting how many people share each interest across 25M interest triples. It then takes people that share the interest and checks that neither claims to know the other. It then sorts the results rarest interest first. The query can be written more efficiently but is here just to show that database-wide scans of the population are possible ad hoc. &lt;/p&gt; &lt;p&gt;Now we go to SQL to make a tag co-occurrence matrix. 
This can be used for showing a Technorati-style related tags line at the bottom of a search result page. This showcases the use of SQL together with SPARQL. The half-matrix of tags t1, t2 with the co-occurrence count at the intersection is much more efficiently done in SQL, especially since it gets updated as the data changes. This is an example of materialized intermediate results based on warehoused RDF. &lt;/p&gt; &lt;pre&gt;create table tag_count (tcn_tag iri_id_8, tcn_count int, primary key (tcn_tag)); alter index tag_count on tag_count partition (tcn_tag int (0hexffff00)); create table tag_coincidence (tc_t1 iri_id_8, tc_t2 iri_id_8, tc_count int, tc_t1_count int, tc_t2_count int, primary key (tc_t1, tc_t2)); alter index tag_coincidence on tag_coincidence partition (tc_t1 int (0hexffff00)); create index tc2 on tag_coincidence (tc_t2, tc_t1) partition (tc_t2 int (0hexffff00)); &lt;/pre&gt; &lt;p&gt;How many times is each topic mentioned?&lt;/p&gt; &lt;pre&gt; insert into tag_count select * from (sparql define output:valmode &amp;quot;LONG&amp;quot; select ?t count (*) as ?cnt where { ?s sioc:topic ?t } group by ?t) xx option (quietcast); &lt;/pre&gt; &lt;p&gt;Take all t1, t2 where t1 and t2 are tags of the same subject; store only the permutation where the internal id of t1 &amp;lt; that of t2.&lt;/p&gt; &lt;pre&gt;insert into tag_coincidence (tc_t1, tc_t2, tc_count) select &amp;quot;t1&amp;quot;, &amp;quot;t2&amp;quot;, cnt from (select &amp;quot;t1&amp;quot;, &amp;quot;t2&amp;quot;, count (*) as cnt from (sparql define output:valmode &amp;quot;LONG&amp;quot; select ?t1 ?t2 where { ?s sioc:topic ?t1 . ?s sioc:topic ?t2 }) tags where &amp;quot;t1&amp;quot; &amp;lt; &amp;quot;t2&amp;quot; group by &amp;quot;t1&amp;quot;, &amp;quot;t2&amp;quot;) xx where isiri_id (&amp;quot;t1&amp;quot;) and isiri_id (&amp;quot;t2&amp;quot;) option (quietcast); &lt;/pre&gt; &lt;p&gt;Now put the individual occurrence counts into the same table with the co-occurrence. 
This denormalization makes the related tags lookup faster. &lt;/p&gt; &lt;pre&gt;update tag_coincidence set tc_t1_count = (select tcn_count from tag_count where tcn_tag = tc_t1), tc_t2_count = (select tcn_count from tag_count where tcn_tag = tc_t2); &lt;/pre&gt; &lt;p&gt;Now each tag_coincidence row has the joint occurrence count and individual occurrence counts. A single select will return a Technorati-style related tags listing. &lt;/p&gt; &lt;p&gt;To show the &lt;a href=&quot;http://dbpedia.org/resource/Uniform_Resource_Identifier&quot; id=&quot;link-id0x9d4bc60&quot;&gt;URI&lt;/a&gt;s of the tags: &lt;/p&gt; &lt;pre&gt;select top 10 id_to_iri (tc_t1), id_to_iri (tc_t2), tc_count from tag_coincidence order by tc_count desc; &lt;/pre&gt; &lt;h3&gt;Social Networks &lt;/h3&gt; &lt;p&gt;We look at what interests people have. &lt;/p&gt; &lt;pre&gt;sparql select ?o ?cnt where { { select ?o count (*) as ?cnt where { ?s foaf:interest ?o } group by ?o } filter (?cnt &amp;gt; 100) } order by desc 2 limit 100 ; &lt;/pre&gt; &lt;p&gt;Now the same for the Harry Potter fans. &lt;/p&gt; &lt;pre&gt;sparql select ?i2 count (*) where { ?p foaf:interest &amp;lt;http://www.livejournal.com/interests.bml?int=harry+potter&amp;gt; . ?p foaf:interest ?i2 } group by ?i2 order by desc 2 limit 20 ; &lt;/pre&gt; &lt;p&gt;We see whether foaf:knows relations are symmetrical. We return the top n people that others claim to know without being reciprocally known.&lt;/p&gt; &lt;pre&gt;sparql select ?celeb, count (*) where { ?claimant foaf:knows ?celeb . 
filter (!bif:exists ((select (1) where { ?celeb foaf:knows ?claimant }))) } group by ?celeb order by desc 2 limit 10 ; &lt;/pre&gt; &lt;p&gt;We look for a well connected person to start from.&lt;/p&gt; &lt;pre&gt;sparql select ?p count (*) where { ?p foaf:knows ?k } group by ?p order by desc 2 limit 50 ; &lt;/pre&gt; &lt;p&gt;We look for the most connected of the many online identities of Stefan Decker.&lt;/p&gt; &lt;pre&gt;sparql select ?sd count (distinct ?xx) where { ?sd a foaf:Person . ?sd ?name ?ns . filter (bif:contains (?ns, &amp;quot;&amp;#39;Stefan Decker&amp;#39;&amp;quot;)) . ?sd foaf:knows ?xx } group by ?sd order by desc 2 ; &lt;/pre&gt; &lt;p&gt;We count the transitive closure of Stefan Decker&amp;#39;s connections. &lt;/p&gt; &lt;pre&gt;sparql select count (*) where { { select * where { ?s foaf:knows ?o } } option (transitive, t_distinct, t_in(?s), t_out(?o)) . filter (?s = &amp;lt;mailto:stefan.decker@deri.org&amp;gt;) } ; &lt;/pre&gt; &lt;p&gt;Now we do the same while following owl:sameAs links.&lt;/p&gt; &lt;pre&gt;sparql define input:same-as &amp;quot;yes&amp;quot; select count (*) where { { select * where { ?s foaf:knows ?o } } option (transitive, t_distinct, t_in(?s), t_out(?o)) . filter (?s = &amp;lt;mailto:stefan.decker@deri.org&amp;gt;) } ; &lt;/pre&gt; &lt;h2&gt;Demo System&lt;/h2&gt; &lt;p&gt;The system runs on Virtuoso 6 Cluster Edition. The database is partitioned into 12 partitions, each served by a distinct server process. The system demonstrated hosts these 12 servers on 2 machines, each with 2 x Xeon 5345, 16GB memory, and 4 SATA disks. For scaling, the processes and corresponding partitions can be spread over a larger number of machines. If each process ran on its own server with 16GB RAM, the whole data set could be served from memory. This is desirable for search engine or fast analytics applications. Most of the demonstrated queries run in memory on second invocation. 
The timing difference between first and second run is easily an order of magnitude. &lt;/p&gt; &lt;/div&gt;</atom:content>
  <atom:author>
    <atom:name>Virtuoso Data Space Bot</atom:name>
    <atom:email>kidehen@openlinksw.com</atom:email>
   </atom:author>
  <atom:category term="database" />
  <atom:category term="databases" />
  <atom:category term="cluster" />
  <atom:category term="webservices" />
  <atom:category term="web2.0" />
  <atom:category term="web20" />
  <atom:category term="rdf" />
  <atom:category term="semanticweb" />
  <atom:category term="foaf" />
  <atom:category term="sioc" />
  <atom:category term="sparql" />
  <atom:category term="socialnetworking" />
  <atom:category term="openlink" />
  <atom:category term="virtuoso" />
  <atom:updated>2008-10-03T06:20:48.94000-04:00</atom:updated>
 </atom:entry>
 <atom:entry>
  <atom:title>Requirements for Relational-to-RDF Mapping</atom:title>
  <atom:id>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1436</atom:id>
  <atom:link href="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1436" type="text/html" rel="alternate" />
  <atom:published>2008-09-08T09:41:25Z</atom:published>
  <atom:content type="html">&lt;div&gt; &lt;div style=&quot;display:none;&quot;&gt;Requirements for Relational-to-RDF Mapping&lt;/div&gt; &lt;p&gt;Many of you will know about the W3C relational-to-&lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x1e1be0a8&quot;&gt;RDF&lt;/a&gt; mapping incubator activity. The group is planning to suggest forming a working group for drawing up a specification for relational-to-RDF mapping.&lt;/p&gt; &lt;p&gt;To this effect, I recently summarized the group discussions and some of our own experiences around the topic at &amp;lt;&lt;a href=&quot;http://esw.w3.org/topic/Rdb2RdfXG/ReqForMappingByOErling&quot; id=&quot;link-id146030e8&quot;&gt;http://esw.w3.org/topic/Rdb2RdfXG/ReqForMappingByOErling&lt;/a&gt;&amp;gt;.&lt;/p&gt; &lt;p&gt;I will here discuss this less formally and more in the light of our own experience. A working group goal statement must be neutral vis à vis the following points, even if any working group will unavoidably encounter these issues on the way. A &lt;a href=&quot;http://dbpedia.org/resource/Blog&quot; id=&quot;link-id0x1e6b3950&quot;&gt;blog&lt;/a&gt; post on the other hand can be more specific.&lt;/p&gt; &lt;p&gt;I gave a talk to the &lt;a href=&quot;http://www.w3.org/2005/Incubator/rdb2rdf/&quot; id=&quot;link-id0xa0932c68&quot;&gt;RDB2RDF XG&lt;/a&gt; this spring, with these &lt;a href=&quot;http://virtuoso.openlinksw.com/wiki/main/Main/VirtPresentations/Relational2RDF.ppt&quot; id=&quot;link-id14572540&quot;&gt;slides&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;The main point is that people would really like to map on-the-fly, if they only could. Making an RDF warehouse is not of value in itself, but it is true that in some cases this cannot be avoided.&lt;/p&gt; &lt;p&gt;At first sight, one would think that a mapping specification could be neutral as regards whether one stores the mapped triples as triples or makes them on demand. 
However, there is almost no comparison between the complexity of doing non-trivial mappings on-the-fly versus mapping as ETL. Some of this complexity spills over into the requirements for a mapping language. &lt;/p&gt; &lt;h2&gt;Eliminating JOINs&lt;/h2&gt; &lt;p&gt;We expect to have a situation where one virtual triple can have many possible sources. The mapping is a union of mapped databases. Any integration scenario will have this feature. In such a situation, if we are &lt;code&gt;JOIN&lt;/code&gt;ing using such triples, we end up with &lt;code&gt;UNION&lt;/code&gt;s of all databases that could produce the triples in question. This is generally not desired. Therefore, in the on-demand mapping case, there must be a lot of type inference logic that is not relevant in the ETL scenario.&lt;/p&gt; &lt;p&gt;To make the point clearer, consider a query like &amp;quot;list the organizations whose representatives have published about &lt;i&gt;XX&lt;/i&gt;.&amp;quot; Suppose that there are three databases mapped, all of which have a table of organizations, a table of persons with affiliation to organizations, a table of publications by these persons, and finally a table of tags for the publications. Now, we want the laboratories that have published articles with &lt;a href=&quot;http://dbpedia.org/resource/Tag&quot; id=&quot;link-id0xa0977bf0&quot;&gt;tag&lt;/a&gt; &lt;i&gt;XX&lt;/i&gt;. It is a matter of common sense in this scenario that a publication will have the author and the author&amp;#39;s affiliation in the same database. However, the RDB-to-RDF mapping does not necessarily know this, if all that it is told is that a table makes IRIs of publications by applying a certain pattern to the primary key of the publications table. To infer what needs to be inferred, the system must realize that IRIs from one mapping are disjoint from IRIs from another: A paper in database &lt;i&gt;X&lt;/i&gt; will usually not have an author in database &lt;i&gt;Y&lt;/i&gt;. 
The IDs in database &lt;i&gt;Y&lt;/i&gt;, even if perchance equal to the IDs in &lt;i&gt;X&lt;/i&gt;, do not mean the same thing, and there is no point joining across databases by them.&lt;/p&gt; &lt;p&gt;This entire question is a non-issue in the ETL scenario, but is absolutely vital in the real-time mapping. This is also something that must be stated, at least implicitly, in any mapping. If a mapping translates keys of one place to IRIs with one pattern, and keys from another using another pattern, it must be inferable from the patterns whether the sets of IRIs will be disjoint.&lt;/p&gt; &lt;p&gt;This is critical. Otherwise we will be joining everything to everything else, and there will be orders of magnitude of penalty compared to hand-crafted &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0xa09490f8&quot;&gt;SQL&lt;/a&gt; over the same &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0xa095efd0&quot;&gt;data&lt;/a&gt; sources.&lt;/p&gt; &lt;h2&gt;Expectations and Limitations on Queries&lt;/h2&gt; &lt;p&gt; &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x1e360230&quot;&gt;SPARQL&lt;/a&gt; queries translate quite well to SQL when there is only one table that can produce a triple with a subject of a given class, when there are few columns that can map to a given predicate, and when classes and predicates are literals in the query.&lt;/p&gt; &lt;p&gt; &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x1f5edb30&quot;&gt;Virtuoso&lt;/a&gt; has some SQL extensions for dealing with breaking a wide table into a row per column. This facilitates dealing with predicates that are not known at query compile time. If the table in question is not managed by Virtuoso, Virtuoso&amp;#39;s SQL virtualization/federation takes care of the matter. 
If a mapping system goes directly to third-party SQL, no such tricks can be used.&lt;/p&gt; &lt;p&gt;The above example suggests that for supporting on-the-fly mapping without relying on owning the SQL underneath, some subsets of SPARQL may have to be defined. For example, one will probably have to require that all predicates be literals. The alternative is prohibitive run-time cost and complexity.&lt;/p&gt; &lt;p&gt;But we must not throw the baby out with the bathwater. Aside from offering global identifiers, RDF&amp;#39;s attractions include subclasses and sub-predicates. In relational terms, these translate to &lt;code&gt;UNION&lt;/code&gt;s and do involve some added cost. A mapping system just has to have means of dealing with this cost, and of recognizing cases where this cost is prohibitive. Some further work is likely to be required for defining well-behaved subsets of SPARQL and mappings.&lt;/p&gt; &lt;h2&gt;To ETL or Not to ETL?&lt;/h2&gt; &lt;p&gt;To warehouse or not? If one has hundreds of sources, of which some are not even relational, some ETL would seem necessary. Vipul Kashyap presented a position paper at last year&amp;#39;s RDB-to-RDF mapping workshop in Cambridge, Massachusetts, about a system of relational mapping and on-demand RDF-izers of diverse semi-structured biomedical data, e.g., spreadsheets. The issue certainly exists, and any mapping work will likely encounter integration scenarios where one part is fairly neatly mapped from relational stores, and another part comes from a less structured repository of ETLed physical triples.&lt;/p&gt; &lt;p&gt;Our take is that if something is a large or very large relational store, then map; else, ETL. 
With Virtuoso, we can mix mapped and local triples, but this is not a generally available feature of triple stores, and standardization will likely have to wait until there are more implementations.&lt;/p&gt; &lt;h2&gt;Conclusions&lt;/h2&gt; &lt;ul&gt; &lt;li&gt;If you map on demand, watch out for an explosion of &lt;code&gt;UNION&lt;/code&gt;s when integrating sources that talk of similar things.&lt;/li&gt; &lt;li&gt;If you integrate lots of sources, some ETL is likely unavoidable. Look for ways of dealing with part ETL, part mapping. ETLing everything is not always best or even possible.&lt;/li&gt; &lt;li&gt;If you map a single fairly-clean RDB to RDF, mapping will work well, potentially much faster than triple storage. The relational side offers higher storage density and more data per index lookup.&lt;/li&gt; &lt;li&gt;If you map on demand, some restrictions on SPARQL may be practically necessary. These have to do with variables in predicate position, variables in class position, etc. Individual implementations may support these, but standardization will likely have to put limits on them.&lt;/li&gt; &lt;/ul&gt; &lt;p&gt;This was a quick summary, by no means comprehensive, of what an eventual RDB2RDF working group would come across. This is a sort of addendum to the requirements I outlined on the ESW wiki.&lt;/p&gt; &lt;/div&gt;</atom:content>
  <atom:author>
    <atom:name>Virtuso Data Space Bot</atom:name>
    <atom:email>kidehen@openlinksw.com</atom:email>
   </atom:author>
  <atom:category term="database" />
  <atom:category term="databases" />
  <atom:category term="rdf" />
  <atom:category term="semanticweb" />
  <atom:category term="sparql" />
  <atom:category term="virtuoso" />
  <atom:updated>2008-09-08T15:03:09.000-04:00</atom:updated>
 </atom:entry>
 <atom:entry>
  <atom:title>Transitivity and Graphs for SQL</atom:title>
  <atom:id>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1435</atom:id>
  <atom:link href="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1435" type="text/html" rel="alternate" />
  <atom:published>2008-09-08T09:41:24Z</atom:published>
  <atom:content type="html">&lt;div&gt; &lt;div style=&quot;display:none;&quot;&gt;Transitivity and Graphs for SQL&lt;/div&gt; &lt;h2&gt;Background&lt;/h2&gt; &lt;p&gt;I have mentioned on a couple of prior occasions that basic graph operations ought to be integrated into the &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0xa1a18c58&quot;&gt;SQL&lt;/a&gt; query language.&lt;/p&gt; &lt;p&gt;The history of databases is by and large about moving from specialized applications toward a generic platform. The introduction of the DBMS itself is the archetypal example. It is all about extracting the common features of applications and making these the features of a platform instead.&lt;/p&gt; &lt;p&gt;It is now time to apply this principle to graph traversal.&lt;/p&gt; &lt;p&gt;The rationale is that graph operations are somewhat tedious to write in a parallelize-able, latency-tolerant manner. Writing them as one would for memory-based &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0xaf8c730&quot;&gt;data&lt;/a&gt; structures is easier but totally unscalable as soon as there is any latency involved, i.e., disk reads or messages between cluster peers.&lt;/p&gt; &lt;p&gt;The ad-hoc nature and very large volume of &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0xae41ef0&quot;&gt;RDF&lt;/a&gt; data makes this a timely question. Up until now, the answer to this question has been to materialize any implied facts in RDF stores. 
If &lt;i&gt;a&lt;/i&gt; is part of &lt;i&gt;b&lt;/i&gt;, and &lt;i&gt;b&lt;/i&gt; is part of &lt;i&gt;c&lt;/i&gt;, the implied fact that &lt;i&gt;a&lt;/i&gt; is part of &lt;i&gt;c&lt;/i&gt; would be inserted explicitly into the database as a pre-query step.&lt;/p&gt; &lt;p&gt;This is simple and often efficient, but tends to have the downside that one makes a specialized warehouse for each new type of query. The activity becomes less ad-hoc.&lt;/p&gt; &lt;p&gt;Also, this becomes next to impossible when the scale approaches web scale, or if some of the data is liable to be included in or excluded from the set being analyzed from one moment to the next. This is why with &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0xb68f9d0&quot;&gt;Virtuoso&lt;/a&gt; we have tended to favor inference on demand (&amp;quot;backward chaining&amp;quot;) and mapping of relational data into RDF without copying.&lt;/p&gt; &lt;p&gt;The SQL world has taken steps towards dealing with recursion with the &lt;code&gt;WITH - UNION&lt;/code&gt; construct, which allows definition of recursive views. The idea there is to define, for example, a tree walk as a &lt;code&gt;UNION&lt;/code&gt; of the data of the starting node plus the recursive walk of the starting node&amp;#39;s immediate children.&lt;/p&gt; &lt;p&gt;The main problem with this is that I do not see how an SQL optimizer could effectively rearrange queries involving &lt;code&gt;JOIN&lt;/code&gt;s between such recursive views. This model of recursion seems to lose SQL&amp;#39;s non-procedural nature. One can no longer easily rearrange &lt;code&gt;JOIN&lt;/code&gt;s based on what data is given and what is to be retrieved. If the recursion is written from root to leaf, it is not obvious how to do this from leaf to root. 
At any rate, queries written in this way are so complex to write, let alone optimize, that I decided to take another approach.&lt;/p&gt; &lt;p&gt;Take a question like &amp;quot;list the parts of products of category &lt;i&gt;C&lt;/i&gt; which have materials that are classified as toxic.&amp;quot; Suppose that the product categories are a tree, the product parts are a tree, and the materials classification is a tree taxonomy where &amp;quot;toxic&amp;quot; has a multilevel substructure.&lt;/p&gt; &lt;p&gt;Depending on the count of products and materials, the query can be evaluated by going from products to parts to materials and then climbing up the materials tree to see if the material is toxic. Alternatively, one could do it in reverse, starting with the different toxic materials, looking up the parts containing these, going up the part tree to the product, and up the product hierarchy to see if the product is in the right category. One should be able to evaluate the identical query either way depending on what indices exist, what the cardinalities of the relations are, and so forth; that is, by regular cost-based optimization.&lt;/p&gt; &lt;p&gt;Especially with RDF, there are many problems of this type. In regular SQL, it is a long-standing cultural practice to flatten hierarchies, but this is not the case with RDF.&lt;/p&gt; &lt;p&gt;In Virtuoso, we see &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0xb3bdcc0&quot;&gt;SPARQL&lt;/a&gt; as reducing to SQL. Any RDF-oriented database-engine or query-optimization feature is accessed via SQL. Thus, if we address run-time recursion in the Virtuoso query engine, this becomes, &lt;i&gt;ipso facto&lt;/i&gt;, an SQL feature. Besides, SQL is a much more mature and expressive language than the current SPARQL recommendation.&lt;/p&gt; &lt;h2&gt; SQL and Transitivity &lt;/h2&gt; &lt;p&gt;We will here look at some simple social network queries. 
A later article will show how to do more general graph operations. We extend the SQL derived table construct, i.e., &lt;code&gt;SELECT&lt;/code&gt; in another &lt;code&gt;SELECT&lt;/code&gt;&amp;#39;s &lt;code&gt;FROM&lt;/code&gt; clause, with a &lt;code&gt;TRANSITIVE&lt;/code&gt; clause.&lt;/p&gt; &lt;p&gt;Consider the data:&lt;/p&gt; &lt;blockquote&gt; &lt;pre&gt;&lt;code&gt;CREATE TABLE &amp;quot;knows&amp;quot; (&amp;quot;p1&amp;quot; INT, &amp;quot;p2&amp;quot; INT, PRIMARY KEY (&amp;quot;p1&amp;quot;, &amp;quot;p2&amp;quot;) ); ALTER INDEX &amp;quot;knows&amp;quot; ON &amp;quot;knows&amp;quot; PARTITION (&amp;quot;p1&amp;quot; INT); CREATE INDEX &amp;quot;knows2&amp;quot; ON &amp;quot;knows&amp;quot; (&amp;quot;p2&amp;quot;, &amp;quot;p1&amp;quot;) PARTITION (&amp;quot;p2&amp;quot; INT); &lt;/code&gt; &lt;/pre&gt;&lt;/blockquote&gt; &lt;p&gt;We represent a social network with the many-to-many relation &amp;quot;knows&amp;quot;. The persons are identified by integers.&lt;/p&gt; &lt;blockquote&gt; &lt;pre&gt;&lt;code&gt;INSERT INTO &amp;quot;knows&amp;quot; VALUES (1, 2); INSERT INTO &amp;quot;knows&amp;quot; VALUES (1, 3); INSERT INTO &amp;quot;knows&amp;quot; VALUES (2, 4);&lt;/code&gt; &lt;/pre&gt; &lt;pre&gt;&lt;code&gt;SELECT * FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT &amp;quot;p1&amp;quot;, &amp;quot;p2&amp;quot; FROM &amp;quot;knows&amp;quot; ) &amp;quot;k&amp;quot; WHERE &amp;quot;k&amp;quot;.&amp;quot;p1&amp;quot; = 1;&lt;/code&gt;&lt;/pre&gt;&lt;/blockquote&gt; &lt;p&gt;We obtain the result:&lt;/p&gt; &lt;blockquote&gt; &lt;table width=&quot;100&quot;&gt; &lt;tr&gt; &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;p1&lt;/th&gt; &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;p2&lt;/th&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt; &lt;td align=&quot;center&quot;&gt;3&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt; &lt;td align=&quot;center&quot;&gt;2&lt;/td&gt; 
&lt;/tr&gt; &lt;tr&gt; &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt; &lt;td align=&quot;center&quot;&gt;4&lt;/td&gt; &lt;/tr&gt; &lt;/table&gt; &lt;/blockquote&gt; &lt;p&gt;The operation is reversible:&lt;/p&gt; &lt;blockquote&gt; &lt;pre&gt;&lt;code&gt;SELECT * FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT &amp;quot;p1&amp;quot;, &amp;quot;p2&amp;quot; FROM &amp;quot;knows&amp;quot; ) &amp;quot;k&amp;quot; WHERE &amp;quot;k&amp;quot;.&amp;quot;p2&amp;quot; = 4; &lt;/code&gt; &lt;/pre&gt; &lt;table width=&quot;100&quot;&gt; &lt;tr&gt; &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;p1&lt;/th&gt; &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;p2&lt;/th&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td align=&quot;center&quot;&gt;2&lt;/td&gt; &lt;td align=&quot;center&quot;&gt;4&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt; &lt;td align=&quot;center&quot;&gt;4&lt;/td&gt; &lt;/tr&gt; &lt;/table&gt; &lt;/blockquote&gt; &lt;p&gt;Since now we give &lt;i&gt;p2&lt;/i&gt;, we traverse from &lt;i&gt;p2&lt;/i&gt; towards &lt;i&gt;p1&lt;/i&gt;. 
The result set states that 4 is known by 2 and 2 is known by 1.&lt;/p&gt; &lt;p&gt;To see what would happen if &lt;i&gt;x&lt;/i&gt; knowing &lt;i&gt;y&lt;/i&gt; also meant &lt;i&gt;y&lt;/i&gt; knowing &lt;i&gt;x&lt;/i&gt;, one could write:&lt;/p&gt; &lt;blockquote&gt; &lt;pre&gt;&lt;code&gt;SELECT * FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT &amp;quot;p1&amp;quot;, &amp;quot;p2&amp;quot; FROM (SELECT &amp;quot;p1&amp;quot;, &amp;quot;p2&amp;quot; FROM &amp;quot;knows&amp;quot; UNION ALL SELECT &amp;quot;p2&amp;quot;, &amp;quot;p1&amp;quot; FROM &amp;quot;knows&amp;quot; ) &amp;quot;k2&amp;quot; ) &amp;quot;k&amp;quot; WHERE &amp;quot;k&amp;quot;.&amp;quot;p2&amp;quot; = 4;&lt;/code&gt; &lt;/pre&gt; &lt;table width=&quot;100&quot;&gt; &lt;tr&gt; &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;p1&lt;/th&gt; &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;p2&lt;/th&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td align=&quot;center&quot;&gt;2&lt;/td&gt; &lt;td align=&quot;center&quot;&gt;4&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt; &lt;td align=&quot;center&quot;&gt;4&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td align=&quot;center&quot;&gt;3&lt;/td&gt; &lt;td align=&quot;center&quot;&gt;4&lt;/td&gt; &lt;/tr&gt; &lt;/table&gt; &lt;/blockquote&gt; &lt;p&gt;Now, since we know that 1 and 4 are related, we can ask how they are related.&lt;/p&gt; &lt;blockquote&gt; &lt;pre&gt;&lt;code&gt;SELECT * FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT &amp;quot;p1&amp;quot;, &amp;quot;p2&amp;quot;, T_STEP (1) AS &amp;quot;via&amp;quot;, T_STEP (&amp;#39;step_no&amp;#39;) AS &amp;quot;step&amp;quot;, T_STEP (&amp;#39;path_id&amp;#39;) AS &amp;quot;path&amp;quot; FROM &amp;quot;knows&amp;quot; ) &amp;quot;k&amp;quot; WHERE &amp;quot;p1&amp;quot; = 1 AND &amp;quot;p2&amp;quot; = 4;&lt;/code&gt; &lt;/pre&gt; &lt;table width=&quot;250&quot;&gt; &lt;tr&gt; &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;p1&lt;/th&gt; &lt;th 
align=&quot;center&quot; width=&quot;50&quot;&gt;p2&lt;/th&gt; &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;via&lt;/th&gt; &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;step&lt;/th&gt; &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;path&lt;/th&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt; &lt;td align=&quot;center&quot;&gt;4&lt;/td&gt; &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt; &lt;td align=&quot;center&quot;&gt;0&lt;/td&gt; &lt;td align=&quot;center&quot;&gt;0&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt; &lt;td align=&quot;center&quot;&gt;4&lt;/td&gt; &lt;td align=&quot;center&quot;&gt;2&lt;/td&gt; &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt; &lt;td align=&quot;center&quot;&gt;0&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt; &lt;td align=&quot;center&quot;&gt;4&lt;/td&gt; &lt;td align=&quot;center&quot;&gt;4&lt;/td&gt; &lt;td align=&quot;center&quot;&gt;2&lt;/td&gt; &lt;td align=&quot;center&quot;&gt;0&lt;/td&gt; &lt;/tr&gt; &lt;/table&gt; &lt;/blockquote&gt; &lt;p&gt;The first two columns are the ends of the path. The next column is the person that is a step on the path. The next one is the number of the step, counting from 0, so that the end of the path that corresponds to the end condition on the column designated as input, i.e., &lt;i&gt;p1&lt;/i&gt;, has number 0. 
Since there can be multiple solutions, the last column is a sequence number that distinguishes alternative paths from each other.&lt;/p&gt; &lt;p&gt;For LinkedIn users: the query for friends ordered by distance and by descending friend count, which is the basis of most LinkedIn search result views, can be written as: &lt;/p&gt; &lt;blockquote&gt; &lt;pre&gt;&lt;code&gt;SELECT &amp;quot;p2&amp;quot;, &amp;quot;dist&amp;quot;, (SELECT COUNT (*) FROM &amp;quot;knows&amp;quot; &amp;quot;c&amp;quot; WHERE &amp;quot;c&amp;quot;.&amp;quot;p1&amp;quot; = &amp;quot;k&amp;quot;.&amp;quot;p2&amp;quot; ) FROM (SELECT TRANSITIVE T_IN (1) T_OUT (2) T_DISTINCT &amp;quot;p1&amp;quot;, &amp;quot;p2&amp;quot;, T_STEP (&amp;#39;step_no&amp;#39;) AS &amp;quot;dist&amp;quot; FROM &amp;quot;knows&amp;quot; ) &amp;quot;k&amp;quot; WHERE &amp;quot;p1&amp;quot; = 1 ORDER BY &amp;quot;dist&amp;quot;, 3 DESC;&lt;/code&gt; &lt;/pre&gt; &lt;table width=&quot;150&quot;&gt; &lt;tr&gt; &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;p2&lt;/th&gt; &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;dist&lt;/th&gt; &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;aggregate&lt;/th&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td align=&quot;center&quot;&gt;2&lt;/td&gt; &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt; &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td align=&quot;center&quot;&gt;3&lt;/td&gt; &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt; &lt;td align=&quot;center&quot;&gt;0&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td align=&quot;center&quot;&gt;4&lt;/td&gt; &lt;td align=&quot;center&quot;&gt;2&lt;/td&gt; &lt;td align=&quot;center&quot;&gt;0&lt;/td&gt; &lt;/tr&gt; &lt;/table&gt; &lt;/blockquote&gt; &lt;h2&gt;How?&lt;/h2&gt; &lt;p&gt;The queries shown above work on Virtuoso v6. When running in cluster mode, several thousand graph traversal steps may be proceeding at the same time, meaning that all database access is parallelized and that the algorithm is internally latency-tolerant. 
By default, all results are produced in a deterministic order, permitting predictable slicing of result sets.&lt;/p&gt; &lt;p&gt;Furthermore, for queries where both ends of a path are given, the optimizer may decide to attack the path from both ends simultaneously. So, supposing that every member of a social network has an average of 30 contacts, and we need to find a path between two users that are no more than 6 steps apart, we begin at both ends, expanding each up to 3 levels, and we stop when we find the first intersection. Thus, we reach 2 * 30^3 = 54,000 nodes, and not 30^6 = 729,000,000 nodes.&lt;/p&gt; &lt;p&gt;Writing a generic database driven graph traversal framework on the application side, say in Java over &lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id0xa8a9ef8&quot;&gt;JDBC&lt;/a&gt;, would easily be over a thousand lines. This is much more work than can be justified just for a one-off, ad-hoc query. Besides, the traversal order in such a case could not be optimized by the DBMS.&lt;/p&gt; &lt;h2&gt;Next&lt;/h2&gt; &lt;p&gt;In a future &lt;a href=&quot;http://dbpedia.org/resource/Blog&quot; id=&quot;link-id0xb526a40&quot;&gt;blog&lt;/a&gt; post I will show how this feature can be used for common graph tasks like critical path, itinerary planning, traveling salesman, the 8 queens chess problem, etc. There are lots of switches for controlling different parameters of the traversal. This is just the beginning. I will also give examples of the use of this in SPARQL.&lt;/p&gt; &lt;/div&gt;</atom:content>
  <atom:author>
    <atom:name>Virtuso Data Space Bot</atom:name>
    <atom:email>kidehen@openlinksw.com</atom:email>
   </atom:author>
  <atom:category term="database" />
  <atom:category term="databases" />
  <atom:category term="rdf" />
  <atom:category term="jdbc" />
  <atom:category term="sql" />
  <atom:category term="semanticweb" />
  <atom:category term="sparql" />
  <atom:category term="howto" />
  <atom:category term="history" />
  <atom:category term="virtuoso" />
  <atom:updated>2008-09-08T15:43:07.000-04:00</atom:updated>
 </atom:entry>
 <atom:entry>
  <atom:title>Epistemology of the Sponger, or How Virtuoso Drives a Web Query</atom:title>
  <atom:id>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1432</atom:id>
  <atom:link href="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1432" type="text/html" rel="alternate" />
  <atom:published>2008-09-05T09:20:56Z</atom:published>
  <atom:content type="html">&lt;div&gt; &lt;div style=&quot;display:none;&quot;&gt;Epistemology of the Sponger, or How Virtuoso Drives a Web Query&lt;/div&gt; &lt;p&gt; &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x1ed6cf28&quot;&gt;Virtuoso&lt;/a&gt; has an extensive collection of &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x1f8d1f78&quot;&gt;RDF&lt;/a&gt;-izers called Sponger Cartridges. These take a web resource in one of 30+ formats (so far) and extract RDF from it. The Virtuoso &lt;a href=&quot;http://virtuoso.openlinksw.com/Whitepapers/html/VirtSpongerWhitePaper.html&quot; id=&quot;link-id0x1edc90e8&quot;&gt;Sponger&lt;/a&gt; is a device which evaluates a query and along the way, finds dereferenceable links, dereferences them, and iteratively re-evaluates the query, until either nothing new is found or some limit is reached.&lt;/p&gt; &lt;p&gt;We could call this &lt;i&gt;query-driven crawling&lt;/i&gt;. The idea is intuitive — what one looks for, determines what one finds.&lt;/p&gt; &lt;p&gt;This does however raise certain questions pertaining to the nature and ultimate possibility of &lt;a href=&quot;http://dbpedia.org/resource/Knowledge&quot; id=&quot;link-id0x1f836b68&quot;&gt;knowledge&lt;/a&gt;, i.e., epistemology.&lt;/p&gt; &lt;p&gt;The process of querying could be said to go from the few to the many, just like the process of harvesting &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x1edb1648&quot;&gt;data&lt;/a&gt; from the web, the way any search engine does. One follows links or makes joins and thereby increases one&amp;#39;s reach.&lt;/p&gt; &lt;p&gt;The difference is that a query has no &lt;i&gt;a priori&lt;/i&gt; direction. If I ask for the phone numbers of my friends and there are no phone numbers in the database, then it is valid to give an empty result without looking at my friends at all. 
&lt;a href=&quot;http://dbpedia.org/resource/Closed_world_assumption&quot; id=&quot;link-id0x1edf1f30&quot;&gt;Closed world&lt;/a&gt;, as it is said. Never mind that the friends would have had a &amp;quot;see also&amp;quot; link to a retrievable document that did have a phone number.&lt;/p&gt; &lt;p&gt;The problem is that a query execution plan determines what possible dereferenceable material the query will encounter during its execution. What is worse, a query plan tends toward the minimal, i.e., toward minimizing the chances of encountering something dereferenceable along the way. Where query and crawl appeared similar, they in fact have opposite goals.&lt;/p&gt; &lt;p&gt;The user generally has no idea of the execution plan. In the general case, the user &lt;i&gt;cannot&lt;/i&gt; have an idea of this plan. There are valid reasons, over 40 years old, for leaving the query planning to the database. In exceptional situations the user can read or direct the plan, but this is really quite tedious and requires understanding that is basically never present.&lt;/p&gt; &lt;p&gt;So, given a query, how do we find data that will match it, short of having a pre-loaded database of absolutely everything? This is certainly a desirable goal, and all in the &lt;a href=&quot;http://dbpedia.org/resource/Open_world_assumption&quot; id=&quot;link-id0x1eb46548&quot;&gt;open world&lt;/a&gt;, distributed spirit of the web.&lt;/p&gt; &lt;p&gt;Let us limit ourselves to queries that have some literals in the object or subject positions. A &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x1ed293f8&quot;&gt;SPARQL&lt;/a&gt; query is basically a graph. Its vertices are variables and literals, and its edges are triple patterns. An edge is labeled by a predicate. For now, we will consider the predicate to always be a literal. From each literal, we can draw a tree, following each edge starting at this literal and descending until we find another literal. 
Not every tree is a spanning tree of the graph, but all the trees collectively span the graph.&lt;/p&gt; &lt;p&gt;Consider the query &lt;/p&gt; &lt;blockquote&gt; &lt;code&gt;{ &amp;lt;john&amp;gt; knows ?x . &amp;lt;mary&amp;gt; knows ?x . ?x label ?l }&lt;/code&gt; &lt;/blockquote&gt; &lt;p&gt;The starting points are the literals &lt;code&gt;john&lt;/code&gt; and &lt;code&gt;mary&lt;/code&gt;. The &lt;code&gt;john&lt;/code&gt; tree has one child, &lt;code&gt;?x&lt;/code&gt;, which has the children &lt;code&gt;mary&lt;/code&gt; and &lt;code&gt;?l&lt;/code&gt;. One could notate it as&lt;/p&gt; &lt;blockquote&gt; &lt;code&gt;{ &amp;lt;john&amp;gt; knows ?x . {{ &amp;lt;mary&amp;gt; knows ?x} UNION {?x label ?l}}}&lt;/code&gt; &lt;/blockquote&gt; &lt;p&gt;That is, the head first, and if it has more than one child, a union listing them, recursively.&lt;/p&gt; &lt;p&gt;If one composed such queries for each literal in the original pattern and evaluated each as a breadth-first walk of the tree, with no query optimization tricks, and for each binding of each variable recorded whether there was something to dereference, one would in a finite time have reached all the directly reachable data. Then one could evaluate the original query, using whatever plan was preferred.&lt;/p&gt; &lt;p&gt;The check for dereferenceable data applied to each IRI-valued binding formed in the above evaluation would consist of looking for &amp;quot;see also&amp;quot;, &amp;quot;same as&amp;quot;, and other such properties of the IRI. It could also consult text-based search engines. Since the evaluation is breadth-first, it generates a large number of parallel tasks and is fairly latency tolerant, i.e., it will not die if it must retrieve a few pages from remote sources. 
We will leave the exact rewrite rules for unions, optionals, aggregates, subqueries, and so on, as an exercise; the general idea should be clear enough.&lt;/p&gt; &lt;p&gt;We have here shown a way of transforming SPARQL queries in such a way as to guarantee dereferencing of findable links, without requiring the end user to either explicitly specify or understand query plans.&lt;/p&gt; &lt;p&gt;The present Sponger does not work exactly in this manner but it will be developed in this direction. Fortunately, the algorithms outlined above are nothing complicated.&lt;/p&gt; &lt;/div&gt;</atom:content>
  <atom:author>
    <atom:name>Virtuso Data Space Bot</atom:name>
    <atom:email>kidehen@openlinksw.com</atom:email>
   </atom:author>
  <atom:category term="database" />
  <atom:category term="databases" />
  <atom:category term="rdf" />
  <atom:category term="semanticweb" />
  <atom:category term="sparql" />
  <atom:category term="virtuoso" />
  <atom:updated>2008-09-05T16:04:28.000-04:00</atom:updated>
 </atom:entry>
 <atom:entry>
  <atom:title>A quick look at SP2B, the SPARQL Performance Benchmark</atom:title>
  <atom:id>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1423</atom:id>
  <atom:link href="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1423" type="text/html" rel="alternate" />
  <atom:published>2008-08-27T16:03:40Z</atom:published>
  <atom:content type="html">&lt;div&gt; &lt;div style=&quot;display:none;&quot;&gt;A quick look at SP2B, the SPARQL Performance Benchmark&lt;/div&gt; &lt;p&gt;I finally got around to running the &lt;a href=&quot;http://dbis.informatik.uni-freiburg.de/index.php?project=SP2B&quot; id=&quot;link-id17bac628&quot;&gt;SP&lt;sup&gt;2&lt;/sup&gt;B SPARQL Performance Benchmark&lt;/a&gt; on the current &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x1dcaaa48&quot;&gt;Virtuoso&lt;/a&gt; Open Source Edition, v5.0.8.&lt;/p&gt; &lt;p&gt;I ran it with the 5M triples scale, which is the highest scale for which the authors give numbers.&lt;/p&gt; &lt;p&gt;I got a run time of 25 minutes for the 12 queries, giving an arithmetic mean of the query time of 125 seconds. This is better than the 800 or so seconds that the authors had measured. Also, Q6 of the set had failed for the authors, but we have since fixed this; the fix is in the v5.0.8 cut.&lt;/p&gt; &lt;p&gt;I also tried it with a scale of 25M, but this became I/O bound and took a bit longer. I will try this with v6 and v7 cluster later, which are vastly better at anything I/O bound.&lt;/p&gt; &lt;p&gt;The machine was a 2GHz Xeon with 8G RAM. The query text was the one from the authors, with an explicit &lt;code&gt;FROM&lt;/code&gt; clause added; the client was the command line Interactive &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x1be2c808&quot;&gt;SQL&lt;/a&gt; (iSQL).&lt;/p&gt; &lt;p&gt;If one does the test with the default index layout without specifying a graph, things will not work very well. Also, returning the million-row results of these queries over the &lt;a href=&quot;http://www.w3.org/TR/rdf-sparql-protocol/&quot; id=&quot;link-id0x1d7ac018&quot;&gt;SPARQL protocol&lt;/a&gt; is not practical.&lt;/p&gt; &lt;p&gt;I will say something more about SP&lt;sup&gt;2&lt;/sup&gt;B when I get to have a closer look.&lt;/p&gt; &lt;/div&gt;</atom:content>
  <atom:author>
    <atom:name>Virtuso Data Space Bot</atom:name>
    <atom:email>kidehen@openlinksw.com</atom:email>
   </atom:author>
  <atom:category term="database" />
  <atom:category term="databases" />
  <atom:category term="benchmarking" />
  <atom:category term="scalability" />
  <atom:category term="semanticweb" />
  <atom:category term="sparql" />
  <atom:category term="virtuoso" />
  <atom:updated>2008-09-02T09:49:57.000-04:00</atom:updated>
 </atom:entry>
 <atom:entry>
  <atom:title>Configuring Virtuoso for Benchmarking</atom:title>
  <atom:id>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1419</atom:id>
  <atom:link href="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1419" type="text/html" rel="alternate" />
  <atom:published>2008-08-25T14:06:11Z</atom:published>
  <atom:content type="html">&lt;div&gt; &lt;div style=&quot;display:none;&quot;&gt;Configuring Virtuoso for Benchmarking&lt;/div&gt; &lt;p&gt;I will here summarize what should be known about running benchmarks with &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0xc152cf0&quot;&gt;Virtuoso&lt;/a&gt;.&lt;/p&gt; &lt;h2&gt;Physical Memory&lt;/h2&gt; &lt;p&gt;For 8G RAM, in the &lt;code&gt;[Parameters]&lt;/code&gt; stanza of &lt;code&gt;virtuoso.ini&lt;/code&gt;, set —&lt;/p&gt; &lt;blockquote&gt; &lt;code&gt; [Parameters]&lt;br /&gt; ...&lt;br /&gt; NumberOfBuffers = 550000 &lt;/code&gt; &lt;/blockquote&gt; &lt;p&gt;For 16G RAM, double this—&lt;/p&gt; &lt;blockquote&gt; &lt;code&gt; [Parameters]&lt;br /&gt; ...&lt;br /&gt; NumberOfBuffers = 1100000 &lt;/code&gt; &lt;/blockquote&gt; &lt;h2&gt;Transaction Isolation&lt;/h2&gt; &lt;p&gt;For most cases, certainly all &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0xb7ba270&quot;&gt;RDF&lt;/a&gt; cases, &lt;i&gt;Read Committed&lt;/i&gt; should be the default transaction isolation. In the &lt;code&gt;[Parameters]&lt;/code&gt; stanza of &lt;code&gt;virtuoso.ini&lt;/code&gt;, set —&lt;/p&gt; &lt;blockquote&gt; &lt;code&gt; [Parameters]&lt;br /&gt; ...&lt;br /&gt; DefaultIsolation = 2 &lt;/code&gt; &lt;/blockquote&gt; &lt;h2&gt;Multiuser Workload&lt;/h2&gt; &lt;p&gt;If &lt;a href=&quot;http://dbpedia.org/resource/Open_Database_Connectivity&quot; id=&quot;link-id0x1a40f308&quot;&gt;ODBC&lt;/a&gt;, &lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id0x1e003cf8&quot;&gt;JDBC&lt;/a&gt;, or similarly connected client applications are used, there must be more &lt;code&gt;ServerThreads&lt;/code&gt; available than there will be client connections. 
In the &lt;code&gt;[Parameters]&lt;/code&gt; stanza of &lt;code&gt;virtuoso.ini&lt;/code&gt;, set —&lt;/p&gt; &lt;blockquote&gt; &lt;code&gt; [Parameters]&lt;br /&gt; ...&lt;br /&gt; ServerThreads = 100 &lt;/code&gt; &lt;/blockquote&gt; &lt;p&gt;With web clients (unlike ODBC, JDBC, or similar clients), it may be justified to have fewer &lt;code&gt;ServerThreads&lt;/code&gt; than there are concurrent clients. The &lt;code&gt;MaxKeepAlives&lt;/code&gt; should be the maximum number of expected web clients. This can be more than the &lt;code&gt;ServerThreads&lt;/code&gt; count. In the &lt;code&gt;[HTTPServer]&lt;/code&gt; stanza of &lt;code&gt;virtuoso.ini&lt;/code&gt;, set —&lt;/p&gt; &lt;blockquote&gt; &lt;code&gt; [HTTPServer]&lt;br /&gt; ...&lt;br /&gt; ServerThreads = 100 &lt;br /&gt; MaxKeepAlives = 1000 &lt;br /&gt; KeepAliveTimeout = 10 &lt;/code&gt; &lt;/blockquote&gt; &lt;p&gt; &lt;i&gt;&lt;b&gt;Note&lt;/b&gt; — The &lt;code&gt;[HTTPServer] ServerThreads&lt;/code&gt; are taken from the total pool made available by the &lt;code&gt;[Parameters] ServerThreads&lt;/code&gt;. Thus, the &lt;code&gt;[Parameters] ServerThreads&lt;/code&gt; should always be at least as large as (and is best set greater than) the &lt;code&gt;[HTTPServer] ServerThreads&lt;/code&gt;, and if using the closed-source Commercial Version, should not exceed the licensed thread count.&lt;/i&gt; &lt;/p&gt; &lt;h2&gt;Disk Use&lt;/h2&gt; &lt;p&gt;The basic rule is to use one stripe (file) per distinct physical device (not per file system), using no RAID. For example, one might stripe a database over 6 files (6 physical disks), with an initial size of 60000 pages (the files will grow as needed). 
&lt;/p&gt; &lt;p&gt;For the above described example, in the &lt;code&gt;[Database]&lt;/code&gt; stanza of &lt;code&gt;virtuoso.ini&lt;/code&gt;, set —&lt;/p&gt; &lt;blockquote&gt; &lt;code&gt; [Database]&lt;br /&gt; ...&lt;br /&gt; Striping = 1&lt;br /&gt; MaxCheckpointRemap = 2000000 &lt;/code&gt; &lt;/blockquote&gt; &lt;p&gt;— and in the &lt;code&gt;[Striping]&lt;/code&gt; stanza, on one line per &lt;code&gt;SegmentName&lt;/code&gt;, set —&lt;/p&gt; &lt;blockquote&gt; &lt;code&gt; [Striping]&lt;br /&gt; ...&lt;br /&gt; Segment1 = 60000 , /virtdev/db/virt-seg1.db = q1 , /data1/db/virt-seg1-str2.db = q2 , /data2/db/virt-seg1-str3.db = q3 , /data3/db/virt-seg1-str4.db = q4 , /data4/db/virt-seg1-str5.db = q5 , /data5/db/virt-seg1-str6.db = q6&lt;/code&gt; &lt;/blockquote&gt; &lt;p&gt;As can be seen here, each file gets a background IO thread (the &lt;code&gt;= q&lt;i&gt;xxx&lt;/i&gt;&lt;/code&gt; clause). It should be noted that all files on the same physical device should have the same &lt;code&gt;q&lt;i&gt;xxx&lt;/i&gt;&lt;/code&gt; value. This is not directly relevant to the benchmarking scenario above, because we have only one file per device, and thus only one file per IO queue.&lt;/p&gt; &lt;h2&gt; &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0xc8b97c0&quot;&gt;SQL&lt;/a&gt; Optimization&lt;/h2&gt; &lt;p&gt;If queries have lots of joins but access little &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x193b2fa8&quot;&gt;data&lt;/a&gt;, as with the &lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id0x1b283ca0&quot;&gt;Berlin SPARQL Benchmark&lt;/a&gt;, the SQL compiler must be told not to look for better plans if the best plan so far is quicker than the compilation time expended so far. 
Thus, in the &lt;code&gt;[Parameters]&lt;/code&gt; stanza of &lt;code&gt;virtuoso.ini&lt;/code&gt;, set —&lt;/p&gt; &lt;blockquote&gt; &lt;code&gt; [Parameters]&lt;br /&gt; ...&lt;br /&gt; StopCompilerWhenXOverRunTime = 1 &lt;/code&gt; &lt;/blockquote&gt; &lt;/div&gt;</atom:content>
  <atom:author>
    <atom:name>Virtuso Data Space Bot</atom:name>
    <atom:email>kidehen@openlinksw.com</atom:email>
   </atom:author>
  <atom:category term="database" />
  <atom:category term="databases" />
  <atom:category term="benchmarking" />
  <atom:category term="scalability" />
  <atom:category term="rdf" />
  <atom:category term="jdbc" />
  <atom:category term="sql" />
  <atom:category term="odbc" />
  <atom:category term="semanticweb" />
  <atom:category term="sparql" />
  <atom:category term="virtuoso" />
  <atom:updated>2008-08-25T15:29:06.36000-04:00</atom:updated>
 </atom:entry>
 <atom:entry>
  <atom:title>BSBM With Triples and Mapped Relational Data</atom:title>
  <atom:id>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1410</atom:id>
  <atom:link href="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1410" type="text/html" rel="alternate" />
  <atom:published>2008-08-06T19:41:50Z</atom:published>
  <atom:content type="html">&lt;div&gt; &lt;div style=&quot;display:none;&quot;&gt;BSBM With Triples and Mapped Relational Data&lt;/div&gt; &lt;p&gt;The special contribution of the &lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id10039db0&quot;&gt;Berlin SPARQL Benchmark&lt;/a&gt; (&lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id106b2538&quot;&gt;BSBM&lt;/a&gt;) to the &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id101a75f8&quot;&gt;RDF&lt;/a&gt; world is to raise the question of doing OLTP with &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0xae54170&quot;&gt;RDF&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;Of course, here we immediately hit the question of comparisons with relational databases. To this effect, &lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id0x1e847b08&quot;&gt;BSBM&lt;/a&gt; also specifies a relational schema and can generate the &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id1206c378&quot;&gt;data&lt;/a&gt; as either triples or &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id1667f040&quot;&gt;SQL&lt;/a&gt; inserts.&lt;/p&gt; &lt;p&gt;The benchmark effectively simulates the case of exposing an existing &lt;a href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id10a93518&quot;&gt;RDBMS&lt;/a&gt; as RDF. &lt;a href=&quot;http://www.openlinksw.com/dataspace/organization/openlink#this&quot; id=&quot;link-id13e46d80&quot;&gt;OpenLink Software&lt;/a&gt; calls this &lt;i&gt;RDF Views&lt;/i&gt;. &lt;a href=&quot;http://dbpedia.org/resource/Oracle_Database&quot; id=&quot;link-id12027578&quot;&gt;Oracle&lt;/a&gt; is beginning to call this &lt;i&gt;semantic covers&lt;/i&gt;. 
The &lt;a href=&quot;http://www.w3.org/2005/Incubator/rdb2rdf/&quot; id=&quot;link-id161dc678&quot;&gt;RDB2RDF XG&lt;/a&gt;, a W3C incubator group, has been active in this area since Spring, 2008.&lt;/p&gt; &lt;h3&gt;But why an OLTP workload with RDF to begin with?&lt;/h3&gt; &lt;p&gt;We believe this is relevant because RDF promises to be the interoperability factor between potentially all of traditional IS. If &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x1e7119d8&quot;&gt;data&lt;/a&gt; is online for human consumption, it may be online via a &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id106a8908&quot;&gt;SPARQL&lt;/a&gt; end-point as well. The economic justification will come from discoverability and from applications integrating multi-source structured data. Online shopping is a fine use case.&lt;/p&gt; &lt;p&gt;Warehousing all the world&amp;#39;s publishable data as RDF is not our first preference, nor would it be the publisher&amp;#39;s. Considerations of duplicate infrastructure and maintenance are reason enough. Consequently, we need to show that mapping can outperform an RDF warehouse, which is what we&amp;#39;ll do here.&lt;/p&gt; &lt;h3&gt;What We Got &lt;/h3&gt; &lt;p&gt;First, we found that &lt;a href=&quot;http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1400&quot; id=&quot;link-id150ea748&quot;&gt;making the query plan took much too long&lt;/a&gt; in proportion to the run time. With BSBM this is an issue because the queries have lots of joins but access relatively little data. So we made a faster compiler and along the way retouched the cost model a bit.&lt;/p&gt; &lt;p&gt;But the really interesting part with BSBM is mapping relational data to RDF. For us, BSBM is a great way of showing that mapping can outperform even the best triple store. A relational row store is as good as unbeatable with the query mix. 
And when there is a clear mapping, there is no reason the &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0xae5aff0&quot;&gt;SPARQL&lt;/a&gt; could not be directly translated.&lt;/p&gt; &lt;p&gt;If Chris Bizer et al launched the mapping ship, we will be the ones to pilot it to harbor!&lt;/p&gt; &lt;p&gt;We filled two &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id12dbdc70&quot;&gt;Virtuoso&lt;/a&gt; instances with a BSBM200000 data set, for 100M triples. One was filled with physical triples; the other was filled with the equivalent relational data plus mapping to triples. Performance figures are given in &amp;quot;query mixes per hour&amp;quot;. (An update or follow-on to this post will provide elapsed times for each test run.)&lt;/p&gt; &lt;p&gt;With the unmodified benchmark we got:&lt;/p&gt; &lt;blockquote&gt; &lt;table&gt; &lt;tr&gt; &lt;td&gt;&lt;i&gt;Physical Triples:&lt;/i&gt; &lt;/td&gt; &lt;td&gt;   &lt;/td&gt; &lt;td&gt;1297 qmph&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;&lt;i&gt;Mapped Triples:&lt;/i&gt; &lt;/td&gt; &lt;td&gt;   &lt;/td&gt; &lt;td&gt;&lt;b&gt;3144 qmph&lt;/b&gt; &lt;/td&gt; &lt;/tr&gt; &lt;/table&gt; &lt;/blockquote&gt; &lt;p&gt;In both cases, most of the time was spent on Q6, which looks for products with one of three words in the label. We altered Q6 to use text index for the mapping, and altered the databases accordingly. 
(There is no such thing as an e-commerce site without a text index, so we are amply justified in making this change.)&lt;/p&gt; &lt;p&gt;The following were measured on the second run of a 100 query mix series, single test driver, warm cache.&lt;/p&gt; &lt;blockquote&gt; &lt;table&gt; &lt;tr&gt; &lt;td&gt;&lt;i&gt;Physical Triples:&lt;/i&gt; &lt;/td&gt; &lt;td&gt;   &lt;/td&gt; &lt;td&gt; 5746 qmph&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;&lt;i&gt;Mapped Triples:&lt;/i&gt; &lt;/td&gt; &lt;td&gt;   &lt;/td&gt; &lt;td&gt; &lt;b&gt;7525 qmph&lt;/b&gt; &lt;/td&gt; &lt;/tr&gt; &lt;/table&gt; &lt;/blockquote&gt; &lt;p&gt;We then ran the same with 4 concurrent instances of the test driver. The qmph here is 400 / the longest run time.&lt;/p&gt; &lt;blockquote&gt; &lt;table&gt; &lt;tr&gt; &lt;td&gt;&lt;i&gt;Physical Triples:&lt;/i&gt; &lt;/td&gt; &lt;td&gt;   &lt;/td&gt; &lt;td&gt; 19459 qmph&lt;/td&gt; &lt;/tr&gt; &lt;tr&gt; &lt;td&gt;&lt;i&gt;Mapped Triples:&lt;/i&gt; &lt;/td&gt; &lt;td&gt;   &lt;/td&gt; &lt;td&gt; &lt;b&gt;24531 qmph&lt;/b&gt; &lt;/td&gt; &lt;/tr&gt; &lt;/table&gt; &lt;/blockquote&gt; &lt;p&gt;The system used was 64-bit Linux, 2GHz dual-Xeon 5130 (8 cores) with 8G RAM. The concurrent throughputs are a little under 4 times the single thread throughput, which is normal for SMP due to memory contention. The numbers do not evidence significant overhead from thread synchronization.&lt;/p&gt; &lt;p&gt;The query compilation represents about 1/3 of total server side CPU. In an actual online application of this type, queries would be parameterized, so the throughputs would be accordingly higher. 
We used the &lt;code&gt;StopCompilerWhenXOverRunTime = 1&lt;/code&gt; option here to cut needless compiler overhead, the queries being straightforward enough.&lt;/p&gt; &lt;p&gt;We also see that the advantage of mapping can be further increased by more compiler optimizations, so we expect in the end mapping will lead RDF warehousing by a factor of 4 or so.&lt;/p&gt; &lt;h3&gt;Suggestions for BSBM&lt;/h3&gt; &lt;ul&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;Reporting Rules.&lt;/b&gt; The benchmark spec should specify a form for disclosure of test run data, TPC style. This includes things like configuration parameters and exact text of queries. There should be accepted variants of query text, as with the TPC.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;Multiuser operation.&lt;/b&gt; The test driver should get a stream number as parameter, so that each client makes a different query sequence. Also, disk performance in this type of benchmark can only be reasonably assessed with a naturally parallel multiuser workload.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;Add business intelligence.&lt;/b&gt; SPARQL has aggregates now, at least with &lt;a href=&quot;http://jena.sourceforge.net/&quot; id=&quot;link-id11a25ac0&quot;&gt;Jena&lt;/a&gt; and &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0xb003180&quot;&gt;Virtuoso&lt;/a&gt;, so let&amp;#39;s use these. The BSBM business intelligence metric should be a separate metric off the same data. Adding synthetic sales figures would make more interesting queries possible. For example, producing recommendations like &amp;quot;customers who bought this also bought xxx.&amp;quot;&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;For the SPARQL community&lt;/b&gt;, BSBM sends the message that one ought to support parameterized queries and stored procedures. 
This would be a &lt;a href=&quot;http://www.w3.org/TR/rdf-sparql-protocol/&quot; id=&quot;link-id109e2448&quot;&gt;SPARQL protocol&lt;/a&gt; extension; the SPARUL syntax should also have a way of calling a procedure. Something like &lt;code&gt;select proc (??, ??)&lt;/code&gt; would be enough, where &lt;code&gt;??&lt;/code&gt; is a parameter marker, like &lt;code&gt;?&lt;/code&gt; in &lt;a href=&quot;http://dbpedia.org/resource/Open_Database_Connectivity&quot; id=&quot;link-id13febf48&quot;&gt;ODBC&lt;/a&gt;/&lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id120416a8&quot;&gt;JDBC&lt;/a&gt;.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;b&gt;Add transactions.&lt;/b&gt; Especially if we are contrasting mapping vs. storing triples, having an update flow is relevant. In practice, this could be done by having the test driver send web service requests for order entry, with the SUT (system under test) implementing these as updates to the triples or to a mapped relational store. This could use stored procedures or logic in an app server.&lt;/p&gt; &lt;/li&gt; &lt;/ul&gt; &lt;h3&gt;Comments on Query Mix&lt;/h3&gt; &lt;p&gt;The time of most queries grows less than linearly with the scale factor. Q6 is an exception if it is not implemented using a text index. Without the text index, Q6 will inevitably come to dominate query time as the scale is increased, and thus will make the benchmark less relevant at larger scales.&lt;/p&gt; &lt;h2&gt;Next&lt;/h2&gt; &lt;p&gt;We will include the sources of our RDF view definitions and other material for running BSBM with our forthcoming Virtuoso Open Source 5.0.8 release. This also includes all the query optimization work done for BSBM. This will be available in the coming days.&lt;/p&gt; &lt;/div&gt;</atom:content>
  <atom:author>
    <atom:name>Virtuso Data Space Bot</atom:name>
    <atom:email>kidehen@openlinksw.com</atom:email>
   </atom:author>
  <atom:category term="database" />
  <atom:category term="databases" />
  <atom:category term="benchmarking" />
  <atom:category term="scalability" />
  <atom:category term="rdf" />
  <atom:category term="jdbc" />
  <atom:category term="sql" />
  <atom:category term="odbc" />
  <atom:category term="oracle" />
  <atom:category term="semanticweb" />
  <atom:category term="sparql" />
  <atom:category term="linux" />
  <atom:category term="openlink" />
  <atom:category term="virtuoso" />
  <atom:category term="dataspace" />
  <atom:category term=".net" />
  <atom:updated>2008-08-06T16:29:44.3000-04:00</atom:updated>
 </atom:entry>
 <atom:entry>
  <atom:title>Exploiting the RDF-based Linked Data Web using .NET via LINQ</atom:title>
  <atom:id>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1403</atom:id>
  <atom:link href="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1403" type="text/html" rel="alternate" />
  <atom:published>2008-08-01T17:58:19Z</atom:published>
  <atom:content type="html">&lt;div&gt; &lt;div style=&quot;display:none;&quot;&gt;Exploiting the RDF-based Linked Data Web using .NET via LINQ&lt;/div&gt; Recently OpenLink has been investigating &lt;a href=&quot;http://code.google.com/p/linqtordf/&quot; id=&quot;link-id0x20d8a248&quot;&gt;LinqToRdf&lt;/a&gt;, an exciting project from &lt;a href=&quot;http://aabs.wordpress.com&quot; id=&quot;link-id0x21f48218&quot;&gt;Andrew Matthews&lt;/a&gt; which aims to bring the Semantic Web to .NET. Because of their language bindings and heritage, existing RDF APIs such as Sesame, Jena and Redland predominantly favour non-Windows clients. Conversely Microsoft&amp;#39;s ADO.NET Data Services provides a Redmond vision of exposing data on the Web but has no support for RDF. LinqToRdf is, as far as we&amp;#39;re aware, the first serious effort to fill this gap and provide a bridge between Windows applications and the Semantic Web.&lt;br /&gt; &lt;br /&gt;OpenLink has produced a whitepaper &lt;a href=&quot;http://virtuoso.openlinksw.com/Whitepapers/html/linqtordf/linqtordf1.htm&quot; id=&quot;link-id0x21f47348&quot;&gt;Exploiting the RDF-based Linked Data Web using .NET via LINQ&lt;/a&gt; which provides a brief overview of LinqToRdf and an example of its use to retrieve data from the &lt;a href=&quot;http://musicbrainz.org&quot; id=&quot;link-id0x21f49a88&quot;&gt;MusicBrainz&lt;/a&gt; music metadatabase via an OpenLink Virtuoso Quad Store. The document also illustrates the use of the &lt;a href=&quot;http://virtuoso.openlinksw.com/Whitepapers/pdf/sponger_whitepaper_10102007.pdf&quot; id=&quot;link-id0x21f92758&quot;&gt;Virtuoso Sponger&lt;/a&gt;, an &amp;quot;RDFizer&amp;quot; forming part of the RDF toolset provided with OpenLink Virtuoso Universal Server, to convert the raw MusicBrainz data to RDF on-the-fly. 
A further aim of the whitepaper is to draw attention to Andrew&amp;#39;s excellent effort and hopefully tempt members of the Semantic Web community to contribute.&lt;br /&gt; &lt;br /&gt;Andrew was kind enough to incorporate some changes into LinqToRdf in response to OpenLink&amp;#39;s testing. These have been included with major improvements of his own in a new release - &lt;a href=&quot;http://aabs.wordpress.com/2008/08/01/announcing-linqtordf-v08/&quot; id=&quot;link-id0x21f465b8&quot;&gt;LinqToRdf v0.8&lt;/a&gt;.&lt;br /&gt; &lt;br /&gt;Carl Blakeley&lt;br /&gt; &lt;/div&gt;</atom:content>
  <atom:author>
    <atom:name>Virtuso Data Space Bot</atom:name>
    <atom:email>kidehen@openlinksw.com</atom:email>
   </atom:author>
  <atom:category term="linqtordf linq semantic web .net" />
  <atom:updated>2008-08-01T13:58:19.4000-04:00</atom:updated>
 </atom:entry>
 <atom:entry>
  <atom:title>Virtuoso Optimizations for the Berlin SPARQL Benchmark</atom:title>
  <atom:id>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1401</atom:id>
  <atom:link href="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1401" type="text/html" rel="alternate" />
  <atom:published>2008-07-30T18:52:11Z</atom:published>
  <atom:content type="html">&lt;div&gt; &lt;div style=&quot;display:none;&quot;&gt;Virtuoso Optimizations for the Berlin SPARQL Benchmark &lt;/div&gt; &lt;p&gt;We had a look at Chris Bizer&amp;#39;s initial results with the &lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id105c9f78&quot;&gt;Berlin SPARQL Benchmark&lt;/a&gt; (&lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id102d62b0&quot;&gt;BSBM&lt;/a&gt;) on &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id13eb9780&quot;&gt;Virtuoso&lt;/a&gt;. The first results were rather bad, as nearly all of the run time was spent optimizing the &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id14a51258&quot;&gt;SPARQL&lt;/a&gt; statements and under 10% on actually running them.&lt;/p&gt; &lt;p&gt;So I spent a couple of days on the &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0xaad28d0&quot;&gt;SPARQL&lt;/a&gt;/&lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id108745b0&quot;&gt;SQL&lt;/a&gt; compiler, so that it makes a better initial guess at the execution plan and streamlines some operations. In fact, many of the queries in &lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id0xaa230b8&quot;&gt;BSBM&lt;/a&gt; are not particularly sensitive to the execution plan, as they access a very small portion of the database. So to close the matter, I put in a flag that makes the &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x1e9e8e28&quot;&gt;SQL&lt;/a&gt; compiler give up on devising new plans if the time of the best plan so far is less than the time spent compiling so far.&lt;/p&gt; &lt;p&gt;With these changes, available now as a diff on top of 5.0.7, we run quite well, several times better than initially. 
With the compiler time cut-off in place (ini parameter &lt;code&gt;StopCompilerWhenXOverRunTime = 1&lt;/code&gt;), we get the following times, output from the BSBM test driver:&lt;/p&gt; &lt;blockquote&gt; &lt;pre&gt;
Starting test...
0: 1031.22 ms, total: 1151 ms
1: 982.89 ms, total: 1040 ms
2: 923.27 ms, total: 968 ms
3: 898.37 ms, total: 932 ms
4: 855.70 ms, total: 865 ms

Scale factor: 10000
Number of query mix runs: 5 times
min/max Query mix runtime: 0.8557 s / 1.0312 s
Total runtime: 4.691 seconds
QMpH: 3836.77 query mixes per hour
CQET: 0.93829 seconds average runtime of query mix
CQET (geom.): 0.93625 seconds geometric mean runtime of query mix

Metrics for Query 1:
Count: 5 times executed in whole run
AQET: 0.012212 seconds (arithmetic mean)
AQET(geom.): 0.009934 seconds (geometric mean)
QPS: 81.89 Queries per second
minQET/maxQET: 0.00684000s / 0.03115700s
Average result count: 7.0
min/max result count: 3 / 10

Metrics for Query 2:
Count: 35 times executed in whole run
AQET: 0.030490 seconds (arithmetic mean)
AQET(geom.): 0.029776 seconds (geometric mean)
QPS: 32.80 Queries per second
minQET/maxQET: 0.02467300s / 0.06753000s
Average result count: 22.5
min/max result count: 15 / 30

Metrics for Query 3:
Count: 5 times executed in whole run
AQET: 0.006947 seconds (arithmetic mean)
AQET(geom.): 0.006905 seconds (geometric mean)
QPS: 143.95 Queries per second
minQET/maxQET: 0.00580000s / 0.00795100s
Average result count: 4.0
min/max result count: 0 / 10

Metrics for Query 4:
Count: 5 times executed in whole run
AQET: 0.008858 seconds (arithmetic mean)
AQET(geom.): 0.008829 seconds (geometric mean)
QPS: 112.89 Queries per second
minQET/maxQET: 0.00804400s / 0.01019500s
Average result count: 3.4
min/max result count: 0 / 10

Metrics for Query 5:
Count: 5 times executed in whole run
AQET: 0.087542 seconds (arithmetic mean)
AQET(geom.): 0.087327 seconds (geometric mean)
QPS: 11.42 Queries per second
minQET/maxQET: 0.08165600s / 0.09889200s
Average result count: 5.0
min/max result count: 5 / 5

Metrics for Query 6:
Count: 5 times executed in whole run
AQET: 0.131222 seconds (arithmetic mean)
AQET(geom.): 0.131216 seconds (geometric mean)
QPS: 7.62 Queries per second
minQET/maxQET: 0.12924200s / 0.13298200s
Average result count: 3.6
min/max result count: 3 / 5

Metrics for Query 7:
Count: 20 times executed in whole run
AQET: 0.043601 seconds (arithmetic mean)
AQET(geom.): 0.040890 seconds (geometric mean)
QPS: 22.94 Queries per second
minQET/maxQET: 0.01984400s / 0.06012600s
Average result count: 26.4
min/max result count: 5 / 96

Metrics for Query 8:
Count: 10 times executed in whole run
AQET: 0.018168 seconds (arithmetic mean)
AQET(geom.): 0.016205 seconds (geometric mean)
QPS: 55.04 Queries per second
minQET/maxQET: 0.01097600s / 0.05066900s
Average result count: 12.8
min/max result count: 6 / 20

Metrics for Query 9:
Count: 20 times executed in whole run
AQET: 0.043813 seconds (arithmetic mean)
AQET(geom.): 0.043807 seconds (geometric mean)
QPS: 22.82 Queries per second
minQET/maxQET: 0.04274900s / 0.04504100s
Average result count: 0.0
min/max result count: 0 / 0

Metrics for Query 10:
Count: 15 times executed in whole run
AQET: 0.030697 seconds (arithmetic mean)
AQET(geom.): 0.029651 seconds (geometric mean)
QPS: 32.58 Queries per second
minQET/maxQET: 0.02072000s / 0.03975700s
Average result count: 1.1
min/max result count: 0 / 4

real 0 m 5.485 s
user 0 m 2.233 s
sys 0 m 0.170 s
&lt;/pre&gt;&lt;/blockquote&gt; &lt;p&gt;Of the approximately 5.5 seconds of running five query mixes, the test driver spends 2.2 s. The server side processing time is 3.1 s, of which SQL compilation is 1.35 s. The rest is miscellaneous system time. The measurement is on 64-bit Linux, 2GHz dual-Xeon 5130 (8 cores) with 8G RAM. 
&lt;/p&gt; &lt;p&gt;We note that this type of workload would be done with stored procedures or prepared, parameterized queries in the SQL world.&lt;/p&gt; &lt;p&gt;Some further tuning is still to come, but this addresses the bulk of the matter. There will be a separate message about the patch containing these improvements.&lt;/p&gt; &lt;/div&gt;</atom:content>
  <atom:author>
    <atom:name>Virtuso Data Space Bot</atom:name>
    <atom:email>kidehen@openlinksw.com</atom:email>
   </atom:author>
  <atom:category term="database" />
  <atom:category term="databases" />
  <atom:category term="benchmarking" />
  <atom:category term="scalability" />
  <atom:category term="semanticweb" />
  <atom:category term="sparql" />
  <atom:category term="linux" />
  <atom:category term="virtuoso" />
  <atom:updated>2008-08-06T16:29:42.000-04:00</atom:updated>
 </atom:entry>
 <atom:entry>
  <atom:title>Virtuoso 5.0.7 Release, Now With Jena and Sesame APIs</atom:title>
  <atom:id>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1393</atom:id>
  <atom:link href="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1393" type="text/html" rel="alternate" />
  <atom:published>2008-07-17T17:18:09Z</atom:published>
  <atom:content type="html">&lt;div&gt; &lt;div style=&quot;display:none;&quot;&gt;Virtuoso 5.0.7 Release, Now With Jena and Sesame APIs&lt;/div&gt; &lt;h2&gt;Improvements&lt;/h2&gt; &lt;ul&gt; &lt;li&gt; &lt;a href=&quot;http://docs.openlinksw.com:80/virtuoso/rdfnativestorageproviders.html&quot; id=&quot;link-id13e54d98&quot;&gt;Full operation&lt;/a&gt; with &lt;a href=&quot;http://jena.sourceforge.net/&quot; id=&quot;link-id0x11a3d360&quot;&gt;Jena&lt;/a&gt; and &lt;a href=&quot;http://sourceforge.net/projects/sesame/&quot; id=&quot;link-id0x1108d428&quot;&gt;Sesame&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x1288aa00&quot;&gt;RDF&lt;/a&gt; Frameworks. This fully replaces any previous attempts at interop, and introduces samples and test suites.&lt;/li&gt; &lt;li&gt;Better support for alternate RDF indexing schemes&lt;/li&gt; &lt;li&gt;Parallel operation of the RDF Sponger, importing multiple sources concurrently.&lt;/li&gt; &lt;li&gt;New &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x128a9810&quot;&gt;data&lt;/a&gt; formats supported for on-demand RDF-ization in the Sponger&lt;/li&gt; &lt;li&gt;More efficient support for inference of subclass and sub-property; now capable of efficiently handling taxonomies of tens of thousands of classes&lt;/li&gt; &lt;li&gt; &lt;a href=&quot;http://dbpedia.org/resource/Web_Ontology_Language&quot; id=&quot;link-id0x6af0678&quot;&gt;OWL&lt;/a&gt; &lt;a href=&quot;http://docs.openlinksw.com:80/virtuoso/rdfsparqlrule.html#rdfsparqlruleintro&quot; id=&quot;link-id104d58d8&quot;&gt;equivalentClass and equivalentProperty&lt;/a&gt; support.&lt;/li&gt; &lt;li&gt; &lt;a href=&quot;http://docs.openlinksw.com:80/virtuoso/rdfdatarepresentation.html#rdfdynamiclocal&quot; id=&quot;link-id109606a8&quot;&gt;Dynamic IRI host part&lt;/a&gt; support for mapped data and for metadata of local resources. 
After renaming the host, or when using multiple virtual hosts, URIs with the right host part are accepted and refer to the same thing; no duplicate storage is required.&lt;/li&gt; &lt;li&gt; &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x12e0cc38&quot;&gt;SPARQL&lt;/a&gt; optimizations for &lt;code&gt;LIMIT&lt;/code&gt; and &lt;code&gt;OFFSET&lt;/code&gt; &lt;/li&gt; &lt;/ul&gt; &lt;h2&gt;Documentation&lt;/h2&gt; &lt;ul&gt; &lt;li&gt; &lt;a href=&quot;http://docs.openlinksw.com:80/virtuoso/perfdiag.html#perfdiagqueryplans&quot; id=&quot;link-id10a56dd0&quot;&gt;How to read query plans and how to use the key performance meters&lt;/a&gt; &lt;/li&gt; &lt;li&gt; &lt;a href=&quot;http://docs.openlinksw.com:80/virtuoso/rdfperformancetuning.html#rdfperfcost&quot; id=&quot;link-id106cb5c0&quot;&gt;How to diagnose SPARQL queries and how to decide what indexing scheme is right for each RDF use case&lt;/a&gt; &lt;/li&gt; &lt;li&gt;How to debug RDF views &lt;ul&gt; &lt;li&gt; &lt;a href=&quot;http://docs.openlinksw.com:80/virtuoso/sparqldebug.html&quot; id=&quot;link-id133b4420&quot;&gt;Better documentation of SPARQL extensions and options&lt;/a&gt; &lt;/li&gt; &lt;li&gt; &lt;a href=&quot;http://docs.openlinksw.com:80/virtuoso/rdfviews.html#rdfviewnorthwindexample1&quot; id=&quot;link-id1060fdd8&quot;&gt;A sample of correct RDF view usage with the Northwind demo data&lt;/a&gt; &lt;/li&gt; &lt;/ul&gt; &lt;/li&gt; &lt;/ul&gt; &lt;h2&gt;Bug Fixes&lt;/h2&gt; &lt;ul&gt; &lt;li&gt;Generally improved safety of built-in functions, better argument checking.&lt;/li&gt; &lt;li&gt;Verified UTF-8 international character support in all RDF use cases, &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x12839fd0&quot;&gt;SQL&lt;/a&gt; client/&lt;a href=&quot;http://www.w3.org/TR/rdf-sparql-protocol/&quot; id=&quot;link-id0x1288f350&quot;&gt;SPARQL protocol&lt;/a&gt;/all data formats.&lt;/li&gt; &lt;/ul&gt; &lt;/div&gt;</atom:content>
  <atom:author>
    <atom:name>Virtuso Data Space Bot</atom:name>
    <atom:email>kidehen@openlinksw.com</atom:email>
   </atom:author>
  <atom:category term="database" />
  <atom:category term="databases" />
  <atom:category term="rdf" />
  <atom:category term="semanticweb" />
  <atom:category term="sparql" />
  <atom:category term="howto" />
  <atom:category term="virtuoso" />
  <atom:updated>2008-07-17T15:28:22.2000-04:00</atom:updated>
 </atom:entry>
 <atom:entry>
  <atom:title>De Paradigmata and The Foundational Issues</atom:title>
  <atom:id>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1383</atom:id>
  <atom:link href="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1383" type="text/html" rel="alternate" />
  <atom:published>2008-06-09T14:02:21Z</atom:published>
  <atom:content type="html">&lt;div&gt; &lt;div style=&quot;display:none;&quot;&gt;De Paradigmata and The Foundational Issues&lt;/div&gt; &lt;p&gt;I thought that we had talked ourselves to exhaustion and beyond over the issue of the &lt;a href=&quot;http://dbpedia.org/resource/Semantic_Web&quot; id=&quot;link-id0x1dd07c68&quot;&gt;semantic web&lt;/a&gt; layer cake. Apparently not. There was a paper called &lt;i&gt;Functional Architecture for the Semantic Web&lt;/i&gt; by &lt;a href=&quot;http://gerberaj.googlepages.com/&quot; id=&quot;link-id106b8130&quot;&gt;Aurona Gerber&lt;/a&gt; et al at &lt;a href=&quot;http://www.eswc2008.org/&quot; id=&quot;link-id0x17137300&quot;&gt;ESWC2008&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;The thrust of the matter was that for newcomers the layer cake was confusing and did not clearly indicate the architecture. Why, sure. My point is that no rearranging of the boxes will cut it for the general case.&lt;/p&gt; &lt;p&gt;Any diagram containing the boxes of the layer cake (i.e., &lt;a href=&quot;http://dbpedia.org/resource/Uniform_Resource_Identifier&quot; id=&quot;link-id0x1a9138c0&quot;&gt;URI&lt;/a&gt;, &lt;a href=&quot;http://dbpedia.org/resource/XML&quot; id=&quot;link-id0x1cc4a8d8&quot;&gt;XML&lt;/a&gt;, &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0xa21c1308&quot;&gt;SPARQL&lt;/a&gt;, &lt;a href=&quot;http://dbpedia.org/resource/Web_Ontology_Language&quot; id=&quot;link-id0x1aa28050&quot;&gt;OWL&lt;/a&gt;, &lt;a href=&quot;http://dbpedia.org/resource/Rule_Interchange_Format&quot; id=&quot;link-id0x137268d0&quot;&gt;RIF&lt;/a&gt;, Crypto, etc., etc.) 
in whatever order or arrangement can at best be a sort of overview of how these standards reference each other.&lt;/p&gt; &lt;p&gt;Such diagrams are a little like saying that a car combines the combustion properties of fuel/air mixes with the tension and compression resistance properties of metals and composites for producing motion and secondly links to Newton&amp;#39;s laws of motion and to aerodynamics.&lt;/p&gt; &lt;p&gt;Not false. But it does not say that a car is good for economical commute or showing off at the strip or any number of niches that a mature industry has grown to serve.&lt;/p&gt; &lt;p&gt;Now, talking of software engineering, modules and interfaces are good and even necessary. The trick is to know where to put the interface.&lt;/p&gt; &lt;p&gt;Such a thing cannot possibly be inferred from the standards&amp;#39; inter-reference picture. APIs, especially if these are Web service APIs, should go where there is low &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x196fcba0&quot;&gt;data&lt;/a&gt; volume and tolerance for latency. For example, either inference is a preprocessing step or it is embedded right inside a SPARQL engine. Such a thing cannot be seen from the picture. Same for trust. Trust is not an after-thought at the top of the picture, except maybe in the sense of referring to the other parts.&lt;/p&gt; &lt;p&gt;We hear it over and over. Scale and speed are critical. Arrange the blocks of any real system as makes sense for data flow; do not confuse literature references with control or data structure.&lt;/p&gt; &lt;p&gt;The even-more foundational issue is the promotion of the general concept of a Web of Data.&lt;/p&gt; &lt;p&gt;The core idea that the Web would be a query-able collection of data with meaningful reference between data of different provenance cannot be inferred from the picture, even though this should be its primary message. 
Or, better put, the first picture shown should stress this idea, and then one could leave the layer cake, in whatever version, for explaining the standards&amp;#39; order of evolution or inter-reference.&lt;/p&gt; &lt;p&gt;So, the value proposition:&lt;/p&gt; &lt;p&gt;Why? Explosion of data volume, increased need to keep up to date, increasing opportunity cost of not keeping up in real time.&lt;/p&gt; &lt;p&gt;What? An architecture that is designed for unanticipated joining and evolution of data across heterogeneous sources, either at Web or enterprise scale.&lt;/p&gt; &lt;p&gt;How? URI everything and everything is cool, or, give things global names. Use &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x13700d00&quot;&gt;RDF&lt;/a&gt;. Reuse names or ontologies where you can. (An ontology is a set of classes and property names plus some more.) Map relational data on the fly or store as RDF, whichever works. Query with SPARQL, easier than &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x17865208&quot;&gt;SQL&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;So, my challenge for the graphics people would be to make an illustration of the above. Forget the alphabet soup. Show the layer cake as a historical reference or literature guide. Do not imply that this proliferation of boxes equates to an equal proliferation of Web services, for example.&lt;/p&gt; &lt;/div&gt;</atom:content>
  <atom:author>
    <atom:name>Virtuso Data Space Bot</atom:name>
    <atom:email>kidehen@openlinksw.com</atom:email>
   </atom:author>
  <atom:category term="architecture" />
  <atom:category term="hpc" />
  <atom:category term="webservices" />
  <atom:category term="rdf" />
  <atom:category term="xml" />
  <atom:category term="semanticweb" />
  <atom:category term="sparql" />
  <atom:updated>2008-06-11T15:54:49.000-04:00</atom:updated>
 </atom:entry>
 <atom:entry>
  <atom:title>voiD, or Will the LOD Cloud Bring Rain?</atom:title>
  <atom:id>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1382</atom:id>
  <atom:link href="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1382" type="text/html" rel="alternate" />
  <atom:published>2008-06-09T14:02:20Z</atom:published>
  <atom:content type="html">&lt;div&gt; &lt;div style=&quot;display:none;&quot;&gt;voiD, or Will the LOD Cloud Bring Rain?&lt;/div&gt; &lt;p&gt;At &lt;a href=&quot;http://www.eswc2008.org/&quot; id=&quot;link-id0x1c3bec48&quot;&gt;ESWC2008&lt;/a&gt;, we saw the &lt;a href=&quot;http://community.linkeddata.org/dataspace/organization/lod#this&quot; id=&quot;link-id0x1f0db270&quot;&gt;Linked Open Data&lt;/a&gt; Cloud condense its first drops of precipitation.&lt;/p&gt; &lt;p&gt; &lt;a href=&quot;http://community.linkeddata.org/MediaWiki/index.php?MetaLOD#Kick-off_meeting_at_ESWC08&quot; id=&quot;link-id106ee858&quot;&gt;voiD, Vocabulary of Interlinked Datasets&lt;/a&gt;, is an idea whose time has clearly come. By the end of the conference, many speakers had already adopted the &lt;a href=&quot;http://dbpedia.org/resource/Meme&quot; id=&quot;link-id0x16c99ad0&quot;&gt;meme&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;The point is to describe what is inside the &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x1c540958&quot;&gt;data&lt;/a&gt; sets. People may know this from having worked with the sets or from putting them together but to an outsider this is not evident.&lt;/p&gt; &lt;p&gt;The Semantic Sitemap says where there are files or end points for access. But it does not say what is inside these. Also for federation, it is important to be able to determine whether it makes sense to send a particular query to a particular end point.&lt;/p&gt; &lt;p&gt;If we play this right, this is what voiD will provide. 
I am reminded of Dan Simmons&amp;#39; flamboyant Hyperion sci-fi series, where the &amp;quot;void which binds&amp;quot; was a sort of hyperspace containing the thoughts of entities, past and present, and even provided teleportation.&lt;/p&gt; &lt;p&gt;So what does the voiD hold, aside from infinite potentialities?&lt;/p&gt; &lt;p&gt;The obvious part is DC-like provenance, version, authorship, license and such data-set-wide &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id0x16c05280&quot;&gt;information&lt;/a&gt;. Also the subject matter could be classified by reference to &lt;a href=&quot;http://umbel.org/about/&quot; id=&quot;link-id0x1abf1558&quot;&gt;UMBEL&lt;/a&gt; or the &lt;a href=&quot;http://www.mpi-inf.mpg.de/~suchanek/downloads/yago/&quot; id=&quot;link-id0x1b49ee78&quot;&gt;Yago&lt;/a&gt; classification of &lt;a href=&quot;http://dbpedia.org/resource/DBpedia&quot; id=&quot;link-id0x184dea28&quot;&gt;DBpedia&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;More is needed, though. The simple part is listing the ontologies, if any. 
Also a set of namespaces would be an idea but this could be very large.&lt;/p&gt; &lt;p&gt;So let us look at what we&amp;#39;d like to be able to answer with the voiD set.&lt;/p&gt; &lt;p&gt;The following could be a sample of voiD questions:&lt;/p&gt; &lt;ul&gt; &lt;li&gt; &lt;p&gt; &lt;i&gt;What subjects are in the &lt;a href=&quot;http://community.linkeddata.org/dataspace/organization/lod#this&quot; id=&quot;link-id0x1bbac318&quot;&gt;LOD&lt;/a&gt; cloud?&lt;/i&gt; &lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;i&gt;Given this &lt;a href=&quot;http://dbpedia.org/resource/Uniform_Resource_Identifier&quot; id=&quot;link-id0x1f74c7e8&quot;&gt;URI&lt;/a&gt;, what set in the LOD cloud can tell me more?&lt;/i&gt; This is divided into asking a text index like &lt;a href=&quot;http://sindice.org/&quot; id=&quot;link-id0x1d57a8f8&quot;&gt;Sindice&lt;/a&gt; for the location, getting the namespace or data set and then querying voiD.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;i&gt;What need I federate/load in order to combine all that is reachable from a given vocabulary?&lt;/i&gt; There could be, for example, a graph showing the data sets and edges between them, edges being qualified by a set of sameAs assertions, itself a voiD-described set, if translations were needed.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;i&gt;What sets are from the same or equally trusted publisher as this one?&lt;/i&gt; &lt;/p&gt; &lt;/li&gt; &lt;/ul&gt; &lt;p&gt;These things are roughly divided into description of the set and then some details on how it is stored on a given end point.&lt;/p&gt; &lt;ul&gt; &lt;li&gt; &lt;p&gt; &lt;i&gt;Given this set, in which other sets will I find use of the same URIs?&lt;/i&gt; For example, if I have language version x, I wish to know that language version y will have the same URIs insofar as the things meant are the same.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;i&gt;Given this set, which sets of sameAs assertions will I have for mapping to which other 
sets?&lt;/i&gt; For example, if I have &lt;a href=&quot;http://www.geonames.org/&quot; id=&quot;link-id0x1b372140&quot;&gt;Geonames&lt;/a&gt;, I wish to know that set x will map at least some of the URIs in Geonames to DBpedia URIs.&lt;/p&gt; &lt;/li&gt; &lt;/ul&gt; &lt;p&gt;Let me further point out that it is increasingly clear to the community that universal sameAs is dubious, hence sameAs assertions ought to be kept separate and included or excluded depending on the usage context.&lt;/p&gt; &lt;ul&gt; &lt;li&gt; &lt;p&gt; &lt;i&gt;Given this set, what are the interesting queries I can do?&lt;/i&gt; This is a sort of advertisement for human consumption. This is not a list of queries for crashing the end point. Denial of service can be done in &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x1b25dea8&quot;&gt;SPARQL&lt;/a&gt; without knowing the end point content anyhow, so this is not an added risk exposure.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;i&gt;Vocabularies used.&lt;/i&gt; This is a reference to the OWL or RDFS resources giving the applicable ontologies, if present. Also, a complete list of classes whose direct instances actually occur in the set is useful.&lt;/p&gt; &lt;/li&gt; &lt;li&gt; &lt;p&gt; &lt;i&gt;Ballpark cardinality.&lt;/i&gt; Something like a &lt;a href=&quot;http://darq.sourceforge.net/&quot; id=&quot;link-id0x1ed8f580&quot;&gt;DARQ&lt;/a&gt; optimization profile would be a good idea. I would say that there should be a possibility of just including a DARQ description file as is. This is a sort of baseline and since it already exists, we are spared the committee trouble of figuring out what it ought to contain and what not. If we start defining this from scratch, it will take long. Further, let this be optional. Quite independently of this, query processors may make optimization-related queries to remote end points insofar as the specific end point supports these. This will come in time. 
For now, just the basics.&lt;/p&gt; &lt;/li&gt; &lt;/ul&gt; &lt;p&gt;Along with this, LOD SPARQL end points could adopt a couple of basic conventions. The simplest would be to agree that each would host a graph with a given URI that would contain the voiD descriptions of the data sets contained, along with the graph URI used for each set, if different from the publisher&amp;#39;s URI for the graph. There is a point to this since an end point may load multiple data sets into one graph.&lt;/p&gt; &lt;p&gt;We hope to have a good idea of the matter in a couple of weeks, and certainly a general statement of direction to publish at &lt;a href=&quot;http://www.linkeddataplanet.com/&quot; id=&quot;link-id0x1b049830&quot;&gt;Linked Data Planet&lt;/a&gt;.&lt;/p&gt; &lt;/div&gt;</atom:content>
  <atom:author>
    <atom:name>Virtuso Data Space Bot</atom:name>
    <atom:email>kidehen@openlinksw.com</atom:email>
   </atom:author>
  <atom:category term="semanticweb" />
  <atom:category term="sparql" />
  <atom:updated>2008-06-11T15:15:21.000-04:00</atom:updated>
 </atom:entry>
 <atom:entry>
  <atom:title>The DARQ Matter of Federation</atom:title>
  <atom:id>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1381</atom:id>
  <atom:link href="http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1381" type="text/html" rel="alternate" />
  <atom:published>2008-06-09T14:02:19Z</atom:published>
  <atom:content type="html">&lt;div&gt; &lt;div style=&quot;display:none;&quot;&gt;The DARQ Matter of Federation&lt;/div&gt; &lt;p&gt;Astronomers propose that the universe is held together, so to speak, by the gravity of invisible &amp;quot;dark matter&amp;quot; spread in interstellar and intergalactic space.&lt;/p&gt; &lt;p&gt;The &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x19dbf410&quot;&gt;data&lt;/a&gt; web will likewise be held together by federation, also an invisible factor. As in Minkowski space, so in &lt;a href=&quot;http://dbpedia.org/resource/Cyberspace&quot; id=&quot;link-id0x9fc13ff8&quot;&gt;cyberspace&lt;/a&gt;.&lt;/p&gt; &lt;p&gt;To take the astronomical analogy further, putting too much visible stuff in one place makes a black hole, whose chief properties are that it is very heavy, can only get heavier and that nothing comes out.&lt;/p&gt; &lt;p&gt; &lt;a href=&quot;http://darq.sourceforge.net/&quot; id=&quot;link-id0x1d06bd88&quot;&gt;DARQ&lt;/a&gt; is Bastian Quilitz&amp;#39;s federated extension of the &lt;a href=&quot;http://jena.sourceforge.net/&quot; id=&quot;link-id0x1cf28f70&quot;&gt;Jena&lt;/a&gt; &lt;a href=&quot;http://jena.sourceforge.net/ARQ/&quot; id=&quot;link-id0x1cba22c8&quot;&gt;ARQ&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x171c7dc8&quot;&gt;SPARQL&lt;/a&gt; processor. It has existed for a while and was also presented at &lt;a href=&quot;http://www.eswc2008.org/&quot; id=&quot;link-id0x1ed53cd0&quot;&gt;ESWC2008&lt;/a&gt;. There is also SPARQL FED from Andy Seaborne, an explicit means of specifying which end point will process which fragment of a distributed SPARQL query. Still, for federation to deliver in an open, decentralized world, it must be transparent. 
For a specific application, with a predictable workload, it is of course OK to partition queries explicitly.&lt;/p&gt; &lt;p&gt;Bastian had split &lt;a href=&quot;http://dbpedia.org/resource/DBpedia&quot; id=&quot;link-id0x1ce846c0&quot;&gt;DBpedia&lt;/a&gt; among five &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x1cad0640&quot;&gt;Virtuoso&lt;/a&gt; servers and was querying this set with DARQ. The end result was that there was a rather frightful cost of federation as opposed to all the data residing in a single Virtuoso. The other result was that if selectivity of predicates was not correctly guessed by the federation engine, the proposition was a non-starter. With correct join order it worked, though.&lt;/p&gt; &lt;p&gt;Yet, we really want federation. Looking further down the road, we simply must make federation work. This is just as necessary as running on a server cluster for mid-size workloads.&lt;/p&gt; &lt;p&gt;Since we are convinced of the cause, let&amp;#39;s talk about the means.&lt;/p&gt; &lt;p&gt;For DARQ as it now stands, there&amp;#39;s probably an order of magnitude or even more to gain from a couple of simple tricks. If going to a SPARQL end point that is not the outermost in the loop join sequence, batch the requests together in one &lt;a href=&quot;http://dbpedia.org/resource/Hypertext_Transfer_Protocol&quot; id=&quot;link-id0x19a48280&quot;&gt;HTTP&lt;/a&gt;/1.1 message. So, if the query is &amp;quot;get me my friends living in cities of over a million people,&amp;quot; there will be the fragment &amp;quot;get city where x lives&amp;quot; and later &amp;quot;ask if population of x greater than 1000000&amp;quot;. If I have 100 friends, I send the 100 requests in a batch to each eligible server.&lt;/p&gt; &lt;p&gt;Further, if running against a server of known brand, use a client-server connection and prepared statements with array parameters. 
This may well improve the processing speed at the remote end point by another order of magnitude. This gain may however not be as great as the latency savings from message batching. We will provide a sample of how to do this with Virtuoso over &lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id0x1cf18278&quot;&gt;JDBC&lt;/a&gt; so Bastian can try this if interested.&lt;/p&gt; &lt;p&gt;These simple things will give a lot of mileage and may even decide whether federation is an option in specific applications. For the open web however, these measures will not yet win the day.&lt;/p&gt; &lt;p&gt;When federating &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x1cf7d0e8&quot;&gt;SQL&lt;/a&gt;, colocation of data is sort of explicit. If two tables are joined and they are in the same source, then the join can go to the source. For SPARQL this is also so but with a twist:&lt;/p&gt; &lt;p&gt;If a foaf:Person is found on a given server, this does not mean that the Person&amp;#39;s geek code or email hash will be on the same server. Thus &lt;code&gt;{?p name &amp;quot;Johnny&amp;quot; . ?p geekCode ?g . ?p emailHash ?h }&lt;/code&gt; does not necessarily denote a colocated join if many servers serve items of the vocabulary.&lt;/p&gt; &lt;p&gt;However, in most practical cases, for obtaining a rapid answer, treating this as a colocated fragment will be appropriate. Thus, it may be necessary to be able to declare that geek codes will be assumed colocated with names. This will save a lot of message passing and offer decent, if not theoretically total, recall. For search-style applications, starting with such assumptions will make sense. 
If nothing is found, then we can partition each join step separately for the unlikely case that some server gave geek codes but not names.&lt;/p&gt; &lt;p&gt;For Virtuoso, we find that a federated query&amp;#39;s asynchronous, parallel evaluation model is not so different from that on a local cluster. So the cluster version could have the option of federated query. The difference is that a cluster is local and tightly coupled and predictably partitioned but a federated setting is none of these.&lt;/p&gt; &lt;p&gt;For description, we would take DARQ&amp;#39;s description model and maybe extend it a little where needed. Also we would enhance the protocol to allow just asking for the query cost estimate given a query with literals specified. We will do this eventually.&lt;/p&gt; &lt;p&gt;We would like to talk to Bastian about large improvements to DARQ, especially when working with Virtuoso. We&amp;#39;ll see.&lt;/p&gt; &lt;p&gt;Of course, one mode of federating is the crawl-as-you-go approach of the Virtuoso &lt;a href=&quot;http://virtuoso.openlinksw.com/Whitepapers/html/VirtSpongerWhitePaper.html&quot; id=&quot;link-id0x1e163140&quot;&gt;Sponger&lt;/a&gt;. This will bring in fragments following seeAlso or sameAs declarations or other references. This will however not have the recall of a warehouse or federation over well-described SPARQL end points. But up to a certain volume it has the speed of local storage.&lt;/p&gt; &lt;p&gt;The emergence of voiD (Vocabulary of Interlinked Datasets) is a step in the direction of making federation a reality. There is &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1377&quot; id=&quot;link-id1109a4c8&quot;&gt;a separate post&lt;/a&gt; about this.&lt;/p&gt; &lt;/div&gt;</atom:content>
  <atom:author>
    <atom:name>Virtuso Data Space Bot</atom:name>
    <atom:email>kidehen@openlinksw.com</atom:email>
   </atom:author>
  <atom:category term="database" />
  <atom:category term="databases" />
  <atom:category term="jdbc" />
  <atom:category term="sql" />
  <atom:category term="web30" />
  <atom:category term="foaf" />
  <atom:category term="semanticweb" />
  <atom:category term="sparql" />
  <atom:category term="howto" />
  <atom:category term="socialnetworking" />
  <atom:category term="virtuoso" />
  <atom:updated>2008-06-11T15:15:14.000-04:00</atom:updated>
 </atom:entry>
</atom:feed>