<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
<channel>

<title>OpenLink Virtuoso (Product Blog)</title><link>http://virtuoso.openlinksw.com/blog/vdb/blog/</link><description>A great place to track Virtuoso&#39;s rapid evolution.</description><managingEditor>kidehen@openlinksw.com</managingEditor><pubDate>Wed, 22 May 2013 11:33:15 GMT</pubDate><generator>Virtuoso Universal Server 06.04.3135</generator><webMaster>kidehen@openlinksw.com</webMaster><image><title>OpenLink Virtuoso (Product Blog)</title><url>http://virtuoso.openlinksw.com/weblog/public/images/vbloglogo.gif</url><link>http://virtuoso.openlinksw.com/blog/vdb/blog/</link><description>A great place to track Virtuoso&#39;s rapid evolution.</description><width>88</width><height>31</height></image>
<item><title>RDF and Transactions</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-03-22#1690</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1690#comments</comments><pubDate>Tue, 22 Mar 2011 22:52:56 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2011-03-22T17:44:21-04:00</n0:modified><description>&lt;p&gt;I will here talk about &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x249bc940&quot;&gt;RDF&lt;/a&gt; and transactions for developers in general. The next one talks about specifics and is for specialists.&lt;/p&gt;

&lt;p&gt;Transactions are certainly not the first thing that comes to mind when one hears &amp;quot;RDF&amp;quot;.  We have at times used a recruitment questionnaire where we ask applicants to define a transaction.  Many vaguely remember that it is a unit of work, but usually not more than that.  We sometimes get questions from users about why they get an error message that says &amp;quot;deadlock&amp;quot;.  &amp;quot;Deadlock&amp;quot; is what happens when multiple users concurrently update balances on multiple bank accounts in the wrong order.  What does this have to do with RDF?&lt;/p&gt;

&lt;p&gt;There are in fact users who even use XA with a &lt;a class=&quot;auto-href&quot; href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x22c8dbc8&quot;&gt;Virtuoso&lt;/a&gt;-based RDF application.  &lt;a class=&quot;auto-href&quot; href=&quot;http://semanticweb.org/id/Franz_Inc&quot; id=&quot;link-id0x27bd0c08&quot;&gt;Franz&lt;/a&gt; also has publicized their development of full &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/ACID&quot; id=&quot;link-id0x283985c8&quot;&gt;ACID&lt;/a&gt; capabilities for &lt;a class=&quot;auto-href&quot; href=&quot;http://semanticweb.org/id/AllegroGraph&quot; id=&quot;link-id0x238ba438&quot;&gt;AllegroGraph&lt;/a&gt;.  RDF is a database &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Database_schema&quot; id=&quot;link-id0x2864fef8&quot;&gt;schema&lt;/a&gt; model, and transactions will inevitably become an issue in databases.&lt;/p&gt;

&lt;p&gt;At the same time, the developer population trained with &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/MySQL&quot; id=&quot;link-id0x284d2d80&quot;&gt;MySQL&lt;/a&gt; and &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/PHP&quot; id=&quot;link-id0x237230e8&quot;&gt;PHP&lt;/a&gt; is not particularly transaction-aware.  Transactions have gone out of style, declares the No-&lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x2920cc88&quot;&gt;SQL&lt;/a&gt; crowd.  Well, it is not so much SQL they object to but ACID, i.e., transactional guarantees. We will talk more about this in the next post.  The &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x283f0588&quot;&gt;SPARQL&lt;/a&gt; language and protocol do not go into transactions, except for expressing the wish that an &lt;code&gt;UPDATE&lt;/code&gt; request to an end-point be atomic. But beware -- atomicity is a gateway drug, and soon one finds oneself on full ACID.  &lt;/p&gt;

&lt;p&gt;If one says that a thing will either happen &lt;i&gt;in its entirety&lt;/i&gt; or &lt;i&gt;not at all,&lt;/i&gt; which is what (A) atomicity means, then the question arises of (I) isolation; that is, what happens if somebody else does something to the same &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x238280f8&quot;&gt;data&lt;/a&gt; at the same time?  Then comes the question of whether a thing, once having happened, will stay that way; i.e., (D) durability. Finally, there is (&lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/C%2B%2B&quot; id=&quot;link-id0x276714b8&quot;&gt;C&lt;/a&gt;) consistency, which means that the transaction&amp;#39;s result must not contradict restrictions the database is supposed to enforce.  RDF usually has no restrictions; thus consistency mostly means that the internal state of the DBMS must be consistent, e.g., different indices on triples/quads should contain the same data.&lt;/p&gt;

&lt;p&gt;There are, of course, database-like consistency criteria that one can express in RDF Schema and &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Web_Ontology_Language&quot; id=&quot;link-id0x28625a90&quot;&gt;OWL&lt;/a&gt;, concerning data types, mandatory presence of properties, or restrictions on cardinality (i.e., one may only have one spouse at a time, and the like).  &lt;/p&gt;

&lt;p&gt;If one indeed did enforce them all, then RDF would be very like the relational model -- with all the restrictions, but without the 40 years of work on &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id0x249bf4f8&quot;&gt;RDBMS&lt;/a&gt; performance.  For this reason, RDF use tends to involve data that is not structured enough to be a good fit for RDBMS.&lt;/p&gt;

&lt;p&gt;There is of course the OWL side, where consistency is important but is defined in such complex ways that these criteria again are not a good fit for RDBMS.  RDF could be seen to be split between the schema-last world and the &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Knowledge&quot; id=&quot;link-id0x249504f8&quot;&gt;knowledge&lt;/a&gt; representation world.  I will here focus on the schema-last side.&lt;/p&gt;

&lt;p&gt;Transactions are relevant in RDF in two cases: 1. If data is trickle loaded in small chunks, one likes to know that the chunks do not get lost or corrupted; 2. If the application has any semantics that reserve resources, then these operations need transactions.  The latter is not so common with RDF but examples include read-write situations, like checking if a seat is available and then reserving it. Transactionality guarantees that the same seat does not get reserved twice.&lt;/p&gt;
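
&lt;p&gt;The seat example can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python with SQLite rather than Virtuoso, and the &lt;code&gt;seat&lt;/code&gt; table and the helper names &lt;code&gt;open_db&lt;/code&gt; and &lt;code&gt;reserve_seat&lt;/code&gt; are made up for the purpose. The point is that the check and the reservation run inside one transaction.&lt;/p&gt;

&lt;pre&gt;
```python
import sqlite3


def open_db():
    """Create an in-memory database with one free seat (hypothetical schema)."""
    # isolation_level=None puts the connection in autocommit mode, so the
    # code below issues BEGIN / COMMIT / ROLLBACK explicitly.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE seat (id INTEGER PRIMARY KEY, passenger TEXT)")
    conn.execute("INSERT INTO seat (id, passenger) VALUES (1, NULL)")
    return conn


def reserve_seat(conn, seat_id, passenger):
    """Check availability and reserve in one transaction; True if reserved."""
    # BEGIN IMMEDIATE takes the write lock before the availability check,
    # so no other writer can slip in between the check and the update.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT passenger FROM seat WHERE id = ?", (seat_id,)
        ).fetchone()
        free = row is not None and row[0] is None
        if free:
            conn.execute(
                "UPDATE seat SET passenger = ? WHERE id = ?", (passenger, seat_id)
            )
        conn.execute("COMMIT")
        return free
    except Exception:
        conn.execute("ROLLBACK")
        raise
```
&lt;/pre&gt;

&lt;p&gt;With this sketch, a second call to &lt;code&gt;reserve_seat&lt;/code&gt; for the same seat returns false and leaves the first reservation in place.&lt;/p&gt;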

&lt;p&gt;Web people argue with some justification that since the four cardinal virtues of database never existed on the web to begin with, applying strict ACID to web data is beside the point, like locking the stable after the horse has long since run away.  This may be so; yet the systems used for processing data, whether that data is dirty or not, benefit from predictable operation under concurrency and from not losing data.&lt;/p&gt;

&lt;p&gt;Analytics workloads are not primarily about transactions, but still need to specify what happens with updates.  Analyzing data from measurements may not have concurrent updates, but there the transaction issue is replaced by the question of making explicit how the data was acquired and what processing has been applied to it before storage.&lt;/p&gt;


&lt;p&gt;As mentioned before, the &lt;a class=&quot;auto-href&quot; href=&quot;http://lod2.eu/&quot; id=&quot;link-id0x27d952d0&quot;&gt;LOD2&lt;/a&gt; project is at the crossroads of RDF and database.  I construe its mission to be the making of RDF into a respectable database discipline.  Database respectability in turn is as good as inconceivable without addressing the very bedrock on which this science was founded: transactions.&lt;/p&gt;

&lt;p&gt;As previously argued, we need well-defined and auditable benchmarks.  This again brings up the topic of transactions.  Once we embark on the database benchmark route, there is no way around this. &lt;a class=&quot;auto-href&quot; href=&quot;http://www.tpc.org/&quot; id=&quot;link-id0x2359d2d0&quot;&gt;TPC&lt;/a&gt;-&lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id0x28edb770&quot;&gt;H&lt;/a&gt; mandates that the system under test support transactions, and the audit involves a test for this.  We can do no less.&lt;/p&gt;

&lt;p&gt;This has led me to more closely examine the issue of RDF and transactions, and whether there exist differences between transactions applied to RDF and to relational data.  &lt;/p&gt;

&lt;p&gt;As concerns Virtuoso, our position has been that one can get full ACID in Virtuoso, whether in SQL or SPARQL, by using a connected client (e.g., &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Open_Database_Connectivity&quot; id=&quot;link-id0x23a55698&quot;&gt;ODBC&lt;/a&gt;, &lt;a class=&quot;auto-href&quot; href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id0x235cecf0&quot;&gt;JDBC&lt;/a&gt;, or the &lt;a class=&quot;auto-href&quot; href=&quot;http://jena.sourceforge.net/&quot; id=&quot;link-id0x23213900&quot;&gt;Jena&lt;/a&gt; or &lt;a class=&quot;auto-href&quot; href=&quot;http://sourceforge.net/projects/sesame/&quot; id=&quot;link-id0x277874d0&quot;&gt;Sesame&lt;/a&gt; frameworks), and setting the isolation options on the connection.  Having taken this step, one then must take the next step, which consists of dealing with deadlocks; i.e., with concurrent utilization, it may happen that the database at any time notifies the client that the transaction got aborted and the client must retry.&lt;/p&gt;

&lt;p&gt;Web developers especially do not like this, because this is not what MySQL has taught them to expect. MySQL does have transactional back-ends like InnoDB, but often gets used without transactions.&lt;/p&gt;

&lt;p&gt;With the March 2011 Virtuoso releases, we have taken a closer look at transactions with RDF.  It is more practical to reduce the possibility of errors than to require developers to pay attention. For this reason we have automated isolation settings for RDF, greatly reduced the incidence of deadlocks, and even incorporated automatic deadlock retries where applicable.&lt;/p&gt;

&lt;p&gt;If all users lock the resources they need in the same order, there will be no deadlocks.  This is what we do with RDF load in Virtuoso 7; thus any mix of concurrent &lt;code&gt;INSERTs&lt;/code&gt; and &lt;code&gt;DELETEs&lt;/code&gt;, provided each is under a certain size (normally 10000 quads), is guaranteed never to fail due to locking.  These could still fail due to running out of space, though. With previous versions, there always was a possibility of having an &lt;code&gt;INSERT&lt;/code&gt; or &lt;code&gt;DELETE&lt;/code&gt; fail because of deadlock with multiple users.  Vectored &lt;code&gt;INSERT&lt;/code&gt; and &lt;code&gt;DELETE&lt;/code&gt; are sufficient for making web crawling or archive maintenance practically deadlock free, since there the primary transaction is the &lt;code&gt;INSERT&lt;/code&gt; or &lt;code&gt;DELETE&lt;/code&gt; of a small graph.&lt;/p&gt;
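
&lt;p&gt;The lock-ordering rule is easy to show outside any database. The following is a minimal sketch in Python using in-process locks, not Virtuoso code; the &lt;code&gt;Account&lt;/code&gt; class and the function names are hypothetical. Two threads transfer in opposite directions, which is the classic deadlock recipe, yet both always take the locks in ascending key order.&lt;/p&gt;

&lt;pre&gt;
```python
import threading


class Account:
    """Hypothetical resource with a sort key and its own lock."""

    def __init__(self, key, balance):
        self.key = key
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src, dst, amount):
    """Move amount from src to dst, locking both accounts in key order."""
    if src is dst:
        return
    # Sorting by key gives every caller the same acquisition order,
    # whatever the direction of the transfer.
    first, second = sorted((src, dst), key=lambda account: account.key)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount


def worker(src, dst, rounds):
    for _ in range(rounds):
        transfer(src, dst, 1)


def run_demo(rounds=1000):
    """Run opposing transfers concurrently and return the final balances."""
    a, b = Account(1, 100), Account(2, 100)
    threads = [
        threading.Thread(target=worker, args=(a, b, rounds)),
        threading.Thread(target=worker, args=(b, a, rounds)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return a.balance, b.balance
```
&lt;/pre&gt;

&lt;p&gt;If &lt;code&gt;transfer&lt;/code&gt; instead locked &lt;code&gt;src&lt;/code&gt; first and &lt;code&gt;dst&lt;/code&gt; second, the two threads could each hold one lock and wait forever for the other.&lt;/p&gt;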

&lt;p&gt;Furthermore, since the &lt;a class=&quot;auto-href&quot; href=&quot;http://www.w3.org/TR/rdf-sparql-protocol/&quot; id=&quot;link-id0x23eadf50&quot;&gt;SPARQL protocol&lt;/a&gt; has no way of specifying transactions consisting of multiple client-server exchanges, the SPARQL end-point may deal with deadlocks by itself.  If all else fails, it can simply execute requests one after the other, thus eliminating any possibility of locking.  We note that many statements will be intrinsically free of deadlocks by virtue of always locking in key order, but this cannot be universally guaranteed with arbitrary size operations; thus concurrent operations might still sometimes deadlock.  Anyway, vectored execution as introduced in Virtuoso 7, besides getting easily double-speed random access, also greatly reduces deadlocks by virtue of ordering operations.&lt;/p&gt;

&lt;p&gt;In the next post we will talk about what transactions mean with RDF and whether there is any difference with the relational model.&lt;/p&gt;</description></item><item><title>Fault Tolerance in Virtuoso Cluster Edition (Short Version)</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-04-07#1621</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1621#comments</comments><pubDate>Wed, 07 Apr 2010 16:40:02 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2010-04-14T19:12:47.000003-04:00</n0:modified><description>&lt;p&gt;We have for some time had the option of storing &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x28eb2178&quot;&gt;data&lt;/a&gt; in a cluster in multiple copies, in the Commercial Edition of &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x25178ed0&quot;&gt;Virtuoso&lt;/a&gt;. (This feature is not in and is not planned to be added to the Open Source Edition.)&lt;/p&gt;

&lt;p&gt;Based on some feedback from the field, we decided to make this feature more user friendly. The gist of the matter is that failure and recovery processes have been automated so that neither application developer nor operating personnel needs any &lt;a href=&quot;http://dbpedia.org/resource/Knowledge&quot; id=&quot;link-id0x21fea428&quot;&gt;knowledge&lt;/a&gt; of how things actually work.&lt;/p&gt;

&lt;p&gt;So I will here make a few high level statements about what we offer for fault tolerance. I will follow up with technical specifics in another post.&lt;/p&gt;

&lt;p&gt;Three types of individuals need to know about fault tolerance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Executives: What does it cost? Will it really eliminate downtime?&lt;/li&gt;
&lt;li&gt;System Administrators: Is it hard to configure? What do I do when I get an alert?&lt;/li&gt;
&lt;li&gt;Application Developers/Programmers: Will I need to write extra code? Can old applications get fault tolerance with no changes?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I will explain the matter to each of these three groups:&lt;/p&gt;

&lt;h2&gt;Executives&lt;/h2&gt;

&lt;p&gt;The value gained is elimination of downtime. The cost is in purchasing twice (or thrice) the hardware and software licenses. In reality, the cost is less since you get the whole money&amp;#39;s worth of read throughput and half the money&amp;#39;s worth of write throughput. Since most applications are about reading, this is a good deal. You do not end up paying for unused capacity.&lt;/p&gt;

&lt;p&gt;Server instances are grouped in &amp;quot;quorums&amp;quot; of two or, for extra safety, three; as long as one member of each quorum is available, the system keeps running and nobody sees a difference, except maybe for slower response. This does not protect against widespread power outage or the building burning down; the scope is limited to hardware and software failures at one site.&lt;/p&gt;

&lt;p&gt;The most basic site-wide disaster recovery plan consists of constantly streaming updates off-site. Using an off-site backup plus update stream, one can reconstitute the failed data center on a cloud provider in a few hours. Details will vary; please &lt;a href=&quot;http://www.openlinksw.com/contact/&quot; id=&quot;link-id0x2bdb0db8&quot;&gt;contact us&lt;/a&gt; for specifics.&lt;/p&gt;

&lt;p&gt;Running multiple sites in parallel is also possible but specifics will depend on the application. Again, please contact us if you have a specific case in mind.&lt;/p&gt;

&lt;h2&gt; System Administrators&lt;/h2&gt;

&lt;p&gt;To configure, divide your server instances into quorums of 2 or 3, according to which will be mirrors of each other, with each quorum member on a different host from the others in its quorum. These things are declared in a configuration file. Table definitions do not have to be altered for fault tolerance. It is enough for tables and indices to specify partitioning. Use two switches, and two NICs per machine, and connect one of each server&amp;#39;s network cables to each switch, to cover switch failures.&lt;/p&gt;

&lt;p&gt;When things break, as long as there is at least one server instance up from each quorum, things will continue to work. Reboots and the like are handled without operator intervention; if there is a broken host, then remove it and put a spare in its place. If the disks are OK, put the old disks in the replacement host and start. If the disks are gone, then copy the database files from the live copy. Finally start the replacement database, and the system will do the rest. The system is online in read-write mode during all this time, including during copying.&lt;/p&gt;

&lt;p&gt;Having mirrored disks in individual hosts is optional, since data will in any case be kept in two copies. Mirrored disks will shorten the vulnerability window of running a partition on a single server instance, since this will for the most part eliminate the need to copy many (hundreds of) GB of database files when recovering a failed instance.&lt;/p&gt;

&lt;h2&gt; Application Developers/Programmers&lt;/h2&gt;

&lt;p&gt;An application can connect to any server instance in the cluster and have access to the same data, with full &lt;a href=&quot;http://dbpedia.org/resource/ACID&quot; id=&quot;link-id0x6451870&quot;&gt;ACID&lt;/a&gt; properties.&lt;/p&gt;

&lt;p&gt;There are two types of errors that can occur in any database application: The database server instance may be offline or otherwise unreachable; and a transaction may be aborted due to a deadlock.&lt;/p&gt;

&lt;p&gt;For the missing server instance, the application should try to reconnect. An &lt;a href=&quot;http://dbpedia.org/resource/Open_Database_Connectivity&quot; id=&quot;link-id0x28e859b8&quot;&gt;ODBC&lt;/a&gt;/&lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id0x28e11940&quot;&gt;JDBC&lt;/a&gt; connect string can specify a list of alternate server instances; thus as long as the application is written to try to reconnect as best practices dictate, there is no new code needed.&lt;/p&gt;

&lt;p&gt;For the deadlock, the application is supposed to retry the transaction. Sometimes when a server instance drops out or rejoins a running cluster, some transactions will have to be retried. To the application, these conditions look like a deadlock. If the application handles deadlocks (&lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x2bda4e40&quot;&gt;SQL&lt;/a&gt; State 40001) as best practices dictate, there is no change needed.&lt;/p&gt;
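
&lt;p&gt;A retry loop of the kind meant here might look as follows. This is a sketch in Python, not tied to any particular driver: &lt;code&gt;DatabaseError&lt;/code&gt; is a hypothetical stand-in for whatever exception the client library raises, and it is assumed to expose the SQL State as &lt;code&gt;sqlstate&lt;/code&gt;. The &lt;code&gt;transaction&lt;/code&gt; argument is any callable that runs one complete transaction.&lt;/p&gt;

&lt;pre&gt;
```python
import random
import time

DEADLOCK_SQLSTATE = "40001"


class DatabaseError(Exception):
    """Hypothetical stand-in for a driver exception carrying a SQL State."""

    def __init__(self, sqlstate, message=""):
        super().__init__(message or sqlstate)
        self.sqlstate = sqlstate


def run_with_retry(transaction, max_attempts=5, base_delay=0.01):
    """Run transaction(), retrying only when it aborts with SQL State 40001."""
    attempt = 1
    while True:
        try:
            return transaction()
        except DatabaseError as err:
            # Anything other than a deadlock, or too many tries: give up.
            if err.sqlstate != DEADLOCK_SQLSTATE or attempt >= max_attempts:
                raise
            # Randomized, growing pause so competing clients do not
            # collide again in lockstep.
            time.sleep(base_delay * attempt * random.random())
            attempt += 1
```
&lt;/pre&gt;

&lt;p&gt;Because a server instance dropping out or rejoining looks like a deadlock to the application, the same loop would cover those cases as well.&lt;/p&gt;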

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;In summary...&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Limited extra cost for fault tolerance; no equipment sitting idle.&lt;/li&gt;
&lt;li&gt;Easy operation: Replace servers when they fail; the cluster does the rest.&lt;/li&gt;
&lt;li&gt;No changes needed to most applications.&lt;/li&gt;
&lt;li&gt;No proprietary SQL APIs or special fault tolerance logic needed in applications.&lt;/li&gt;
&lt;li&gt;Fully transactional programming model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All the above applies to both the Graph Model (&lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x22606f10&quot;&gt;RDF&lt;/a&gt;) and Relational (SQL) sides of Virtuoso. These features will be in the commercial release of Virtuoso to be publicly available in the next 2-3 weeks. Please &lt;a href=&quot;http://www.openlinksw.com/contact/&quot; id=&quot;link-id0x24f35648&quot;&gt;contact OpenLink Software&lt;/a&gt; Sales for details of availability or for getting advance evaluation copies.&lt;/p&gt;

&lt;h3&gt;
&lt;a href=&quot;http://dbpedia.org/resource/Glossary&quot; id=&quot;link-id0x6648890&quot;&gt;Glossary&lt;/a&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
  &lt;b&gt;Virtuoso Cluster (VC)&lt;/b&gt; -- a collection of Virtuoso Cluster Nodes on one or more machines, working in parallel as part of a Virtuoso Cluster.&lt;/li&gt;
&lt;li&gt;
  &lt;b&gt;Virtuoso Cluster Node (VCN)&lt;/b&gt; -- a Virtuoso Server Instance (Non Fault-Tolerant Operations), or a Quorum of Server Instances (Fault Tolerant Operations), which is a member of a collection of Virtuoso Cluster Nodes working in parallel as part of a Virtuoso Cluster.&lt;/li&gt;
&lt;li&gt;
  &lt;b&gt;Virtuoso Host Cluster (VHC)&lt;/b&gt; -- a collection of machines, each hosting one or more Virtuoso Server Instances, making up a Virtuoso Cluster.&lt;/li&gt;
&lt;li&gt;
  &lt;b&gt;Virtuoso Host Cluster Node (VHCN)&lt;/b&gt; -- a machine hosting one or more Virtuoso Server Instances that are members of a Virtuoso Cluster.&lt;/li&gt;
&lt;li&gt;
  &lt;b&gt;Virtuoso Server Instance (VSI)&lt;/b&gt; -- a single Virtuoso process with exclusive access to its own permanent storage, consisting of database files and logs.  May comprise an entire Virtuoso Cluster Node (Non Fault-Tolerant Operations), or be one member of a quorum which comprises a Virtuoso Cluster Node (Fault Tolerant Operations).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Also see&lt;/h3&gt;
&lt;ul&gt;
 &lt;li&gt;
  &lt;a href=&quot;http://www.gbcacm.org/sites/www.gbcacm.org/files/slides/SpecialRelativity[1]_0.pdf&quot; id=&quot;link-id0x1320f1e8&quot;&gt;Special Relativity and the Problem of Database Scalability (PDF)&lt;/a&gt;, by James Starkey of &lt;a href=&quot;http://www.nimbusdb.com/&quot; id=&quot;link-id0x1320f2b0&quot;&gt;NimbusDB, Inc.&lt;/a&gt;
 &lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Compare &amp; Contrast: SQL Server&#39;s Linked Server vs Virtuoso&#39;s Virtual Database Layer</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-02-12#1607</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1607#comments</comments><pubDate>Fri, 12 Feb 2010 21:44:10 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2010-02-17T11:21:26-05:00</n0:modified><description>&lt;h2&gt;
&lt;a href=&quot;http://dbpedia.org/resource/Microsoft&quot; id=&quot;link-id166785f0&quot;&gt;Microsoft&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id169b6bb8&quot;&gt;SQL&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/Microsoft_SQL_Server&quot; id=&quot;link-id163b8350&quot;&gt;Server&lt;/a&gt;&amp;#39;s Linked Server Promise&lt;/h2&gt;
&lt;p&gt;The ability to use distributed queries -- i.e., to issue SQL queries against any OLE-DB-accessible back end -- via Linked Servers.&lt;/p&gt;
&lt;p&gt;The promise fails to materialize, primarily because while there are several ways of issuing such distributed queries, none of them work with all &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id1675e128&quot;&gt;data&lt;/a&gt; access providers, and even for those that do, results received via different methods may differ.&lt;/p&gt;
&lt;p&gt;Compounding the issue, there are specific configuration options which must be set correctly, often differing from defaults, to permit such things as &amp;quot;ad-hoc distributed queries&amp;quot;.&lt;/p&gt;
&lt;p&gt;Common tools that are typically used with such Linked Servers include SSIS and DTS. Such generic tools typically rely on four-part naming for their queries, expecting SQL Server to properly rewrite remotely executed queries for the DBMS engine which ultimately executes them.&lt;/p&gt;
&lt;p&gt;The most common cause of failure is that when SQL Server rewrites a query, it typically does so using SQL-92 syntax, regardless of the back-end&amp;#39;s abilities, and using the Transact-SQL dialect for implementation-specific query syntaxes, regardless of the back-end&amp;#39;s dialect. This leads to problems especially when the Linked Server is an older variant which doesn&amp;#39;t support SQL-92 (e.g., Progress 8.x or earlier, &lt;a href=&quot;http://dbpedia.org/resource/IBM_Informix&quot; id=&quot;link-id167f6fa0&quot;&gt;Informix&lt;/a&gt; 7 or earlier), or whose SQL dialect differs substantially from Transact-SQL (e.g., Informix, Progress, &lt;a href=&quot;http://dbpedia.org/resource/MySQL&quot; id=&quot;link-id166c7848&quot;&gt;MySQL&lt;/a&gt;, etc.).&lt;/p&gt;
&lt;h3&gt;Basic Four-Part Naming&lt;/h3&gt;
&lt;blockquote&gt;
&lt;code&gt;SELECT * &lt;br /&gt;&amp;nbsp;&amp;nbsp;FROM linked_server.[catalog].[&lt;a href=&quot;http://dbpedia.org/resource/Database_schema&quot; id=&quot;link-id163c3f78&quot;&gt;schema&lt;/a&gt;].object&lt;/code&gt;
&lt;/blockquote&gt;
&lt;p&gt;Four-part naming presumes that you have pre-defined a Linked Server, and executes the query on SQL Server. SQL Server decides what if any sub- or partial-queries to execute on the linked server, tends not to use appropriate syntax for these, and usually does not take advantage of linked server or provider features.&lt;/p&gt;
&lt;h3&gt;OpenQuery&lt;/h3&gt;
&lt;blockquote&gt;
&lt;code&gt;SELECT * &lt;br /&gt;&amp;nbsp;&amp;nbsp;FROM OPENQUERY ( linked_server , &amp;#39;query&amp;#39; )&lt;/code&gt;
&lt;/blockquote&gt;
&lt;p&gt;OpenQuery also presumes that you have pre-defined a Linked Server, but executes the query as a &amp;quot;pass-through&amp;quot;, handing it directly to the remote provider. Features of the remote server and the data access provider may be taken advantage of, but only if the query author knows about them.&lt;/p&gt;
&lt;h4&gt;From the product docs:&lt;/h4&gt;
&lt;blockquote&gt;
&lt;p&gt;
&lt;i&gt;SQL Server&amp;#39;s Linked Server extension executes the specified pass-through query on the specified linked server. This server is an OLE DB data source. &lt;code&gt;OPENQUERY&lt;/code&gt; can be referenced in the &lt;code&gt;FROM&lt;/code&gt; clause of a query as if it were a table name. &lt;code&gt;OPENQUERY&lt;/code&gt; can also be referenced as the target table of an &lt;code&gt;INSERT&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt;, or &lt;code&gt;DELETE&lt;/code&gt; statement. This is subject to the capabilities of the OLE DB provider. Although the query may return multiple result sets, &lt;code&gt;OPENQUERY&lt;/code&gt; returns only the first one.&lt;/i&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;i&gt;...&lt;/i&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;i&gt;&lt;code&gt;OPENQUERY&lt;/code&gt; does not accept variables for its arguments. &lt;code&gt;OPENQUERY&lt;/code&gt; cannot be used to execute extended stored procedures on a linked server. However, an extended stored procedure can be executed on a linked server by using a four-part name. &lt;/i&gt;
&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;OpenRowset&lt;/h3&gt;
&lt;blockquote&gt;
&lt;code&gt;SELECT * 
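&lt;br /&gt;&amp;nbsp;&amp;nbsp;FROM OPENROWSET
&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;( &amp;#39;provider_name&amp;#39; , &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;#39;datasource&amp;#39; ; &amp;#39;user_id&amp;#39; ; &amp;#39;password&amp;#39;, &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;{ [ catalog. ] [ schema. ] object | &amp;#39;query&amp;#39; }&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;)&lt;/code&gt;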
&lt;/blockquote&gt;
&lt;p&gt;
&lt;code&gt;OpenRowset&lt;/code&gt; does not require a pre-defined Linked Server, but does require the user to know what data access providers are available on the SQL Server host, and how to manually construct a valid connection string for the chosen provider. It does permit both &amp;quot;pass-through&amp;quot; and &amp;quot;local execution&amp;quot; queries, which can lead to confusion when the results differ (as they regularly will).&lt;/p&gt;
&lt;h4&gt;More from product docs:&lt;/h4&gt;
&lt;blockquote&gt;
&lt;p&gt;
&lt;i&gt;Includes all connection &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id163ab840&quot;&gt;information&lt;/a&gt; that is required to access remote data from an OLE DB data source. This method is an alternative to accessing tables in a linked server and is a one-time, ad hoc method of connecting and accessing remote data by using OLE DB. For more frequent references to OLE DB data sources, use linked servers instead. For more information, see Linking Servers. The &lt;code&gt;OPENROWSET&lt;/code&gt; function can be referenced in the &lt;code&gt;FROM&lt;/code&gt; clause of a query as if it were a table name. The &lt;code&gt;OPENROWSET&lt;/code&gt; function can also be referenced as the target table of an &lt;code&gt;INSERT&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt;, or &lt;code&gt;DELETE&lt;/code&gt; statement, subject to the capabilities of the OLE DB provider. Although the query might return multiple result sets, &lt;code&gt;OPENROWSET&lt;/code&gt; returns only the first one.&lt;/i&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;i&gt;OPENROWSET also supports bulk operations through a built-in &lt;code&gt;BULK&lt;/code&gt; provider that enables data from a file to be read and returned as a rowset.&lt;/i&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;i&gt;...&lt;/i&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;i&gt;&lt;code&gt;OPENROWSET&lt;/code&gt; can be used to access remote data from OLE DB data sources only when the &lt;code&gt;DisallowAdhocAccess&lt;/code&gt; registry option is explicitly set to &lt;code&gt;0&lt;/code&gt; for the specified provider, and the &lt;code&gt;Ad Hoc Distributed Queries&lt;/code&gt; advanced configuration option is enabled. When these options are not set, the default behavior does not allow for ad hoc access. When accessing remote OLE DB data sources, the login identity of trusted connections is not automatically delegated from the server on which the client is connected to the server that is being queried. Authentication delegation must be configured. For more information, see Configuring Linked Servers for Delegation.&lt;/i&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;i&gt;Catalog and schema names are required if the OLE DB provider supports multiple catalogs and schemas in the specified data source. Values for catalog and schema can be omitted when the OLE DB provider does not support them. If the provider supports only schema names, a two-part name of the form &lt;code&gt;schema.object&lt;/code&gt; must be specified. If the provider supports only catalog names, a three-part name of the form &lt;code&gt;catalog.schema.object&lt;/code&gt; must be specified. Three-part names must be specified for pass-through queries that use the SQL Server Native Client OLE DB provider. For more information, see Transact-SQL Syntax Conventions (Transact-SQL). &lt;code&gt;OPENROWSET&lt;/code&gt; does not accept variables for its arguments.&lt;/i&gt;
&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;OpenDataSource&lt;/h3&gt;
&lt;blockquote&gt;
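&lt;code&gt;SELECT * &lt;br /&gt;&amp;nbsp;&amp;nbsp;FROM OPENDATASOURCE&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;( &amp;#39;provider_name&amp;#39;,&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;#39;provider_specific_datasource_specification&amp;#39;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;).[catalog].[schema].object&lt;/code&gt;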
&lt;/blockquote&gt;
&lt;p&gt;As with basic four-part naming, &lt;code&gt;OpenDataSource&lt;/code&gt; executes the query on SQL Server. SQL Server decides what if any sub-queries to execute on the linked server, tends not to use appropriate syntax for these, and usually does not take advantage of linked server or provider features.&lt;/p&gt;
&lt;h4&gt;Additional doc excerpts&lt;/h4&gt;
&lt;blockquote&gt;
&lt;p&gt;
&lt;i&gt;Provides ad hoc connection information as part of a four-part object name without using a linked server name.&lt;/i&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;i&gt;...&lt;/i&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;i&gt;&lt;code&gt;OPENDATASOURCE&lt;/code&gt; can be used to access remote data from OLE DB data sources only when the &lt;code&gt;DisallowAdhocAccess&lt;/code&gt; registry option is explicitly set to &lt;code&gt;0&lt;/code&gt; for the specified provider, and the &lt;code&gt;Ad Hoc Distributed Queries&lt;/code&gt; advanced configuration option is enabled. When these options are not set, the default behavior does not allow for ad hoc access.&lt;/i&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;i&gt;The &lt;code&gt;OPENDATASOURCE&lt;/code&gt; function can be used in the same Transact-SQL syntax locations as a linked-server name. Therefore, &lt;code&gt;OPENDATASOURCE&lt;/code&gt; can be used as the first part of a four-part name that refers to a table or view name in a &lt;code&gt;SELECT&lt;/code&gt;, &lt;code&gt;INSERT&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt;, or &lt;code&gt;DELETE&lt;/code&gt; statement, or to a remote stored procedure in an &lt;code&gt;EXECUTE&lt;/code&gt; statement. When executing remote stored procedures, &lt;code&gt;OPENDATASOURCE&lt;/code&gt; should refer to another instance of SQL Server. &lt;code&gt;OPENDATASOURCE&lt;/code&gt; does not accept variables for its arguments.&lt;/i&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;i&gt;Like the &lt;code&gt;OPENROWSET&lt;/code&gt; function, &lt;code&gt;OPENDATASOURCE&lt;/code&gt; should only reference OLE DB data sources that are accessed infrequently. Define a linked server for any data sources accessed more than several times. Neither &lt;code&gt;OPENDATASOURCE&lt;/code&gt; nor &lt;code&gt;OPENROWSET&lt;/code&gt; provide all the functionality of linked-server definitions, such as security management and the ability to query catalog information. All connection information, including passwords, must be provided every time that &lt;code&gt;OPENDATASOURCE&lt;/code&gt; is called.&lt;/i&gt;
&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
&lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id122c66b8&quot;&gt;Virtuoso&lt;/a&gt;&amp;#39;s &lt;a href=&quot;http://dbpedia.org/resource/Virtual_Database&quot; id=&quot;link-id167af7d8&quot;&gt;Virtual Database&lt;/a&gt; Promise &amp;amp; Deliverables&lt;/h2&gt; 
&lt;p&gt;The ability to link objects (tables, views, stored procedures) from any &lt;a href=&quot;http://dbpedia.org/resource/Open_Database_Connectivity&quot; id=&quot;link-id1394ab90&quot;&gt;ODBC&lt;/a&gt;-accessible data source. This includes any &lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id11c38748&quot;&gt;JDBC&lt;/a&gt;-accessible data source, through the OpenLink ODBC Driver for JDBC Data Sources.&lt;/p&gt;
&lt;p&gt;There are no limitations on the data types which can be queried or read, nor must the target DBMS have primary keys set on linked tables or views.&lt;/p&gt;
&lt;p&gt;All linked objects may be used in single-site or distributed queries, and the user need not know anything about the actual data structure, including whether the objects being queried are remote or local to Virtuoso -- all objects are made to appear as part of a Virtuoso-local schema.&lt;/p&gt;

</description></item><item><title>Compare &amp; Contrast: Oracle Heterogeneous Services (HSODBC, DG4ODBC) vs Virtuoso&#39;s Virtual Database Layer</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-02-12#1606</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1606#comments</comments><pubDate>Fri, 12 Feb 2010 21:43:51 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2010-02-17T11:21:22.000001-05:00</n0:modified><description>&lt;h3&gt;
&lt;a href=&quot;http://dbpedia.org/resource/Oracle_Database&quot; id=&quot;link-id12349be8&quot;&gt;Oracle&lt;/a&gt; Gateway Promise&lt;/h3&gt;
&lt;p&gt;Ability to use distributed queries over a generic connectivity gateway (HSODBC, DG4ODBC) -- i.e., to issue &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id167e5760&quot;&gt;SQL&lt;/a&gt; queries against any &lt;a href=&quot;http://dbpedia.org/resource/Open_Database_Connectivity&quot; id=&quot;link-id13c6bfa0&quot;&gt;ODBC&lt;/a&gt;- or OLE-DB-accessible linked back end.&lt;/p&gt;
&lt;h3&gt;Reality&lt;/h3&gt;
&lt;p&gt;The promise fails to materialize for several reasons. Immediate limitations include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;All tables locked by a &lt;code&gt;FOR UPDATE&lt;/code&gt; clause and all tables with &lt;code&gt;LONG&lt;/code&gt; columns selected by the query must be located in the same external database.&lt;/li&gt;
&lt;li&gt;Distributed queries cannot select user-defined types or object &lt;code&gt;REF&lt;/code&gt; datatypes on remote tables.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In addition to the above, which apply to database-specific heterogeneous environments, the database-agnostic generic connectivity components have the following limitations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A table including a &lt;code&gt;BLOB&lt;/code&gt; column must have a separate column that serves as a primary key.&lt;/li&gt;
&lt;li&gt;
  &lt;code&gt;BLOB&lt;/code&gt; and &lt;code&gt;CLOB&lt;/code&gt; &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id163e07f0&quot;&gt;data&lt;/a&gt; cannot be read by passthrough queries.&lt;/li&gt;
&lt;li&gt;Updates or deletes that include unsupported functions within a &lt;code&gt;WHERE&lt;/code&gt; clause are not allowed.&lt;/li&gt;
&lt;li&gt;Generic Connectivity does not support stored procedures.&lt;/li&gt;
&lt;li&gt;Generic Connectivity agents cannot participate in distributed transactions; they support single-site transactions only.&lt;/li&gt;
&lt;li&gt;Generic Connectivity does not support multithreaded agents.&lt;/li&gt;
&lt;li&gt;Updating &lt;code&gt;LONG&lt;/code&gt; columns with bind variables is not supported.&lt;/li&gt;
&lt;li&gt;Generic Connectivity does not support &lt;code&gt;ROWID&lt;/code&gt;s.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Compounding the issue, the HSODBC and DG4ODBC generic connectivity agents perform many of their functions by brute-force methods. Rather than interrogating the data access provider (whether ODBC or OLE DB) or DBMS to which they are connected, to learn their capabilities, many things are done by using the lowest possible function.&lt;/p&gt;
&lt;p&gt;For instance, when a &lt;code&gt;SELECT COUNT (*) FROM table@link&lt;/code&gt; is issued through Oracle SQL, the target DBMS doesn&amp;#39;t simply perform a &lt;code&gt;SELECT COUNT (*) FROM table&lt;/code&gt;.  Rather, it performs a &lt;code&gt;SELECT * FROM table&lt;/code&gt; which is used to inventory all columns in the table, and then performs and fully retrieves &lt;code&gt;SELECT field FROM table&lt;/code&gt; into an internal temporary table, where it does the &lt;code&gt;COUNT (*)&lt;/code&gt; itself, locally. Testing has confirmed this process to be the case despite Oracle documentation stating that target data sources must support &lt;code&gt;COUNT (*)&lt;/code&gt; (among other functions).&lt;/p&gt;
&lt;h3&gt;
&lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id16814bd8&quot;&gt;Virtuoso&lt;/a&gt;&amp;#39;s &lt;a href=&quot;http://dbpedia.org/resource/Virtual_Database&quot; id=&quot;link-id1185b9d0&quot;&gt;Virtual Database&lt;/a&gt; Comparison&lt;/h3&gt;
&lt;p&gt;The Virtuoso &lt;a href=&quot;http://dbpedia.org/resource/Virtuoso_Universal_Server&quot; id=&quot;link-id1666f658&quot;&gt;Universal Server&lt;/a&gt; will link/attach objects (tables, views, stored procedures) from any ODBC-accessible data source. This includes any &lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id1668aec8&quot;&gt;JDBC&lt;/a&gt;-accessible data source, through the OpenLink ODBC Driver for JDBC Data Sources.&lt;/p&gt;
&lt;p&gt;There are no limitations on the data types which can be queried or read, nor must the target DBMS have primary keys set on linked tables or views.&lt;/p&gt;
&lt;p&gt;All linked objects may be used in single-site or distributed queries, and the user need not know anything about the actual data structure, including whether the objects being queried are remote or local to Virtuoso -- all objects are made to appear as part of a Virtuoso-local &lt;a href=&quot;http://dbpedia.org/resource/Database_schema&quot; id=&quot;link-id1628c438&quot;&gt;schema&lt;/a&gt;.&lt;/p&gt;
</description></item><item><title>Short Recap of Virtuoso Basics (#3 of 5)</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-04-30#1552</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1552#comments</comments><pubDate>Thu, 30 Apr 2009 15:49:53 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2009-04-30T12:11:45-04:00</n0:modified><description>&lt;p&gt;(Third of five posts related to the &lt;a href=&quot;http://www2009.org/&quot; id=&quot;link-id0x14b582b8&quot;&gt;WWW 2009&lt;/a&gt; conference, held the week of April 20, 2009.)

&lt;/p&gt;
&lt;p&gt;There are some points that came up in conversation at WWW 2009 that I will reiterate here. We find there is still some lack of clarity in the product image, so I will here condense it.&lt;/p&gt;

&lt;p&gt;
&lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x14bf48b8&quot;&gt;Virtuoso&lt;/a&gt; is a DBMS. We pitch it primarily to the &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x16bc4490&quot;&gt;data&lt;/a&gt; web space because this is where we see the emerging frontier. Virtuoso does both &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x1223dc30&quot;&gt;SQL&lt;/a&gt; and &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x170eec88&quot;&gt;SPARQL&lt;/a&gt; and can do both at large scale and high performance. The popular perception of &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x15a05fc0&quot;&gt;RDF&lt;/a&gt; and Relational models as mutually exclusive and antagonistic poles is based on the poor scalability of early RDF implementations. What we do is implement all the RDF specifics, such as IRIs and typed literals, as native SQL types, and pair them with a cost-based optimizer that knows about all of this.&lt;/p&gt;

&lt;p&gt;If you want application-specific data structures as opposed to a schema-agnostic quad-store model (triple + graph-name), then Virtuoso can give you this too.  &lt;a href=&quot;http://docs.openlinksw.com/virtuoso/rdfsparqlintegrationmiddleware.html#rdfviews&quot; id=&quot;link-id14ddc7c8&quot;&gt;Rendering application specific data structures as RDF&lt;/a&gt; applies equally to relational data in non-Virtuoso databases because Virtuoso SQL can &lt;a href=&quot;http://docs.openlinksw.com/virtuoso/qsvdbsrv.html&quot; id=&quot;link-id14aaea70&quot;&gt;federate tables from heterogeneous DBMS&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;On top of this, there is a &lt;a href=&quot;http://docs.openlinksw.com/virtuoso/qswebserver.html&quot; id=&quot;link-id16fcde60&quot;&gt;web server built in&lt;/a&gt;, so that no extra server is needed for web services, web pages, and the like.&lt;/p&gt;

&lt;p&gt;Installation is simple: just one executable and one config file. There is a huge amount of code in &lt;a href=&quot;http://docs.openlinksw.com/virtuoso/installation.html&quot; id=&quot;link-id16767b40&quot;&gt;installers&lt;/a&gt; (application code, test suites, and such), but none of this is needed when you deploy. Scale goes from a 25MB memory footprint on the desktop to hundreds of gigabytes of RAM and endless terabytes of disk on shared-nothing clusters.&lt;/p&gt;

&lt;p&gt;Clusters (coming in Release 6) and SQL federation are &lt;a href=&quot;http://download.openlinksw.com/download/product_matrix.vsp?p=l_os&amp;amp;c=39&amp;amp;df=16&quot; id=&quot;link-id16722550&quot;&gt;commercial only&lt;/a&gt;; the rest can be had &lt;a href=&quot;http://sourceforge.net/project/showfiles.php?group_id=161622&quot; id=&quot;link-id131080a8&quot;&gt;under GPL&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To condense further:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scalable Delivery of &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id0x1060ad98&quot;&gt;Linked Data&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;SPARQL and SQL
&lt;ul&gt;
    &lt;li&gt;Arbitrary RDF Data + Relational&lt;/li&gt;
&lt;li&gt;Also From 3rd Party &lt;a href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id0x16bbce60&quot;&gt;RDBMS&lt;/a&gt;
    &lt;/li&gt;
  &lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Easy Deployment &lt;/li&gt;
&lt;li&gt;Standard Interfaces
&lt;ul&gt;
    &lt;li&gt;
      &lt;a href=&quot;http://dbpedia.org/resource/Open_Database_Connectivity&quot; id=&quot;link-id0x12e284d8&quot;&gt;ODBC&lt;/a&gt;, &lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id0xb5e1400&quot;&gt;JDBC&lt;/a&gt;, OLE DB, &lt;a href=&quot;http://dbpedia.org/resource/ADO.NET&quot; id=&quot;link-id0x15a55db8&quot;&gt;ADO&lt;/a&gt;.&lt;a href=&quot;http://dbpedia.org/resource/.NET_Framework&quot; id=&quot;link-id0x16beb070&quot;&gt;NET&lt;/a&gt;, XMLA&lt;/li&gt;
&lt;li&gt;
      &lt;a href=&quot;http://jena.sourceforge.net/&quot; id=&quot;link-id0x122b5008&quot;&gt;Jena&lt;/a&gt;, &lt;a href=&quot;http://sourceforge.net/projects/sesame/&quot; id=&quot;link-id0x148d4078&quot;&gt;Sesame&lt;/a&gt;, etc.&lt;/li&gt;
&lt;li&gt;All Web Protocols &lt;/li&gt;
  &lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
</description></item><item><title>Virtuoso RDF:  A Getting Started Guide for the Developer</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-12-17#1505</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1505#comments</comments><pubDate>Wed, 17 Dec 2008 12:31:34 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-12-17T12:41:27.000006-05:00</n0:modified><description>
&lt;p&gt;It is a long standing promise of mine to dispel the false impression that using &lt;a href=&quot;http://virtuoso.openlinksw.com/&quot; id=&quot;link-id113506d0&quot;&gt;Virtuoso&lt;/a&gt; to work with &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id115d9528&quot;&gt;RDF&lt;/a&gt; is complicated.&lt;/p&gt;

&lt;p&gt;The purpose of this presentation is to show a programmer how to put RDF into Virtuoso and how to query it.  This is done programmatically, with no confusing user interfaces.&lt;/p&gt;

&lt;p&gt;You should have a Virtuoso Open Source tree built and installed.  We will look at the LUBM benchmark demo that comes with the package.  All you need is a Unix shell.  Running the shell under Emacs (&lt;code&gt;M-x shell&lt;/code&gt;) works best, since it makes cutting and pasting between the shell and files convenient, though the open source &lt;code&gt;isql&lt;/code&gt; utility should also have command-line editing.&lt;/p&gt;

&lt;p&gt;To get started, cd into &lt;code&gt;binsrc/tests/lubm&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To verify that this works, you can do &lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;./test_server.sh virtuoso-t&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;This will test the server with the LUBM queries.  This should report 45 tests passed.  After this we will do the tests step-by-step.&lt;/p&gt;

&lt;h2&gt;Loading the &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id10f7bd90&quot;&gt;Data&lt;/a&gt;
&lt;/h2&gt; 

&lt;p&gt;The file &lt;code&gt;lubm-load.sql&lt;/code&gt; contains the commands for loading the LUBM single university qualification database.&lt;/p&gt;

&lt;p&gt;The data files themselves are in &lt;code&gt;lubm_8000&lt;/code&gt;: 15 files in RDF/XML.&lt;/p&gt;

&lt;p&gt;There is also a little ontology called &lt;code&gt;inf.nt&lt;/code&gt;.  This declares the subclass and subproperty relations used in the benchmark.&lt;/p&gt;

&lt;p&gt;So now let&amp;#39;s go through this procedure.&lt;/p&gt;

&lt;p&gt;Start the server:&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;$ virtuoso-t -f &amp;amp;
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;This starts the server in foreground mode, and puts it in the background of the shell.&lt;/p&gt;

&lt;p&gt;Now we connect to it with the isql utility.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;$ isql 1111 dba dba 
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;This gives a &lt;code&gt;SQL&amp;gt;&lt;/code&gt; prompt.  The default username and password are both &lt;code&gt;dba&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;When a command is &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id1176ce70&quot;&gt;SQL&lt;/a&gt;, it is entered directly.  If it is &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id156df468&quot;&gt;SPARQL&lt;/a&gt;, it is prefixed with the keyword &lt;code&gt;sparql&lt;/code&gt;.  This is how all the SQL clients work.  Any SQL client, such as any &lt;a href=&quot;http://dbpedia.org/resource/Open_Database_Connectivity&quot; id=&quot;link-id152d0a00&quot;&gt;ODBC&lt;/a&gt; or &lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id157ad6a0&quot;&gt;JDBC&lt;/a&gt; application, can use SPARQL if the SQL string starts with this keyword.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;lubm-load.sql&lt;/code&gt; file is quite self-explanatory. It begins with defining an SQL procedure that calls the RDF/XML load function, &lt;code&gt;DB..RDF_LOAD_RDFXML&lt;/code&gt;, for each file in a directory.&lt;/p&gt;

&lt;p&gt;Next it calls this function for the &lt;code&gt;lubm_8000&lt;/code&gt; directory under the server&amp;#39;s working directory.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;sparql 
   CLEAR GRAPH &amp;lt;lubm&amp;gt;;

sparql 
   CLEAR GRAPH &amp;lt;inf&amp;gt;;

load_lubm ( server_root() || &amp;#39;/lubm_8000/&amp;#39; );
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;Then it verifies that the right number of triples is found in the &amp;lt;lubm&amp;gt; graph.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;sparql 
   SELECT COUNT(*) 
     FROM &amp;lt;lubm&amp;gt; 
    WHERE { ?x ?y ?z } ;
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;The echo commands below this are interpreted by the isql utility, and produce output to show whether the test was passed.  They can be ignored for now.&lt;/p&gt;

&lt;p&gt;Then it adds some implied &lt;code&gt;subOrganizationOf&lt;/code&gt; triples.  This is part of setting up the LUBM test database.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;sparql 
   PREFIX  ub:  &amp;lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&amp;gt;
   INSERT 
      INTO GRAPH &amp;lt;lubm&amp;gt; 
      { ?x  ub:subOrganizationOf  ?z } 
   FROM &amp;lt;lubm&amp;gt; 
   WHERE { ?x  ub:subOrganizationOf  ?y  . 
           ?y  ub:subOrganizationOf  ?z  . 
         };
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;Then it loads the ontology file, &lt;code&gt;inf.nt&lt;/code&gt;, using the Turtle load function, &lt;code&gt;DB.DBA.TTLP&lt;/code&gt;.  The arguments of the function are the text to load, the default namespace prefix, and the &lt;a href=&quot;http://dbpedia.org/resource/Uniform_Resource_Identifier&quot; id=&quot;link-id15835550&quot;&gt;URI&lt;/a&gt; of the target graph.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;DB.DBA.TTLP ( file_to_string ( &amp;#39;inf.nt&amp;#39; ), 
              &amp;#39;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl&amp;#39;, 
              &amp;#39;inf&amp;#39; 
            ) ;
sparql 
   SELECT COUNT(*) 
     FROM &amp;lt;inf&amp;gt; 
    WHERE { ?x ?y ?z } ;
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;Then we declare that the triples in the &lt;code&gt;&amp;lt;inf&amp;gt;&lt;/code&gt; graph can be used for inference at run time.  To enable this, a SPARQL query will declare that it uses the &lt;code&gt;&amp;#39;inft&amp;#39;&lt;/code&gt; rule set.  Otherwise this has no effect.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;rdfs_rule_set (&amp;#39;inft&amp;#39;, &amp;#39;inf&amp;#39;);
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;Finally, the file issues a checkpoint to finalize the work and truncate the transaction log.  The server would eventually do this on its own.&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;checkpoint;
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;Now we are ready for querying.&lt;/p&gt;

&lt;h2&gt;Querying the Data&lt;/h2&gt; 

&lt;p&gt;The queries are given in 3 different versions: The first file, &lt;code&gt;lubm.sql&lt;/code&gt;, has the queries with most inference open coded as &lt;code&gt;UNIONs&lt;/code&gt;. The second file, &lt;code&gt;lubm-inf.sql&lt;/code&gt;, has the inference performed at run time using the ontology &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id1109faf0&quot;&gt;information&lt;/a&gt; in the &lt;code&gt;&amp;lt;inf&amp;gt;&lt;/code&gt; graph we just loaded.  The last, &lt;code&gt;lubm-phys.sql&lt;/code&gt;, relies on having the entailed triples physically present in the &lt;code&gt;&amp;lt;lubm&amp;gt;&lt;/code&gt; graph.  These entailed triples are inserted by the SPARUL commands in the &lt;code&gt;lubm-cp.sql&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;If you wish to run all the commands in a SQL file, you can type &lt;code&gt;load &amp;lt;filename&amp;gt;;&lt;/code&gt; (e.g., &lt;code&gt;load lubm-cp.sql;&lt;/code&gt;) at the &lt;code&gt;SQL&amp;gt;&lt;/code&gt; prompt. If you wish to try individual statements, you can paste them to the command line.&lt;/p&gt;

&lt;p&gt;For example: &lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;SQL&amp;gt; sparql 
   PREFIX ub: &amp;lt;http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#&amp;gt;
   SELECT * 
     FROM &amp;lt;lubm&amp;gt;
    WHERE { ?x  a                     ub:Publication                                                . 
            ?x  ub:publicationAuthor  &amp;lt;http://www.Department0.University0.edu/AssistantProfessor0&amp;gt; 
          };

VARCHAR
_______________________________________________________________________

http://www.Department0.University0.edu/AssistantProfessor0/Publication0
http://www.Department0.University0.edu/AssistantProfessor0/Publication1
http://www.Department0.University0.edu/AssistantProfessor0/Publication2
http://www.Department0.University0.edu/AssistantProfessor0/Publication3
http://www.Department0.University0.edu/AssistantProfessor0/Publication4
http://www.Department0.University0.edu/AssistantProfessor0/Publication5

6 Rows. -- 4 msec.
&lt;/pre&gt;&lt;/blockquote&gt;
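
&lt;p&gt;The same statements can also be run without an interactive session. The wrapper below is a minimal sketch: the name &lt;code&gt;run_sql&lt;/code&gt; is made up for illustration, and it assumes your &lt;code&gt;isql&lt;/code&gt; build accepts the &lt;code&gt;exec=&lt;/code&gt; argument and that the port and credentials are the ones used above.&lt;/p&gt;

```shell
# Hypothetical wrapper around the isql command-line client.
# 1111, dba, dba are the port, user, and password used earlier in this guide.
run_sql() {
  # exec= hands one statement to isql, which runs it and exits
  # instead of opening the interactive prompt (assumed to be supported).
  isql 1111 dba dba exec="$1"
}
```

&lt;p&gt;For instance, &lt;code&gt;run_sql &amp;quot;sparql SELECT COUNT(*) WHERE { ?x ?y ?z };&amp;quot;&lt;/code&gt; should print a triple count and return to the shell.&lt;/p&gt;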


&lt;p&gt;To stop the server, simply type &lt;code&gt;shutdown;&lt;/code&gt; at the &lt;code&gt;SQL&amp;gt;&lt;/code&gt; prompt.&lt;/p&gt;

&lt;p&gt;If you wish to use a &lt;a href=&quot;http://www.w3.org/TR/rdf-sparql-protocol/&quot; id=&quot;link-id11384668&quot;&gt;SPARQL protocol&lt;/a&gt; end point, just enable the HTTP listener.  This is done by adding a stanza like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre&gt;[HTTPServer]
ServerPort    = 8421
ServerRoot    = .
ServerThreads = 2
&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;The stanza goes at the end of the &lt;code&gt;virtuoso.ini&lt;/code&gt; file in the &lt;code&gt;lubm&lt;/code&gt; directory.  Then shut down and restart (type &lt;code&gt;shutdown;&lt;/code&gt; at the &lt;code&gt;SQL&amp;gt;&lt;/code&gt; prompt and then &lt;code&gt;virtuoso-t -f &amp;amp;&lt;/code&gt; at the shell prompt).&lt;/p&gt;

&lt;p&gt;Now you can connect to the end point with a web browser.  The &lt;a href=&quot;http://dbpedia.org/resource/Uniform_Resource_Locator&quot; id=&quot;link-id113d02d8&quot;&gt;URL&lt;/a&gt; is &lt;code&gt;http://localhost:8421/sparql&lt;/code&gt;. Without parameters, this will show a human readable form.  With parameters, this will execute SPARQL.&lt;/p&gt;
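
&lt;p&gt;As a rough sketch of the &amp;quot;with parameters&amp;quot; case, the shell function below sends a query to the end point with &lt;code&gt;curl&lt;/code&gt;. The function name &lt;code&gt;sparql_get&lt;/code&gt; and the JSON &lt;code&gt;Accept&lt;/code&gt; header are illustrative assumptions (the server is assumed to honor that header); only the port and path come from the stanza above.&lt;/p&gt;

```shell
# Hypothetical helper: send one SPARQL query to the local end point over HTTP GET.
# ENDPOINT matches ServerPort = 8421 from the stanza above; change it if yours differs.
ENDPOINT="http://localhost:8421/sparql"

sparql_get() {
  # -G moves the url-encoded data into the query string; the SPARQL protocol
  # expects the query text in a parameter named "query".
  curl -sG "$ENDPOINT" \
       -H "Accept: application/sparql-results+json" \
       --data-urlencode "query=$1"
}
```

&lt;p&gt;With the server running, a call such as &lt;code&gt;sparql_get &amp;#39;SELECT * WHERE { ?s ?p ?o } LIMIT 5&amp;#39;&lt;/code&gt; should return a small result set.&lt;/p&gt;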

&lt;p&gt;We have shown how to load and query RDF with Virtuoso using the most basic SQL tools. Next you can access RDF from, for example, &lt;a href=&quot;http://dbpedia.org/resource/PHP&quot; id=&quot;link-id142d0ba0&quot;&gt;PHP&lt;/a&gt;, using the PHP ODBC interface.&lt;/p&gt;

&lt;p&gt;To see how to use &lt;a href=&quot;http://jena.sourceforge.net/&quot; id=&quot;link-id117074f0&quot;&gt;Jena&lt;/a&gt; or &lt;a href=&quot;http://sourceforge.net/projects/sesame/&quot; id=&quot;link-id1103c9b0&quot;&gt;Sesame&lt;/a&gt; with Virtuoso, look at &lt;a href=&quot;http://docs.openlinksw.com/virtuoso/rdfnativestorageproviders.html&quot; id=&quot;link-id15488ce8&quot;&gt;Native RDF Storage Providers&lt;/a&gt;. To see how RDF data types are supported, see &lt;a href=&quot;http://docs.openlinksw.com/virtuoso/VirtuosoDriverJDBC.html#jdbcrdf&quot; id=&quot;link-id15784a40&quot;&gt;Extension datatype for RDF&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;To work with large volumes of data, you must add memory to the configuration file and use the row-autocommit mode, i.e., do &lt;code&gt;log_enable (2);&lt;/code&gt; before the load command. Otherwise Virtuoso will do the entire load as a single transaction, and will run out of rollback space.  See &lt;a href=&quot;http://docs.openlinksw.com/virtuoso/&quot; id=&quot;link-id111410f0&quot;&gt;documentation&lt;/a&gt; for more.&lt;/p&gt;</description></item><item><title>See the Lite:  Embeddable/Background Virtuoso starts at 25MB</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-12-17#1503</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1503#comments</comments><pubDate>Wed, 17 Dec 2008 09:34:12 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-12-17T12:03:49-05:00</n0:modified><description>&lt;p&gt;We have received many requests for an embeddable-scale &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x1cd69650&quot;&gt;Virtuoso&lt;/a&gt;.  In response to this, we have added a Lite mode, where the initial size of a server process is a tiny fraction of what the initial size would be with default settings.  With 2MB of disk cache buffers (ini file setting, &lt;code&gt;NumberOfBuffers = 256&lt;/code&gt;), the process size stays under 30MB on 32-bit Linux.&lt;/p&gt;

&lt;p&gt;The value of this is that one can now have &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x1ce89340&quot;&gt;RDF&lt;/a&gt; and full text indexing on the desktop without running a Java VM or any other memory-intensive software.  And of course, all of &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x1cfc9288&quot;&gt;SQL&lt;/a&gt; (transactions, stored procedures, etc.) is in the same embeddably-sized container.&lt;/p&gt;

&lt;p&gt;The Lite executable is a full Virtuoso executable; the Lite mode is controlled by a switch in the configuration file.  The executable size is about 10MB for 32-bit Linux.  A database created in the Lite mode will be converted into a fully-featured database (tables and indexes are added, among other things) if the server is started with the Lite setting &amp;quot;off&amp;quot;; functionality can be reverted to Lite mode, though it will now consume somewhat more memory, etc.&lt;/p&gt;

&lt;p&gt;Lite mode offers full SQL and &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x1c511da8&quot;&gt;SPARQL&lt;/a&gt;/SPARUL (via SPASQL), but disables all &lt;a href=&quot;http://dbpedia.org/resource/Hypertext_Transfer_Protocol&quot; id=&quot;link-id0x1dac1950&quot;&gt;HTTP&lt;/a&gt;-based services (WebDAV, application hosting, etc.).  Clients can still use all typical database access mechanisms (i.e., &lt;a href=&quot;http://dbpedia.org/resource/Open_Database_Connectivity&quot; id=&quot;link-id0xb19a488&quot;&gt;ODBC&lt;/a&gt;, &lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id0x1d93ee40&quot;&gt;JDBC&lt;/a&gt;, OLE-DB, &lt;a href=&quot;http://dbpedia.org/resource/ADO.NET&quot; id=&quot;link-id0x1ce391c0&quot;&gt;ADO&lt;/a&gt;.&lt;a href=&quot;http://dbpedia.org/resource/.NET_Framework&quot; id=&quot;link-id0xacf1168&quot;&gt;NET&lt;/a&gt;, and XMLA) to connect, including the &lt;a href=&quot;http://jena.sourceforge.net/&quot; id=&quot;link-id0xaaf5b58&quot;&gt;Jena&lt;/a&gt; and &lt;a href=&quot;http://sourceforge.net/projects/sesame/&quot; id=&quot;link-id0x1b1e4328&quot;&gt;Sesame&lt;/a&gt; frameworks for RDF.  ODBC now offers full support of RDF &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x1cfc9f78&quot;&gt;data&lt;/a&gt; types for &lt;a href=&quot;http://dbpedia.org/resource/C%2B%2B&quot; id=&quot;link-id0xa6059d8&quot;&gt;C&lt;/a&gt;-based clients.  A Redland-compatible API also exists, for use with Redland v1.0.8 and later. &lt;/p&gt;

&lt;p&gt;Especially for embedded use, we now allow restricting the listener to be a Unix socket, which allows client connections only from the localhost.&lt;/p&gt;

&lt;p&gt;Shipping an embedded Virtuoso is easy.  It just takes one executable and one configuration file.  Performance is generally comparable to &amp;quot;normal&amp;quot; mode, except that Lite will be somewhat less scalable on multicore systems.&lt;/p&gt;

&lt;p&gt;The Lite mode will be included in the next Virtuoso 5 Open Source release.&lt;/p&gt;</description></item><item><title>Virtuoso - Are We Too Clever for Our Own Good? (updated)</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-10-26#1467</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1467#comments</comments><pubDate>Sun, 26 Oct 2008 12:15:35 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-10-27T12:07:58-04:00</n0:modified><description>&lt;p&gt;&amp;quot;Physician, heal thyself,&amp;quot; it is said. We profess to say what the messaging of the &lt;a href=&quot;http://dbpedia.org/resource/Semantic_Web&quot; id=&quot;link-id0x1b4a25f0&quot;&gt;semantic web&lt;/a&gt; ought to be, but is our own perfect?&lt;/p&gt;

&lt;p&gt;I will here engage in some critical introspection as well as amplify on some answers given to &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x1e4f9928&quot;&gt;Virtuoso&lt;/a&gt;-related questions in recent times.&lt;/p&gt;

&lt;p&gt;I use some conversations from the &lt;a href=&quot;http://dbpedia.org/resource/Vienna&quot; id=&quot;link-id0x1e6c0ca8&quot;&gt;Vienna&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id0x1e56df88&quot;&gt;Linked Data&lt;/a&gt; Practitioners meeting as a starting point. These views are mine and are limited to the Virtuoso server. These do not apply to the &lt;a href=&quot;http://dbpedia.org/resource/OpenLink_Data_Spaces&quot; id=&quot;link-id0x1e680440&quot;&gt;ODS&lt;/a&gt; (&lt;a href=&quot;http://dbpedia.org/resource/OpenLink_Data_Spaces&quot; id=&quot;link-id0x1e140068&quot;&gt;OpenLink Data Spaces&lt;/a&gt;) applications line, &lt;a href=&quot;http://oat.openlinksw.com/&quot; id=&quot;link-id0x1f4ba630&quot;&gt;OAT&lt;/a&gt; (&lt;a href=&quot;http://oat.openlinksw.com/&quot; id=&quot;link-id0x1ba4bac8&quot;&gt;OpenLink Ajax Toolkit&lt;/a&gt;), or &lt;a href=&quot;http://ode.openlinksw.com/&quot; id=&quot;link-id0x1d4159b0&quot;&gt;ODE&lt;/a&gt; (&lt;a href=&quot;http://ode.openlinksw.com/&quot; id=&quot;link-id0x1e973c80&quot;&gt;OpenLink Data Explorer&lt;/a&gt;).&lt;/p&gt;

&lt;h3&gt;&amp;quot;It is not always clear what the main thrust is, we get the impression that you are spread too thin,&amp;quot; said &lt;a href=&quot;http://www.informatik.uni-leipzig.de/~auer/foaf.rdf#me&quot; id=&quot;link-id0x1f8bafe0&quot;&gt;Sören Auer&lt;/a&gt;.&lt;/h3&gt;

&lt;p&gt;Well, personally, I am all for core competence. This is why I do not participate in all the online conversations and groups as much as I could, for example. Time and energy are critical resources and must be invested where they make a difference. In this case, the real core competence is running in the database race. This in itself, come to think of it, is a pretty broad concept.&lt;/p&gt;

&lt;p&gt;This is why we put a lot of emphasis on Linked Data and the &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x200bd1f0&quot;&gt;Data&lt;/a&gt; Web for now, as this is the emerging game. This is a deliberate choice, not an outside imperative or built-in limitation. More specifically, this means exposing any pre-existing relational data as linked data plus being the definitive &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0x1fb03528&quot;&gt;RDF&lt;/a&gt; store.&lt;/p&gt;

&lt;p&gt;We can do this because we own our database and &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x1e7dcc70&quot;&gt;SQL&lt;/a&gt; and data access middleware and have a history of connecting to any &lt;a href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id0x1e9baf18&quot;&gt;RDBMS&lt;/a&gt; out there.&lt;/p&gt;

&lt;p&gt;The principal message we have been hearing from the RDF field is the call for scale of triple storage. This is even louder than the call for relational mapping. We believe that in time mapping will exceed triple storage as such, once we get some real production strength mappings deployed, enough to outperform RDF warehousing.&lt;/p&gt;

&lt;p&gt;There are also RDF middleware things like RDF-ization and demand-driven web harvesting (i.e., the so-called Sponger). These are &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x1f5f6b78&quot;&gt;SPARQL&lt;/a&gt; options, thus accessed via standard interfaces. We have little desire to create our own languages or APIs, or to tell people how to program. This is why we recently introduced &lt;a href=&quot;http://sourceforge.net/projects/sesame/&quot; id=&quot;link-id0x206818c8&quot;&gt;Sesame&lt;/a&gt;- and &lt;a href=&quot;http://jena.sourceforge.net/&quot; id=&quot;link-id0x202b3348&quot;&gt;Jena&lt;/a&gt;-compatible APIs to our RDF store. From what we hear, these work. On the other hand, we do not hesitate to move beyond the standards when there is obvious value or necessity. This is why we brought SPARQL up to and beyond SQL expressivity. It is not a case of E3 (Embrace, Extend, Extinguish).&lt;/p&gt;

&lt;p&gt;Now, this message could be better reflected in our material on the web. This &lt;a href=&quot;http://dbpedia.org/resource/Blog&quot; id=&quot;link-id0x1c82e508&quot;&gt;blog&lt;/a&gt; is a rather informal step in this direction; more is to come. For now we concentrate on delivering.&lt;/p&gt;

&lt;p&gt;The conventional communications wisdom is to split the message by target audience. For this, we should split the RDF, relational, and web services messages from each other. We believe that a challenger, like the semantic web technology stack, must have a compelling message to tell for it to be interesting. This is not a question of research prototypes. The new technology cannot lack something the installed technology takes for granted.&lt;/p&gt;

&lt;p&gt;This is why we do not tend to show things like how to insert and query a few triples: No business out there will insert and query triples for the sake of triples. There must be a more compelling story: for example, turning the whole world into a database. This is why our examples start with things like turning the &lt;a href=&quot;http://dbpedia.org/resource/TPC-H&quot; id=&quot;link-id0x20832510&quot;&gt;TPC-H&lt;/a&gt; database into RDF, queries and all. Anything less is not interesting. Why would an enterprise that has business intelligence and integration issues way more complex than the rather stereotypical TPC-H even look at a technology that pretends to be all for integration and all for expressivity of queries, yet cannot answer the first question of the entry exam?&lt;/p&gt;

&lt;p&gt;The world out there is complex. But maybe we ought to make some simple tutorials? So, as a call to the people out there, tell us what a good tutorial would be. It is more a matter of figuring out what is already out there, adapting it, and making a sort of compatibility list.  Jena and Sesame stuff ought to run as is. We could offer a webinar to all the data web luminaries showing how to promote the data web message with Virtuoso. After all, why not show it on the best platform?&lt;/p&gt;

&lt;h3&gt;&amp;quot;You are arrogant. When I read your papers or documentation, the impression I get is that you say you are smart and the reader is stupid.&amp;quot;&lt;/h3&gt;

&lt;p&gt;We should answer in multiple  parts.&lt;/p&gt;

&lt;p&gt;For general collateral, like web sites and documentation:&lt;/p&gt;

&lt;p&gt;The web site gives a confused product image.  For the Virtuoso product, we should divide at the top into&lt;/p&gt;

&lt;ul&gt;  
&lt;li&gt; Data web and RDF - Host linked data, expose relational assets as linked data;&lt;/li&gt;
&lt;li&gt; Relational Database - Full function, high performance, open source, Federated/Virtual Relational DBMS, expose heterogeneous RDB assets through one point of contact for integration;&lt;/li&gt;
&lt;li&gt; Web Services - access all the above over standard protocols, dynamic web pages, web hosting.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For each point, one simple statement.  We all know what the above things mean?&lt;/p&gt;

&lt;p&gt;Then we add a new point about scalability that impacts all the above, namely the Virtuoso version 6 Cluster, meaning that you can do all these things at 10 to 1000 times the scale. This means this much more data or, in some cases, this many more requests per second. This too is clear.&lt;/p&gt;

&lt;p&gt;As far as I am concerned, hosting Java or .&lt;a href=&quot;http://dbpedia.org/resource/.NET_Framework&quot; id=&quot;link-id0x20283a88&quot;&gt;NET&lt;/a&gt; does not have to be on the front page. Also, we have no great interest in going against &lt;a href=&quot;http://dbpedia.org/resource/Apache&quot; id=&quot;link-id0x2024a068&quot;&gt;Apache&lt;/a&gt; when it comes to a web-server-only situation. The fact that we have a web listener is important for some things, but our claim to fame does not rest on this.&lt;/p&gt;

&lt;p&gt;Then for documentation and training materials: The documentation should be better. Specifically it should have more of a how-to dimension since nobody reads the whole thing anyhow. About online tutorials, the order of presentation should be different. They do not really reflect what is important at the present moment either.&lt;/p&gt;

&lt;p&gt;Now for conference papers: Since taking the data web as a focus area, we have submitted some papers and had some rejected because they did not have enough references and did not explain what is obvious to us.&lt;/p&gt;

&lt;p&gt;I think that the communications failure in this case is that we want to talk about end-to-end solutions and the reviewers expect research. For us, the solution is interesting and exists only if there is an adequate functionality mix for addressing a specific use case. This is why we do not write a paper about the query cost model alone: the cost model, while indispensable, is a thing that is taken for granted where we come from. So we mention the RDF adaptations to the cost model, as these are important to the whole, but do not find them to be the justification for a whole paper. If we wrote papers on this basis, we would have to write five times as many. Maybe we ought to.&lt;/p&gt;

&lt;h3&gt;&amp;quot;Virtuoso is very big and very difficult&amp;quot;&lt;/h3&gt;

&lt;p&gt;One thing that is not obvious from the Virtuoso packaging is that the minimum installation is an executable under 10MB and a config file. Two files.&lt;/p&gt;

&lt;p&gt;This gives you SQL and SPARQL out of the box.  Adding &lt;a href=&quot;http://dbpedia.org/resource/Open_Database_Connectivity&quot; id=&quot;link-id0x1ee61058&quot;&gt;ODBC&lt;/a&gt; and &lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id0x1b8c31c0&quot;&gt;JDBC&lt;/a&gt; clients is as simple as it gets. After this, there is basic database functionality. Tuning is a matter of a few parameters that are explained on this blog and elsewhere. Also, the full scale installation is available as an Amazon EC2 image, so no installation required.&lt;/p&gt;

&lt;p&gt;Now for the difficult side:&lt;/p&gt;

&lt;p&gt;Use SQL and SPARQL; use stored procedures whenever there is server-side business logic. For some time-critical web pages, use VSP. Do not use VSPX. Otherwise, use whatever you are used to, be it &lt;a href=&quot;http://dbpedia.org/resource/PHP&quot; id=&quot;link-id0x20a13c00&quot;&gt;PHP&lt;/a&gt; or Java or anything else. For web services, simple is best. Stick to basics. &amp;quot;The engineer is one who can invent a simple thing.&amp;quot; Use SQL statements rather than the admin UI.&lt;/p&gt;

&lt;p&gt;Know that you can start a server with no database file and you get an initial database with nothing extra. The demo database, the way it is produced by installers, is cluttered.&lt;/p&gt;

&lt;p&gt;We should put this into a couple of use case oriented how-tos.&lt;/p&gt;

&lt;p&gt;Also, we should create a network of &amp;quot;friendly local Virtuoso geeks&amp;quot; for providing basic training and services so we do not have to explain these things all the time. To all you data-web-ers out there: please sign up and we will provide instructions, etc. Contact Yrjänä Rankka (ghard[at-sign]openlinksw.com), or go through the mailing lists; do not contact me directly.&lt;/p&gt;

&lt;h3&gt;&amp;quot;OK, we understand that you may be good at the large end of the spectrum but how do you reconcile this with the lightweight or embedded end, like the semantic desktop?&amp;quot;&lt;/h3&gt;

&lt;p&gt;Now, what is good for one end is usually good for the other. Namely, a database, no matter the scale, needs to have space efficient storage, fast index lookup, and correct query plans. Then there are things that occur only at the high-end, like clustering, but these are separate things. For embedding, the initial memory footprint needs to be small. With Virtuoso, this is accomplished by leaving out some 200 built-in tables and 100,000 lines of SQL procedures that are normally in by default, supporting things such as DAV and diverse other protocols. After all, if SPARQL is all one wants these are not needed.&lt;/p&gt;

&lt;p&gt;If one really wants to do one&amp;#39;s server logic (like web listener and thread dispatching) oneself, this is not impossible but requires some advice from us. On the other hand, if one wants to have logic for security close to the data, then using stored procedures is recommended; these execute right next to the data, and support inline SPARQL and SQL. Depending on the license status of the other code, some special licensing arrangements may apply.&lt;/p&gt;

&lt;p&gt;We are talking about such things with different parties at present.&lt;/p&gt;

&lt;h3&gt;&amp;quot;How webby are you?  What is webby?&amp;quot;&lt;/h3&gt;

&lt;p&gt;&amp;quot;Webby means distributed, heterogeneous, open; not monolithic consolidation of everything.&amp;quot;&lt;/p&gt;

&lt;p&gt;We are philosophically webby. We come from open standards; we are after all called OpenLink; our history consists of connecting things. We believe in choice: the user should be able to pick the best of breed for components and have them work together. We cannot and do not wish to force replacement of existing assets. Transforming data on the fly and connecting systems, leaving data where it originally resides, is the first preference. For the data web, the first preference is a federation of independent SPARQL end points. When there is harvesting, we prefer to do it on demand, as with our Sponger. With the immense amount of data out there we believe in finding what is relevant &lt;i&gt;when&lt;/i&gt; it is relevant, preferably close at hand, leveraging things like social networks. With a data web, many things which are now siloized, such as marketplaces and social networks, will return to the open.&lt;/p&gt;

&lt;p&gt;Google-style crawling of everything becomes less practical if one needs to run complex &lt;i&gt;ad hoc&lt;/i&gt; queries against the mass of data. For these types of scenarios, if one needs to warehouse, the data cloud will offer solutions where one pays for database on demand. While we believe in loosely coupled federation where possible, we have serious work on the scalability side for the data center and the compute-on-demand cloud.&lt;/p&gt;

&lt;h3&gt;&amp;quot;How does OpenLink see the next five years unfolding?&amp;quot;&lt;/h3&gt;

&lt;p&gt;Personally, I think we have the basics for the birth of a new inflection in the &lt;a href=&quot;http://dbpedia.org/resource/Knowledge&quot; id=&quot;link-id0x1fb9ae58&quot;&gt;knowledge&lt;/a&gt; economy. The &lt;a href=&quot;http://dbpedia.org/resource/Uniform_Resource_Identifier&quot; id=&quot;link-id0x1f07c648&quot;&gt;URI&lt;/a&gt; is the unit of exchange; its value and competitive edge lie in the data it links you with. A name without context is worth little, but as a name gets more use, more &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id0x1f007d60&quot;&gt;information&lt;/a&gt; can be found through that name. This is anything from financial statistics, to legal precedents, to news reporting or government data. Right now, if the SEC just added one line of markup to the XBRL template, this would instantaneously make all SEC-mandated reporting into linked data via GRDDL.&lt;/p&gt;

&lt;p&gt;The URI is a carrier of brand. An information brand gets traffic and references, and this can be monetized in diverse ways. The key word is &lt;i&gt;context&lt;/i&gt;. Information overload is here to stay, and only better context offers the needed increase in productivity to stay ahead of the flood.&lt;/p&gt;

&lt;p&gt;Semantic technologies on the whole can help with this. Why these should be semantic web or data web technologies as opposed to just semantic is the linked data value proposition. Even smart islands are still islands. Agility, scale, and scope depend on the possibility of combining things. Therefore common terminologies, dereferenceability, and discoverability are important. Without these, we are at best dealing with closed systems, however smart they may be. The expert systems of the 1980s are a case in point.&lt;/p&gt;

&lt;p&gt;Ever since the .com era, the &lt;a href=&quot;http://dbpedia.org/resource/Uniform_Resource_Locator&quot; id=&quot;link-id0x2048e670&quot;&gt;URL&lt;/a&gt; has been a brand. Now it becomes a URI. Thus, entirely hiding the URI from the user experience is not always desirable. The URI is a sort of handle on the provenance and where more can be found; besides, people are already used to these.&lt;/p&gt;

&lt;p&gt;With linked data, information value-add products become easy to build and deploy. They can be basically just canned SPARQL queries combining data in a useful and insightful manner. And where there is traffic there can be monetization, whether by advertising, subscription, or other means. Such possibilities are a natural adjunct to the blogosphere. To publish analysis, one no longer needs to be a think tank or media company. We could call this scenario the birth of a meshup economy.&lt;/p&gt;

&lt;p&gt;For OpenLink itself, this is our roadmap. The immediate future is about getting our high end offerings like clustered RDF storage generally available, both on the cloud and for private data centers. Ourselves, we will offer the whole &lt;a href=&quot;http://community.linkeddata.org/dataspace/organization/lod#this&quot; id=&quot;link-id0x1c696170&quot;&gt;Linked Open Data&lt;/a&gt; cloud as a database. The single feature to come in version 2 of this is fully automatic partitioning and repartitioning for on-demand scale; now, you have to choose how many partitions you have.&lt;/p&gt;

&lt;p&gt;This makes some things possible that were hard thus far.&lt;/p&gt;

&lt;p&gt;On the mapping front, we go for real-scale data integration scenarios where we can show that SPARQL can unify terms and concepts across databases, yet bring no added cost for complex queries. Enterprises can use their existing warehouses and have an added level of abstraction, the possibility of cross systems interlinking, the advantages of using the same taxonomies and ontologies across systems, and so forth.&lt;/p&gt;

&lt;p&gt;Then there will be developments in the direction of smarter web harvesting on demand with the Virtuoso &lt;a href=&quot;http://virtuoso.openlinksw.com/Whitepapers/html/VirtSpongerWhitePaper.html&quot; id=&quot;link-id0x206ab780&quot;&gt;Sponger&lt;/a&gt;, and federation of heterogeneous SPARQL end points. The federation is not so unlike clustering, except the time scales are 2 orders of magnitude longer. The work on SPARQL end point statistics and data set description and discovery is a good development in the community.&lt;/p&gt;

&lt;p&gt;Then there will be NLP integration, as exemplified by the Open Calais linked data wrapper and more.&lt;/p&gt;

&lt;p&gt;Can we pull this off or is this being spread too thin? We know from experience that all this can be accomplished. Scale is already here; we show it with the billion triples set. Mapping is here; we showed it last in the Berlin Benchmark. We will also show some TPC-H results after we get a little quiet after the ISWC event.  Then there is ongoing maintenance but with this we have shown a steady turnaround and quick time to fix for pretty much anything.&lt;/p&gt;</description></item><item><title>Transitivity and Graphs for SQL</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-09-08#1435</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1435#comments</comments><pubDate>Mon, 08 Sep 2008 09:41:24 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-09-08T15:43:07-04:00</n0:modified><description>&lt;div&gt;
&lt;div style=&quot;display:none;&quot;&gt;Transitivity and Graphs for SQL&lt;/div&gt;
&lt;h2&gt;Background&lt;/h2&gt; 

&lt;p&gt;I have mentioned on a couple of prior occasions that basic graph operations ought to be integrated into the &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0xa1a18c58&quot;&gt;SQL&lt;/a&gt; query language.&lt;/p&gt;

&lt;p&gt;The history of databases is by and large about moving from specialized applications toward a generic platform. The introduction of the DBMS itself is the archetypal example.  It is all about extracting the common features of applications and making these the features of a platform instead.&lt;/p&gt;

&lt;p&gt;It is now time to apply this principle to graph traversal.&lt;/p&gt;

&lt;p&gt;The rationale is that graph operations are somewhat tedious to write in a parallelizable, latency-tolerant manner. Writing them as one would for memory-based &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0xaf8c730&quot;&gt;data&lt;/a&gt; structures is easier but totally unscalable as soon as there is any latency involved, i.e., disk reads or messages between cluster peers.&lt;/p&gt;

&lt;p&gt;The ad-hoc nature and very large volume of &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0xae41ef0&quot;&gt;RDF&lt;/a&gt; data make this a timely question.  Up until now, the answer to this question has been to materialize any implied facts in RDF stores.  If &lt;i&gt;a&lt;/i&gt; was part of &lt;i&gt;b&lt;/i&gt;, and &lt;i&gt;b&lt;/i&gt; part of &lt;i&gt;c&lt;/i&gt;, the implied fact that &lt;i&gt;a&lt;/i&gt; is part of &lt;i&gt;c&lt;/i&gt; would be inserted explicitly into the database as a pre-query step.&lt;/p&gt;
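
&lt;p&gt;As a minimal sketch of such a pre-query materialization step, one round of inference in plain SQL could look like the following. The &lt;code&gt;part_of&lt;/code&gt; table and its columns are hypothetical and serve only as an illustration.&lt;/p&gt;

&lt;blockquote&gt;
 &lt;pre&gt;&lt;code&gt;-- Hypothetical table: &amp;quot;part&amp;quot; is a direct or implied part of &amp;quot;whole&amp;quot;.
CREATE TABLE &amp;quot;part_of&amp;quot; 
   (&amp;quot;part&amp;quot; INT, 
    &amp;quot;whole&amp;quot; INT, 
    PRIMARY KEY (&amp;quot;part&amp;quot;, &amp;quot;whole&amp;quot;)
   );
INSERT INTO &amp;quot;part_of&amp;quot; VALUES (1, 2);   -- a is part of b
INSERT INTO &amp;quot;part_of&amp;quot; VALUES (2, 3);   -- b is part of c

-- One round of materialization: store every implied two-step fact
-- that is not yet present.  With the two rows above, this adds (1, 3).
INSERT INTO &amp;quot;part_of&amp;quot; (&amp;quot;part&amp;quot;, &amp;quot;whole&amp;quot;)
   SELECT DISTINCT 
         &amp;quot;x&amp;quot;.&amp;quot;part&amp;quot;, 
         &amp;quot;y&amp;quot;.&amp;quot;whole&amp;quot;
      FROM &amp;quot;part_of&amp;quot; &amp;quot;x&amp;quot;, 
           &amp;quot;part_of&amp;quot; &amp;quot;y&amp;quot;
      WHERE &amp;quot;x&amp;quot;.&amp;quot;whole&amp;quot; = &amp;quot;y&amp;quot;.&amp;quot;part&amp;quot;
         AND NOT EXISTS 
            (SELECT 1 
                FROM &amp;quot;part_of&amp;quot; &amp;quot;z&amp;quot; 
                WHERE &amp;quot;z&amp;quot;.&amp;quot;part&amp;quot; = &amp;quot;x&amp;quot;.&amp;quot;part&amp;quot; 
                   AND &amp;quot;z&amp;quot;.&amp;quot;whole&amp;quot; = &amp;quot;y&amp;quot;.&amp;quot;whole&amp;quot;
            );&lt;/code&gt;
 &lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;For chains longer than two steps, the last statement has to be repeated until it inserts no more rows, and the stored closure has to be maintained whenever the base data changes.&lt;/p&gt;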

&lt;p&gt;This is simple and often efficient, but tends to have the downside that one makes a specialized warehouse for each new type of query.  The activity becomes less ad-hoc.&lt;/p&gt;

&lt;p&gt;Also, this becomes next to impossible when the scale approaches web scale, or if some of the data is liable to be included in, or excluded from, the set being analyzed from one query to the next.  This is why with &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0xb68f9d0&quot;&gt;Virtuoso&lt;/a&gt; we have tended to favor inference on demand (&amp;quot;backward chaining&amp;quot;) and mapping of relational data into RDF without copying.&lt;/p&gt;

&lt;p&gt;The SQL world has taken steps towards dealing with recursion with the &lt;code&gt;WITH - UNION&lt;/code&gt; construct which allows definition of recursive views.  The idea there is to define, for example, a tree walk as a &lt;code&gt;UNION&lt;/code&gt; of the data of the starting node plus the recursive walk of the starting node&amp;#39;s immediate children.&lt;/p&gt;
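
&lt;p&gt;For illustration, a tree walk in this style might be written as below. This is only a sketch in SQL:1999-style syntax: the &lt;code&gt;part_tree&lt;/code&gt; table and its columns are hypothetical, and the exact spelling (for instance, whether the &lt;code&gt;RECURSIVE&lt;/code&gt; keyword is required or accepted) varies between products.&lt;/p&gt;

&lt;blockquote&gt;
 &lt;pre&gt;&lt;code&gt;-- Hypothetical tree: each row links a node to one immediate child.
CREATE TABLE &amp;quot;part_tree&amp;quot; 
   (&amp;quot;parent&amp;quot; INT, 
    &amp;quot;child&amp;quot; INT, 
    PRIMARY KEY (&amp;quot;parent&amp;quot;, &amp;quot;child&amp;quot;)
   );

WITH RECURSIVE &amp;quot;walk&amp;quot; (&amp;quot;node&amp;quot;) AS 
   (VALUES (1)                           -- the starting node
    UNION ALL 
    SELECT &amp;quot;t&amp;quot;.&amp;quot;child&amp;quot;                   -- plus the children of every node reached so far
       FROM &amp;quot;part_tree&amp;quot; &amp;quot;t&amp;quot;, 
            &amp;quot;walk&amp;quot; &amp;quot;w&amp;quot; 
       WHERE &amp;quot;t&amp;quot;.&amp;quot;parent&amp;quot; = &amp;quot;w&amp;quot;.&amp;quot;node&amp;quot;
   )
SELECT &amp;quot;node&amp;quot; 
   FROM &amp;quot;walk&amp;quot;;&lt;/code&gt;
 &lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;Note that the direction of the walk, from parent to child, is fixed in the text of the view itself.&lt;/p&gt;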

&lt;p&gt;The main problem with this is that I do not see very well how a SQL optimizer could effectively rearrange queries involving &lt;code&gt;JOIN&lt;/code&gt;s between such recursive views.  This model of recursion seems to lose SQL&amp;#39;s non-procedural nature.  One can no longer easily rearrange &lt;code&gt;JOIN&lt;/code&gt;s based on what data is given and what is to be retrieved.  If the recursion is written from root to leaf, it is not obvious how to do this from leaf to root.  At any rate, queries written in this way are so complex to write, let alone optimize, that I decided to take another approach.&lt;/p&gt;

&lt;p&gt;Take a question like &amp;quot;list the parts of products of category &lt;i&gt;C&lt;/i&gt; which have materials that are classified as toxic.&amp;quot;  Suppose that the product categories are a tree, the product parts are a tree, and the materials classification is a tree taxonomy where &amp;quot;toxic&amp;quot; has a multilevel substructure.&lt;/p&gt;

&lt;p&gt;Depending on the count of products and materials, the query can be evaluated by going from products to parts to materials and then climbing up the materials tree to see if the material is toxic. Alternatively, one could do it in reverse, starting with the different toxic materials, looking up the parts containing these, going up the part tree to the product, and up the product hierarchy to see if the product is in the right category.  One should be able to evaluate the identical query either way depending on what indices exist, what the cardinalities of the relations are, and so forth; this is regular cost-based optimization.&lt;/p&gt;

&lt;p&gt;Especially with RDF, there are many problems of this type.  In regular SQL, it is a long-standing cultural practice to flatten hierarchies, but this is not the case with RDF.&lt;/p&gt;

&lt;p&gt;In Virtuoso, we see &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0xb3bdcc0&quot;&gt;SPARQL&lt;/a&gt; as reducing to SQL.  Any RDF-oriented database-engine or query-optimization feature is accessed via SQL.  Thus, if we address run-time-recursion in the Virtuoso query engine, this becomes, &lt;i&gt;ipso facto&lt;/i&gt;, an SQL feature.  Besides, we remember that SQL is a much more mature and expressive language than the current SPARQL recommendation.&lt;/p&gt;

&lt;h2&gt; SQL and Transitivity &lt;/h2&gt;

&lt;p&gt;We will here look at some simple social network queries.  A later article will show how to do more general graph operations. We extend the SQL derived table construct, i.e., &lt;code&gt;SELECT&lt;/code&gt; in another &lt;code&gt;SELECT&lt;/code&gt;&amp;#39;s &lt;code&gt;FROM&lt;/code&gt; clause, with a &lt;code&gt;TRANSITIVE&lt;/code&gt; clause.&lt;/p&gt;

&lt;p&gt;Consider the data:&lt;/p&gt;

&lt;blockquote&gt;
 &lt;pre&gt;&lt;code&gt;CREATE TABLE &amp;quot;knows&amp;quot; 
   (&amp;quot;p1&amp;quot; INT, 
    &amp;quot;p2&amp;quot; INT, 
    PRIMARY KEY (&amp;quot;p1&amp;quot;, &amp;quot;p2&amp;quot;)
   );
ALTER INDEX &amp;quot;knows&amp;quot; 
   ON &amp;quot;knows&amp;quot; 
   PARTITION (&amp;quot;p1&amp;quot; INT);
CREATE INDEX &amp;quot;knows2&amp;quot; 
   ON &amp;quot;knows&amp;quot; (&amp;quot;p2&amp;quot;, &amp;quot;p1&amp;quot;) 
   PARTITION (&amp;quot;p2&amp;quot; INT);
&lt;/code&gt;
 &lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;We represent a social network with the many-to-many relation &amp;quot;knows&amp;quot;.  The persons are identified by integers.&lt;/p&gt;

&lt;blockquote&gt;
 &lt;pre&gt;&lt;code&gt;INSERT INTO &amp;quot;knows&amp;quot; VALUES (1, 2);
INSERT INTO &amp;quot;knows&amp;quot; VALUES (1, 3);
INSERT INTO &amp;quot;knows&amp;quot; VALUES (2, 4);&lt;/code&gt;
 &lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;SELECT * 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
               &amp;quot;p1&amp;quot;, 
            &amp;quot;p2&amp;quot; 
         FROM &amp;quot;knows&amp;quot;
        ) &amp;quot;k&amp;quot; 
   WHERE &amp;quot;k&amp;quot;.&amp;quot;p1&amp;quot; = 1;&lt;/code&gt;&lt;/pre&gt;&lt;/blockquote&gt;

&lt;p&gt;We obtain the result:&lt;/p&gt;

&lt;blockquote&gt;
&lt;table width=&quot;100&quot;&gt;
&lt;tr&gt;
    &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;p1&lt;/th&gt;
    &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;p2&lt;/th&gt;
  &lt;/tr&gt;
&lt;tr&gt;
    &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;3&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr&gt;
    &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;2&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr&gt;
    &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;4&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/blockquote&gt;

&lt;p&gt;The operation is reversible:&lt;/p&gt;

&lt;blockquote&gt;
 &lt;pre&gt;&lt;code&gt;SELECT * 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
               &amp;quot;p1&amp;quot;, 
            &amp;quot;p2&amp;quot; 
         FROM &amp;quot;knows&amp;quot;
        ) &amp;quot;k&amp;quot; 
   WHERE &amp;quot;k&amp;quot;.&amp;quot;p2&amp;quot; = 4;
&lt;/code&gt;
 &lt;/pre&gt;

&lt;table width=&quot;100&quot;&gt;
&lt;tr&gt;
    &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;p1&lt;/th&gt;
    &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;p2&lt;/th&gt;
  &lt;/tr&gt;
&lt;tr&gt;
    &lt;td align=&quot;center&quot;&gt;2&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;4&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr&gt;
    &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;4&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/blockquote&gt;

&lt;p&gt;Since we now give &lt;i&gt;p2&lt;/i&gt;, we traverse from &lt;i&gt;p2&lt;/i&gt; towards &lt;i&gt;p1&lt;/i&gt;. The result set states that 4 is known by 2 directly and, because 2 is known by 1, transitively by 1 as well.&lt;/p&gt;

&lt;p&gt;To see what would happen if &lt;i&gt;x&lt;/i&gt; knowing &lt;i&gt;y&lt;/i&gt; also meant &lt;i&gt;y&lt;/i&gt; knowing &lt;i&gt;x&lt;/i&gt;, one could write:&lt;/p&gt;

&lt;blockquote&gt;
 &lt;pre&gt;&lt;code&gt;SELECT * 
   FROM (SELECT 
            TRANSITIVE
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
               &amp;quot;p1&amp;quot;, 
            &amp;quot;p2&amp;quot; 
         FROM (SELECT 
                  &amp;quot;p1&amp;quot;, 
                  &amp;quot;p2&amp;quot; 
               FROM &amp;quot;knows&amp;quot; 
               UNION ALL 
                  SELECT 
                     &amp;quot;p2&amp;quot;, 
                     &amp;quot;p1&amp;quot; 
                  FROM &amp;quot;knows&amp;quot;
              ) &amp;quot;k2&amp;quot;
        ) &amp;quot;k&amp;quot; 
   WHERE &amp;quot;k&amp;quot;.&amp;quot;p2&amp;quot; = 4;&lt;/code&gt;
 &lt;/pre&gt;

&lt;table width=&quot;100&quot;&gt;
&lt;tr&gt;
    &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;p1&lt;/th&gt;
    &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;p2&lt;/th&gt;
  &lt;/tr&gt;
&lt;tr&gt;
    &lt;td align=&quot;center&quot;&gt;2&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;4&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr&gt;
    &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;4&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr&gt;
    &lt;td align=&quot;center&quot;&gt;3&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;4&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/blockquote&gt;


&lt;p&gt;Now, since we know that 1 and 4 are related, we can ask how they are related.&lt;/p&gt;
&lt;blockquote&gt;
 &lt;pre&gt;&lt;code&gt;SELECT * 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
               &amp;quot;p1&amp;quot;, 
            &amp;quot;p2&amp;quot;, 
            T_STEP (1) AS &amp;quot;via&amp;quot;, 
            T_STEP (&amp;#39;step_no&amp;#39;) AS &amp;quot;step&amp;quot;, 
            T_STEP (&amp;#39;path_id&amp;#39;) AS &amp;quot;path&amp;quot; 
         FROM &amp;quot;knows&amp;quot;
        ) &amp;quot;k&amp;quot; 
   WHERE &amp;quot;p1&amp;quot; = 1 
      AND &amp;quot;p2&amp;quot; = 4;&lt;/code&gt;
 &lt;/pre&gt;

&lt;table width=&quot;250&quot;&gt;
&lt;tr&gt;
    &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;p1&lt;/th&gt;
    &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;p2&lt;/th&gt;
    &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;via&lt;/th&gt;
    &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;step&lt;/th&gt;
    &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;path&lt;/th&gt;
  &lt;/tr&gt;
&lt;tr&gt;
    &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;4&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;0&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;0&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr&gt;
    &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;4&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;2&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;0&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr&gt;
    &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;4&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;4&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;2&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;0&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/blockquote&gt;


&lt;p&gt;The first two columns are the ends of the path.  The next column is the person that is a step on the path.  The next one is the number of the step, counting from 0, so that the end of the path that corresponds to the end condition on the column designated as input, i.e., &lt;i&gt;p1&lt;/i&gt;, has number 0.  Since there can be multiple solutions, the last column is a sequence number that distinguishes alternative paths from each other.&lt;/p&gt;

&lt;p&gt;For LinkedIn users: the query that lists friends ordered by distance and then by descending friend count, which is the basis of most LinkedIn search result views, can be written as:&lt;/p&gt;

&lt;blockquote&gt;
 &lt;pre&gt;&lt;code&gt;SELECT &amp;quot;p2&amp;quot;, 
      &amp;quot;dist&amp;quot;, 
      (SELECT 
          COUNT (*) 
          FROM &amp;quot;knows&amp;quot; &amp;quot;c&amp;quot; 
          WHERE &amp;quot;c&amp;quot;.&amp;quot;p1&amp;quot; = &amp;quot;k&amp;quot;.&amp;quot;p2&amp;quot;
      ) 
   FROM (SELECT 
            TRANSITIVE 
               T_IN (1) 
               T_OUT (2) 
               T_DISTINCT
               &amp;quot;p1&amp;quot;, 
            &amp;quot;p2&amp;quot;, 
            T_STEP (&amp;#39;step_no&amp;#39;) AS &amp;quot;dist&amp;quot;
         FROM &amp;quot;knows&amp;quot;
        ) &amp;quot;k&amp;quot; 
   WHERE &amp;quot;p1&amp;quot; = 1 
   ORDER BY &amp;quot;dist&amp;quot;, 3 DESC;&lt;/code&gt;
 &lt;/pre&gt;


&lt;table width=&quot;150&quot;&gt;
&lt;tr&gt;
    &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;p2&lt;/th&gt;
    &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;dist&lt;/th&gt;
    &lt;th align=&quot;center&quot; width=&quot;50&quot;&gt;aggregate&lt;/th&gt;
  &lt;/tr&gt;
&lt;tr&gt;
    &lt;td align=&quot;center&quot;&gt;2&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr&gt;
    &lt;td align=&quot;center&quot;&gt;3&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;1&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;0&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr&gt;
    &lt;td align=&quot;center&quot;&gt;4&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;2&lt;/td&gt;
    &lt;td align=&quot;center&quot;&gt;0&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/blockquote&gt;


&lt;h2&gt;How?&lt;/h2&gt;

&lt;p&gt;The queries shown above work on Virtuoso v6.  When running in cluster mode, several thousand graph traversal steps may be proceeding at the same time, meaning that all database access is parallelized and that the algorithm is internally latency-tolerant.  By default, all results are produced in a deterministic order, permitting predictable slicing of result sets.&lt;/p&gt;

&lt;p&gt;Furthermore, for queries where both ends of a path are given, the optimizer may decide to attack the path from both ends simultaneously. So, supposing that every member of a social network has an average of 30 contacts, and we need to find a path between two users that are no more than 6 steps apart, we begin at both ends, expanding each up to 3 levels, and we stop when we find the first intersection.  Thus, we reach 2 * 30^3 = 54,000 nodes, and not 30^6 = 729,000,000 nodes.&lt;/p&gt;
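&lt;p&gt;The arithmetic and the meet-in-the-middle idea can be sketched in Python.  This is only an illustration of the general technique, not of how Virtuoso implements it internally; the function name &lt;code&gt;bidirectional_hops&lt;/code&gt; and the in-memory adjacency map are assumptions made for the example, and the graph is assumed to be symmetric.&lt;/p&gt;
&lt;blockquote&gt;&lt;pre&gt;
```python
# Node counts from the text: 30 contacts per member, path of at most 6 steps.
two_sided = 2 * 30 ** 3   # expand 3 levels from each end
one_sided = 30 ** 6       # expand 6 levels from one end


def bidirectional_hops(graph, start, goal):
    """Return the hop count between start and goal, or None if unconnected.

    graph maps each node to an iterable of neighbours (assumed symmetric).
    """
    if start == goal:
        return 0
    # The distance maps double as visited sets for each direction.
    dist_a, dist_b = {start: 0}, {goal: 0}
    frontier_a, frontier_b = [start], [goal]
    while frontier_a and frontier_b:
        # Always grow the smaller frontier so both sides stay balanced.
        if len(frontier_a) > len(frontier_b):
            frontier_a, frontier_b = frontier_b, frontier_a
            dist_a, dist_b = dist_b, dist_a
        next_frontier = []
        best = None
        for node in frontier_a:
            for nb in graph.get(node, ()):
                if nb in dist_b:
                    # The two searches meet on the edge node -- nb.
                    total = dist_a[node] + 1 + dist_b[nb]
                    if best is None or best > total:
                        best = total
                if nb not in dist_a:
                    dist_a[nb] = dist_a[node] + 1
                    next_frontier.append(nb)
        if best is not None:
            return best
        frontier_a = next_frontier
    return None
```
&lt;/pre&gt;&lt;/blockquote&gt;
&lt;p&gt;The whole level is scanned before returning, so the shortest of the meeting points found at that level is reported.&lt;/p&gt;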

&lt;p&gt;Writing a generic database-driven graph traversal framework on the application side, say in Java over &lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id0xa8a9ef8&quot;&gt;JDBC&lt;/a&gt;, would easily take over a thousand lines. This is much more work than can be justified for a one-off, ad-hoc query.  Besides, the traversal order in such a case could not be optimized by the DBMS.&lt;/p&gt;

&lt;h2&gt;Next&lt;/h2&gt; 

&lt;p&gt;In a future &lt;a href=&quot;http://dbpedia.org/resource/Blog&quot; id=&quot;link-id0xb526a40&quot;&gt;blog&lt;/a&gt; post I will show how this feature can be used for common graph tasks like critical path, itinerary planning, traveling salesman, the 8 queens chess problem, etc.  There are lots of switches for controlling different parameters of the traversal.  This is just the beginning.  I will also give examples of the use of this in SPARQL.&lt;/p&gt;
&lt;/div&gt;</description></item><item><title>Configuring Virtuoso for Benchmarking</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-08-25#1419</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1419#comments</comments><pubDate>Mon, 25 Aug 2008 14:06:11 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-08-25T15:29:06.000036-04:00</n0:modified><description>&lt;div&gt;
&lt;div style=&quot;display:none;&quot;&gt;Configuring Virtuoso for Benchmarking&lt;/div&gt;
&lt;p&gt;I will here summarize what should be known about running benchmarks with &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0xc152cf0&quot;&gt;Virtuoso&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;Physical Memory&lt;/h2&gt;

&lt;p&gt;For 8G RAM, in the &lt;code&gt;[Parameters]&lt;/code&gt; stanza of &lt;code&gt;virtuoso.ini&lt;/code&gt;, set:&lt;/p&gt;

&lt;blockquote&gt;
&lt;code&gt;
[Parameters]&lt;br /&gt;
...&lt;br /&gt;
NumberOfBuffers = 550000
&lt;/code&gt;
&lt;/blockquote&gt; 
&lt;p&gt;For 16G RAM, double this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;code&gt;
[Parameters]&lt;br /&gt;
...&lt;br /&gt;
NumberOfBuffers = 1100000
&lt;/code&gt;
&lt;/blockquote&gt; 
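&lt;p&gt;The two settings above follow a simple proportion, which can be written as a small Python helper.  Applying the same ratio to other memory sizes is an assumption made for this sketch, not a documented rule, and the names &lt;code&gt;BUFFERS_PER_GB&lt;/code&gt; and &lt;code&gt;number_of_buffers&lt;/code&gt; are hypothetical.&lt;/p&gt;
&lt;blockquote&gt;&lt;pre&gt;
```python
# Ratio derived from the 8G example above: 550000 buffers for 8G of RAM.
BUFFERS_PER_GB = 550000 // 8


def number_of_buffers(ram_gb):
    """Scale NumberOfBuffers linearly with RAM (an assumption, see text)."""
    return BUFFERS_PER_GB * ram_gb
```
&lt;/pre&gt;&lt;/blockquote&gt;
&lt;p&gt;With this helper, &lt;code&gt;number_of_buffers(8)&lt;/code&gt; and &lt;code&gt;number_of_buffers(16)&lt;/code&gt; reproduce the two values given above.&lt;/p&gt;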

&lt;h2&gt;Transaction Isolation&lt;/h2&gt;
&lt;p&gt;For most cases, certainly all &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0xb7ba270&quot;&gt;RDF&lt;/a&gt; cases, &lt;i&gt;Read Committed&lt;/i&gt; should be the default transaction isolation.  In the &lt;code&gt;[Parameters]&lt;/code&gt; stanza of &lt;code&gt;virtuoso.ini&lt;/code&gt;, set:&lt;/p&gt; 
&lt;blockquote&gt;
&lt;code&gt;
[Parameters]&lt;br /&gt;
...&lt;br /&gt;
DefaultIsolation = 2 
&lt;/code&gt;
&lt;/blockquote&gt; 

&lt;h2&gt;Multiuser Workload&lt;/h2&gt;

&lt;p&gt;If &lt;a href=&quot;http://dbpedia.org/resource/Open_Database_Connectivity&quot; id=&quot;link-id0x1a40f308&quot;&gt;ODBC&lt;/a&gt;, &lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id0x1e003cf8&quot;&gt;JDBC&lt;/a&gt;, or similarly connected client applications are used, there must be more &lt;code&gt;ServerThreads&lt;/code&gt; available than there will be client connections.  In the &lt;code&gt;[Parameters]&lt;/code&gt; stanza of &lt;code&gt;virtuoso.ini&lt;/code&gt;, set:&lt;/p&gt; 
&lt;blockquote&gt;
&lt;code&gt; 
[Parameters]&lt;br /&gt;
...&lt;br /&gt;
ServerThreads = 100
&lt;/code&gt;
&lt;/blockquote&gt; 

&lt;p&gt;With web clients (unlike ODBC, JDBC, or similar clients), it may be justified to have fewer &lt;code&gt;ServerThreads&lt;/code&gt; than there are concurrent clients.  The &lt;code&gt;MaxKeepAlives&lt;/code&gt; should be the maximum number of expected web clients.  This can be more than the &lt;code&gt;ServerThreads&lt;/code&gt; count.  In the &lt;code&gt;[HTTPServer]&lt;/code&gt; stanza of &lt;code&gt;virtuoso.ini&lt;/code&gt;, set:&lt;/p&gt; 
&lt;blockquote&gt;
&lt;code&gt; 
[HTTPServer]&lt;br /&gt;
...&lt;br /&gt;
ServerThreads    = 100 &lt;br /&gt;
MaxKeepAlives    = 1000 &lt;br /&gt;
KeepAliveTimeout = 10
&lt;/code&gt;
&lt;/blockquote&gt; 

&lt;p&gt;
&lt;i&gt;&lt;b&gt;Note:&lt;/b&gt; The &lt;code&gt;[HTTPServer] ServerThreads&lt;/code&gt; are taken from the total pool made available by the &lt;code&gt;[Parameters] ServerThreads&lt;/code&gt;.  Thus, the &lt;code&gt;[Parameters] ServerThreads&lt;/code&gt; should always be at least as large as (and is best set greater than) the &lt;code&gt;[HTTPServer] ServerThreads&lt;/code&gt;, and if using the closed-source Commercial Version, should not exceed the licensed thread count.&lt;/i&gt;
&lt;/p&gt; 
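&lt;p&gt;The relationship between the two &lt;code&gt;ServerThreads&lt;/code&gt; settings can be checked before starting a benchmark run.  The following Python sketch uses the standard &lt;code&gt;configparser&lt;/code&gt; module on a simplified sample; the helper name &lt;code&gt;http_threads_fit&lt;/code&gt; and the sample text are assumptions, and a real &lt;code&gt;virtuoso.ini&lt;/code&gt; may contain constructs that this parser does not accept as-is.&lt;/p&gt;
&lt;blockquote&gt;&lt;pre&gt;
```python
import configparser

# Simplified sample holding only the settings discussed above.
SAMPLE_INI = """
[Parameters]
ServerThreads = 100

[HTTPServer]
ServerThreads = 100
MaxKeepAlives = 1000
KeepAliveTimeout = 10
"""


def http_threads_fit(ini_text):
    """True if [Parameters] ServerThreads covers [HTTPServer] ServerThreads."""
    cfg = configparser.ConfigParser()
    cfg.read_string(ini_text)
    total = cfg.getint("Parameters", "ServerThreads")
    http = cfg.getint("HTTPServer", "ServerThreads")
    return total >= http
```
&lt;/pre&gt;&lt;/blockquote&gt;
&lt;p&gt;The check covers only the &amp;quot;at least as large as&amp;quot; rule; the licensed thread count of the Commercial Version is not visible in the file and has to be checked separately.&lt;/p&gt;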

&lt;h2&gt;Disk Use&lt;/h2&gt;

&lt;p&gt;The basic rule is to use one stripe (file) per distinct physical device (not per file system), using no RAID.  For example, one might stripe a database over 6 files (6 physical disks), with an initial size of 60000 pages (the files will grow as needed).  &lt;/p&gt;

&lt;p&gt;For the example described above, in the &lt;code&gt;[Database]&lt;/code&gt; stanza of &lt;code&gt;virtuoso.ini&lt;/code&gt;, set:&lt;/p&gt; 
&lt;blockquote&gt;
&lt;code&gt;
[Database]&lt;br /&gt;
...&lt;br /&gt;
Striping = 1&lt;br /&gt;
MaxCheckpointRemap 	= 2000000 
&lt;/code&gt;
&lt;/blockquote&gt; 

&lt;p&gt;â and in the &lt;code&gt;[Striping]&lt;/code&gt; stanza, on one line per &lt;code&gt;SegmentName&lt;/code&gt;, set â&lt;/p&gt; 
&lt;blockquote&gt;
&lt;code&gt;
[Striping]&lt;br /&gt;
...&lt;br /&gt;
Segment1 = 60000 , /virtdev/db/virt-seg1.db = q1 , /data1/db/virt-seg1-str2.db = q2 , /data2/db/virt-seg1-str3.db = q3 , /data3/db/virt-seg1-str4.db = q4 , /data4/db/virt-seg1-str5.db = q5 , /data5/db/virt-seg1-str6.db = q6&lt;/code&gt;
&lt;/blockquote&gt; 

&lt;p&gt;As can be seen here, each file gets a background IO thread (the &lt;code&gt;= q&lt;i&gt;xxx&lt;/i&gt;&lt;/code&gt; clause).  It should be noted that all files on the same physical device should have the same &lt;code&gt;q&lt;i&gt;xxx&lt;/i&gt;&lt;/code&gt; value.  This is not directly relevant to the benchmarking scenario above, because we have only one file per device, and thus only one file per IO queue.&lt;/p&gt;

&lt;h2&gt;
&lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0xc8b97c0&quot;&gt;SQL&lt;/a&gt; Optimization&lt;/h2&gt;

&lt;p&gt;If queries have lots of joins but access little &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x193b2fa8&quot;&gt;data&lt;/a&gt;, as with the &lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id0x1b283ca0&quot;&gt;Berlin SPARQL Benchmark&lt;/a&gt;, the SQL compiler must be told not to look for better plans if the best plan so far is quicker than the compilation time expended so far.  Thus, in the &lt;code&gt;[Parameters]&lt;/code&gt; stanza of &lt;code&gt;virtuoso.ini&lt;/code&gt;, set:&lt;/p&gt; 
&lt;blockquote&gt;
&lt;code&gt;
[Parameters]&lt;br /&gt;
...&lt;br /&gt;
StopCompilerWhenXOverRunTime = 1
&lt;/code&gt;
&lt;/blockquote&gt; 
&lt;/div&gt;</description></item><item><title>BSBM With Triples and Mapped Relational Data</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-08-06#1410</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1410#comments</comments><pubDate>Wed, 06 Aug 2008 19:41:50 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-08-06T16:29:44.000003-04:00</n0:modified><description>&lt;div&gt;
&lt;div style=&quot;display:none;&quot;&gt;BSBM With Triples and Mapped Relational Data&lt;/div&gt;
&lt;p&gt;The special contribution of the &lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id10039db0&quot;&gt;Berlin SPARQL Benchmark&lt;/a&gt; (&lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id106b2538&quot;&gt;BSBM&lt;/a&gt;) to the &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id101a75f8&quot;&gt;RDF&lt;/a&gt; world is to raise the question of doing OLTP with &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0xae54170&quot;&gt;RDF&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Of course, here we immediately hit the question of comparisons with relational databases.  To this effect, &lt;a href=&quot;http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html&quot; id=&quot;link-id0x1e847b08&quot;&gt;BSBM&lt;/a&gt; also specifies a relational schema and can generate the &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id1206c378&quot;&gt;data&lt;/a&gt; as either triples or &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id1667f040&quot;&gt;SQL&lt;/a&gt; inserts.&lt;/p&gt;

&lt;p&gt;The benchmark effectively simulates the case of exposing an existing &lt;a href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id10a93518&quot;&gt;RDBMS&lt;/a&gt; as RDF.  &lt;a href=&quot;http://www.openlinksw.com/dataspace/organization/openlink#this&quot; id=&quot;link-id13e46d80&quot;&gt;OpenLink Software&lt;/a&gt; calls this &lt;i&gt;RDF Views&lt;/i&gt;.  &lt;a href=&quot;http://dbpedia.org/resource/Oracle_Database&quot; id=&quot;link-id12027578&quot;&gt;Oracle&lt;/a&gt; is beginning to call this &lt;i&gt;semantic covers&lt;/i&gt;.  The &lt;a href=&quot;http://www.w3.org/2005/Incubator/rdb2rdf/&quot; id=&quot;link-id161dc678&quot;&gt;RDB2RDF XG&lt;/a&gt;, a W3C incubator group, has been active in this area since Spring, 2008.&lt;/p&gt;

&lt;h3&gt;But why an OLTP workload with RDF to begin with?&lt;/h3&gt;

&lt;p&gt;We believe this is relevant because RDF promises to be the interoperability factor between potentially all of traditional IS.  If &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x1e7119d8&quot;&gt;data&lt;/a&gt; is online for human consumption, it may be online via a &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id106a8908&quot;&gt;SPARQL&lt;/a&gt; end-point as well.  The economic justification will come from discoverability and from applications integrating multi-source structured data.  Online shopping is a fine use case.&lt;/p&gt;

&lt;p&gt;Warehousing all the world&amp;#39;s publishable data as RDF is not our first preference, nor would it be the publisher&amp;#39;s.  Considerations of duplicate infrastructure and maintenance are reason enough.  Consequently, we need to show that mapping can outperform an RDF warehouse, which is what we&amp;#39;ll do here.&lt;/p&gt;

&lt;h3&gt;What We Got &lt;/h3&gt;

&lt;p&gt;First, we found that &lt;a href=&quot;http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1400&quot; id=&quot;link-id150ea748&quot;&gt;making the query plan took much too long&lt;/a&gt; in proportion to the run time.  With BSBM this is an issue because the queries have lots of joins but access relatively little data.  So we made a faster compiler and along the way retouched the cost model a bit.&lt;/p&gt;

&lt;p&gt;But the really interesting part with BSBM is mapping relational data to RDF.  For us, BSBM is a great way of showing that mapping can outperform even the best triple store.  A relational row store is as good as unbeatable with the query mix.  And when there is a clear mapping, there is no reason the &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0xae5aff0&quot;&gt;SPARQL&lt;/a&gt; could not be directly translated.&lt;/p&gt;

&lt;p&gt;If Chris Bizer et al launched the mapping ship, we will be the ones to pilot it to harbor!&lt;/p&gt;

&lt;p&gt;We filled two &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id12dbdc70&quot;&gt;Virtuoso&lt;/a&gt; instances with a BSBM200000 data set, for 100M triples.  One was filled with physical triples; the other was filled with the equivalent relational data plus mapping to triples.  Performance figures are given in &amp;quot;query mixes per hour&amp;quot;.  (An update or follow-on to this post will provide elapsed times for each test run.)&lt;/p&gt;

&lt;p&gt;With the unmodified benchmark we got:&lt;/p&gt;
&lt;blockquote&gt;
&lt;table&gt;
&lt;tr&gt;
   &lt;td&gt;&lt;i&gt;Physical Triples:&lt;/i&gt;
   &lt;/td&gt;
    &lt;td&gt;&amp;#160; &amp;#160;&lt;/td&gt;
    &lt;td&gt;1297 qmph&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr&gt;
   &lt;td&gt;&lt;i&gt;Mapped Triples:&lt;/i&gt;
   &lt;/td&gt;
    &lt;td&gt;&amp;#160; &amp;#160;&lt;/td&gt;
   &lt;td&gt;&lt;b&gt;3144 qmph&lt;/b&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/blockquote&gt;
&lt;p&gt;In both cases, most of the time was spent on Q6, which looks for products with one of three words in the label.  We altered Q6 to use a text index for the mapping, and altered the databases accordingly. (There is no such thing as an e-commerce site without a text index, so we are amply justified in making this change.)&lt;/p&gt;

&lt;p&gt;The following were measured on the second run of a 100 query mix series, single test driver, warm cache.&lt;/p&gt;
&lt;blockquote&gt;
&lt;table&gt;
&lt;tr&gt;
   &lt;td&gt;&lt;i&gt;Physical Triples:&lt;/i&gt;
   &lt;/td&gt;
    &lt;td&gt;&amp;#160; &amp;#160;&lt;/td&gt;
    &lt;td&gt; 5746 qmph&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr&gt;
   &lt;td&gt;&lt;i&gt;Mapped Triples:&lt;/i&gt;
   &lt;/td&gt;
    &lt;td&gt;&amp;#160; &amp;#160;&lt;/td&gt;
   &lt;td&gt; &lt;b&gt;7525 qmph&lt;/b&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/blockquote&gt;
&lt;p&gt;We then ran the same with 4 concurrent instances of the test driver. The qmph here is the total of 400 query mixes divided by the longest run time among the drivers, expressed in hours.&lt;/p&gt;
&lt;blockquote&gt;
&lt;table&gt;
&lt;tr&gt;
   &lt;td&gt;&lt;i&gt;Physical Triples:&lt;/i&gt;
   &lt;/td&gt;
    &lt;td&gt;&amp;#160; &amp;#160;&lt;/td&gt;
    &lt;td&gt; 19459 qmph&lt;/td&gt;
  &lt;/tr&gt;
&lt;tr&gt;
   &lt;td&gt;&lt;i&gt;Mapped Triples:&lt;/i&gt;
   &lt;/td&gt;
    &lt;td&gt;&amp;#160; &amp;#160;&lt;/td&gt;
   &lt;td&gt; &lt;b&gt;24531 qmph&lt;/b&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/blockquote&gt;

&lt;p&gt;The system used was 64-bit Linux, 2GHz dual-Xeon 5130 (8 cores) with 8G RAM.  The concurrent throughputs are a little under 4 times the single thread throughput, which is normal for SMP due to memory contention.  The numbers do not evidence significant overhead from thread synchronization.&lt;/p&gt;
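&lt;p&gt;For readers who want to reproduce the metric, the calculation can be sketched in Python.  The function names are ours, and the per-driver run times in the usage note are hypothetical, since elapsed times are not given in this post; only the qmph figures are taken from the tables above.&lt;/p&gt;
&lt;blockquote&gt;&lt;pre&gt;
```python
def qmph(query_mixes, elapsed_seconds):
    """Query mixes per hour for a run of the given length."""
    return query_mixes * 3600.0 / elapsed_seconds


def concurrent_qmph(mixes_per_driver, driver_seconds):
    """Throughput of concurrent drivers, timed by the slowest one."""
    total_mixes = mixes_per_driver * len(driver_seconds)
    return qmph(total_mixes, max(driver_seconds))


# Scaling from 1 to 4 test drivers, using the reported qmph figures.
physical_scaling = 19459 / 5746   # about 3.39
mapped_scaling = 24531 / 7525     # about 3.26
```
&lt;/pre&gt;&lt;/blockquote&gt;
&lt;p&gt;As a made-up example, four drivers of 100 mixes each that finish in 60, 72, 70, and 65 seconds would give &lt;code&gt;concurrent_qmph(100, [60, 72, 70, 65])&lt;/code&gt;, that is, 400 mixes over 72 seconds.&lt;/p&gt;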

&lt;p&gt;The query compilation represents about 1/3 of total server side CPU. In an actual online application of this type, queries would be parameterized, so the throughputs would be accordingly higher.  We used the &lt;code&gt;StopCompilerWhenXOverRunTime = 1&lt;/code&gt; option here to cut needless compiler overhead, the queries being straightforward enough.&lt;/p&gt;

&lt;p&gt;We also see that the advantage of mapping can be further increased by more compiler optimizations, so we expect in the end mapping will lead RDF warehousing by a factor of 4 or so.&lt;/p&gt;

&lt;h3&gt;Suggestions for BSBM&lt;/h3&gt;

&lt;ul&gt;
 &lt;li&gt;
  &lt;p&gt;
    &lt;b&gt;Reporting Rules.&lt;/b&gt; The benchmark spec should specify a form for disclosure of test run data, TPC style.  This includes things like configuration parameters and exact text of queries.  There should be accepted variants of query text, as with the TPC.&lt;/p&gt;
 &lt;/li&gt;

&lt;li&gt;
  &lt;p&gt;
    &lt;b&gt;Multiuser operation.&lt;/b&gt;  The test driver should get a stream number as parameter, so that each client makes a different query sequence. Also, disk performance in this type of benchmark can only be reasonably assessed with a naturally parallel multiuser workload.&lt;/p&gt;
&lt;/li&gt;

&lt;li&gt;
  &lt;p&gt;
    &lt;b&gt;Add business intelligence.&lt;/b&gt;  SPARQL has aggregates now, at least with &lt;a href=&quot;http://jena.sourceforge.net/&quot; id=&quot;link-id11a25ac0&quot;&gt;Jena&lt;/a&gt; and &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0xb003180&quot;&gt;Virtuoso&lt;/a&gt;, so let&amp;#39;s use these.  The BSBM business intelligence metric should be a separate metric off the same data.  Adding synthetic sales figures would make more interesting queries possible.  For example, producing recommendations like &amp;quot;customers who bought this also bought xxx.&amp;quot;&lt;/p&gt;
&lt;/li&gt;

&lt;li&gt;
  &lt;p&gt;
    &lt;b&gt;For the SPARQL community&lt;/b&gt;, BSBM sends the message that one ought to support parameterized queries and stored procedures.  This would be a &lt;a href=&quot;http://www.w3.org/TR/rdf-sparql-protocol/&quot; id=&quot;link-id109e2448&quot;&gt;SPARQL protocol&lt;/a&gt; extension; the SPARUL syntax should also have a way of calling a procedure.  Something like &lt;code&gt;select proc (??, ??)&lt;/code&gt; would be enough, where &lt;code&gt;??&lt;/code&gt; is a parameter marker, like &lt;code&gt;?&lt;/code&gt; in &lt;a href=&quot;http://dbpedia.org/resource/Open_Database_Connectivity&quot; id=&quot;link-id13febf48&quot;&gt;ODBC&lt;/a&gt;/&lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id120416a8&quot;&gt;JDBC&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;

&lt;li&gt;
  &lt;p&gt;
    &lt;b&gt;Add transactions.&lt;/b&gt; Especially if we are contrasting mapping vs. storing triples, having an update flow is relevant.  In practice, this could be done by having the test driver send web service requests for order entry and the SUT could implement these as updates to the triples or a mapped relational store.  This could use stored procedures or logic in an app server.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Comments on Query Mix&lt;/h3&gt;

&lt;p&gt;The time of most queries is less than linear to the scale factor.  Q6 is an exception if it is not implemented using a text index.  Without the text index, Q6 will inevitably come to dominate query time as the scale is increased, and thus will make the benchmark less relevant at larger scales.&lt;/p&gt;

&lt;h2&gt;Next&lt;/h2&gt;

&lt;p&gt;We include the sources of our RDF view definitions and other material for running BSBM with our forthcoming Virtuoso Open Source 5.0.8 release.  This also includes all the query optimization work done for BSBM.  This will be available in the coming days.&lt;/p&gt;
&lt;/div&gt;</description></item><item><title>The DARQ Matter of Federation</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-06-09#1381</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1381#comments</comments><pubDate>Mon, 09 Jun 2008 14:02:19 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-06-11T15:15:14-04:00</n0:modified><description>&lt;div&gt;
&lt;div style=&quot;display:none;&quot;&gt;The DARQ Matter of Federation&lt;/div&gt;
&lt;p&gt;Astronomers propose that the universe is held together, so to speak, by the gravity of invisible &amp;quot;dark matter&amp;quot; spread in interstellar and intergalactic space.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0x19dbf410&quot;&gt;data&lt;/a&gt; web will likewise be held together by federation, also an invisible factor. As in Minkowski space, so in &lt;a href=&quot;http://dbpedia.org/resource/Cyberspace&quot; id=&quot;link-id0x9fc13ff8&quot;&gt;cyberspace&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To take the astronomical analogy further, putting too much visible stuff in one place makes a black hole, whose chief properties are that it is very heavy, can only get heavier and that nothing comes out.&lt;/p&gt;
&lt;p&gt;
  &lt;a href=&quot;http://darq.sourceforge.net/&quot; id=&quot;link-id0x1d06bd88&quot;&gt;DARQ&lt;/a&gt; is Bastian Quilitz&amp;#39;s federated extension of the &lt;a href=&quot;http://jena.sourceforge.net/&quot; id=&quot;link-id0x1cf28f70&quot;&gt;Jena&lt;/a&gt; &lt;a href=&quot;http://jena.sourceforge.net/ARQ/&quot; id=&quot;link-id0x1cba22c8&quot;&gt;ARQ&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x171c7dc8&quot;&gt;SPARQL&lt;/a&gt; processor. It has existed for a while and was also presented at &lt;a href=&quot;http://www.eswc2008.org/&quot; id=&quot;link-id0x1ed53cd0&quot;&gt;ESWC2008&lt;/a&gt;. There is also SPARQL FED from Andy Seaborne, an explicit means of specifying which end point will process which fragment of a distributed SPARQL query. Still, for federation to deliver in an open, decentralized world, it must be transparent. For a specific application, with a predictable workload, it is of course OK to partition queries explicitly.&lt;/p&gt;
&lt;p&gt;Bastian had split &lt;a href=&quot;http://dbpedia.org/resource/DBpedia&quot; id=&quot;link-id0x1ce846c0&quot;&gt;DBpedia&lt;/a&gt; among five &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x1cad0640&quot;&gt;Virtuoso&lt;/a&gt; servers and was querying this set with DARQ. The end result was that there was a rather frightful cost of federation as opposed to all the data residing in a single Virtuoso. The other result was that if selectivity of predicates was not correctly guessed by the federation engine, the proposition was a non-starter. With correct join order it worked, though.&lt;/p&gt;
&lt;p&gt;Yet, we really want federation. Looking further down the road, we simply must make federation work. This is just as necessary as running on a server cluster for mid-size workloads.&lt;/p&gt;
&lt;p&gt;Since we are convinced of the cause, let&amp;#39;s talk about the means.&lt;/p&gt;
&lt;p&gt;For DARQ as it now stands, there&amp;#39;s probably an order of magnitude or even more to gain from a couple of simple tricks. If going to a SPARQL end point that is not the outermost in the loop join sequence, batch the requests together in one &lt;a href=&quot;http://dbpedia.org/resource/Hypertext_Transfer_Protocol&quot; id=&quot;link-id0x19a48280&quot;&gt;HTTP&lt;/a&gt;/1.1 message. So, if the query is &amp;quot;get me my friends living in cities of over a million people,&amp;quot; there will be the fragment &amp;quot;get city where x lives&amp;quot; and later &amp;quot;ask if population of x greater than 1000000&amp;quot;. If I have 100 friends, I send the 100 requests in a batch to each eligible server.&lt;/p&gt;
&lt;p&gt;Further, if running against a server of known brand, use a client-server connection and prepared statements with array parameters. This can well improve the processing speed at the remote end point by another order of magnitude. This gain may however not be as great as the latency savings from message batching. We will provide a sample of how to do this with Virtuoso over &lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id0x1cf18278&quot;&gt;JDBC&lt;/a&gt; so Bastian can try this if interested.&lt;/p&gt;
&lt;p&gt;These simple things will give a lot of mileage and may even decide whether federation is an option in specific applications. For the open web however, these measures will not yet win the day.&lt;/p&gt;
&lt;p&gt;When federating &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x1cf7d0e8&quot;&gt;SQL&lt;/a&gt;, colocation of data is sort of explicit. If two tables are joined and they are in the same source, then the join can go to the source. For SPARQL this is also so but with a twist:&lt;/p&gt;
&lt;p&gt;If a foaf:Person is found on a given server, this does not mean that the Person&amp;#39;s geek code or email hash will be on the same server. Thus &lt;code&gt;{?p name &amp;quot;Johnny&amp;quot; . ?p geekCode ?g . ?p emailHash ?h }&lt;/code&gt; does not necessarily denote a colocated join if many servers serve items of the vocabulary.&lt;/p&gt;
&lt;p&gt;However, in most practical cases, for obtaining a rapid answer, treating this as a colocated fragment will be appropriate. Thus, it may be necessary to be able to declare that geek codes will be assumed colocated with names. This will save a lot of message passing and offer decent, if not theoretically total recall. For search style applications, starting with such assumptions will make sense. If nothing is found, then we can partition each join step separately for the unlikely case that there were a server that gave geek codes but not names.&lt;/p&gt;
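&lt;p&gt;The colocation assumption and its fallback can be illustrated with a few lines of Python.  This is a sketch of the idea only, not of DARQ or Virtuoso code; the tuple representation of triple patterns and both function names are assumptions made for the example.&lt;/p&gt;
&lt;blockquote&gt;&lt;pre&gt;
```python
from collections import defaultdict


def colocated_fragments(patterns):
    """Group triple patterns by subject, assuming colocation per subject."""
    groups = defaultdict(list)
    for s, p, o in patterns:
        groups[s].append((s, p, o))
    return dict(groups)


def separate_fragments(patterns):
    """Fallback: every triple pattern is sent as its own fragment."""
    return [[pattern] for pattern in patterns]


# The example pattern from the text, written as (subject, predicate, object).
johnny = [
    ("?p", "name", '"Johnny"'),
    ("?p", "geekCode", "?g"),
    ("?p", "emailHash", "?h"),
]
```
&lt;/pre&gt;&lt;/blockquote&gt;
&lt;p&gt;Under the colocation assumption the three patterns form a single fragment that goes to each eligible server in one message; the fallback produces three fragments, each of which must be routed and joined separately.&lt;/p&gt;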
&lt;p&gt;For Virtuoso, we find that a federated query&amp;#39;s asynchronous, parallel evaluation model is not so different from that on a local cluster. So the cluster version could have the option of federated query. The difference is that a cluster is local and tightly coupled and predictably partitioned but a federated setting is none of these.&lt;/p&gt;
&lt;p&gt;For description, we would take DARQ&amp;#39;s description model and maybe extend it a little where needed. Also we would enhance the protocol to allow just asking for the query cost estimate given a query with literals specified. We will do this eventually.&lt;/p&gt;
&lt;p&gt;We would like to talk to Bastian about large improvements to DARQ, especially when working with Virtuoso. We&amp;#39;ll see.&lt;/p&gt;
&lt;p&gt;Of course, one mode of federating is the crawl-as-you-go approach of the Virtuoso &lt;a href=&quot;http://virtuoso.openlinksw.com/Whitepapers/html/VirtSpongerWhitePaper.html&quot; id=&quot;link-id0x1e163140&quot;&gt;Sponger&lt;/a&gt;. This will bring in fragments following seeAlso or sameAs declarations or other references. This will however not have the recall of a warehouse or federation over well described SPARQL end-points. But up to a certain volume it has the speed of local storage.&lt;/p&gt;
&lt;p&gt;The emergence of voiD (Vocabulary of Interlinked Data) is a step in the direction of making federation a reality. There is &lt;a href=&quot;http://www.openlinksw.com/weblog/oerling/?id=1377&quot; id=&quot;link-id1109a4c8&quot;&gt;a separate post&lt;/a&gt; about this.&lt;/p&gt;
&lt;/div&gt;</description></item><item><title>WWW 2008</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-04-29#1348</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1348#comments</comments><pubDate>Tue, 29 Apr 2008 14:37:20 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-04-29T13:35:23-04:00</n0:modified><description>&lt;div&gt;
&lt;div style=&quot;display:none;&quot;&gt;WWW 2008&lt;/div&gt;
&lt;p&gt;Following my return from WWW 2008 in &lt;a href=&quot;http://www2008.org/&quot; id=&quot;link-id0x9ff7d5d0&quot;&gt;Beijing&lt;/a&gt;, I will write a series of &lt;a href=&quot;http://dbpedia.org/resource/Blog&quot; id=&quot;link-id0x9e4a7650&quot;&gt;blog&lt;/a&gt; posts discussing diverse topics that were brought up in presentations and conversations during the week.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://dbpedia.org/resource/Linked_Data&quot; id=&quot;link-id0x9e7ae398&quot;&gt;Linked data&lt;/a&gt; was our main interest in the conference and there was a one day workshop on this, unfortunately overlapping with a day of W3C Advisory Committee meetings.  Hence Tim Berners-Lee, one of the chairs of the workshop, could not attend for most of the day.  Still, he was present to say that &amp;quot;&lt;a href=&quot;http://community.linkeddata.org/dataspace/organization/lod#this&quot; id=&quot;link-id0xa287d38&quot;&gt;Linked open data&lt;/a&gt; is the &lt;a href=&quot;http://dbpedia.org/resource/Semantic_Web&quot; id=&quot;link-id0x15372940&quot;&gt;semantic web&lt;/a&gt; and the web done as it ought to be done.&amp;quot;&lt;/p&gt;
&lt;p&gt;For my part, I will draw some architecture conclusions from the different talks and extrapolate about the requirements on database platforms for linked data.&lt;/p&gt;
&lt;p&gt;Chris Bizer predicted that 2008 would be the year of &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0xa1454c58&quot;&gt;data&lt;/a&gt; web search, if 2007 was the year of &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0xa0f73c50&quot;&gt;SPARQL&lt;/a&gt;.  This may be the case, as linked data is now pretty much a reality and the questions of discovery become prevalent.  There was a birds-of-a-feather session on this and I will make some comments on what we intend to explore in bridging between the text index based semantic web search engines and SPARQL.&lt;/p&gt;
&lt;p&gt;Andy Seaborne convened a birds-of-a-feather session on the future of SPARQL.  Many of the already anticipated and implemented requirements were confirmed and a few were introduced.  A separate blog post will discuss these further.&lt;/p&gt;
&lt;p&gt;From the various discussions held throughout the conference, we conclude that plug-and-play operation with the major semantic web frameworks of Jena, Sesame, and Redland is our major immediate-term deliverable.  Our efforts in this direction thus far are insufficient, and we will next have these done with the right supervision and proper interop testing.  The issues are fortunately simple, but doing things totally right requires some small server side support and some &lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id0xa5d4d5b8&quot;&gt;JDBC&lt;/a&gt;/&lt;a href=&quot;http://dbpedia.org/resource/Open_Database_Connectivity&quot; id=&quot;link-id0x9dc28d10&quot;&gt;ODBC&lt;/a&gt; tweaks, so we advise those interested to wait for an update to be published on this blog.&lt;/p&gt;
&lt;p&gt;I further had a conversation with Andy Seaborne about using Jena reasoning capabilities with &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0xa2754050&quot;&gt;Virtuoso&lt;/a&gt; and generally the issues of &amp;quot;impedance mismatch&amp;quot; between reasoning and typical database workloads. More on this later.   
&lt;/p&gt;
&lt;/div&gt;</description></item><item><title>SPARQL End Point Self Description</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2006-11-21#1087</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1087#comments</comments><pubDate>Tue, 21 Nov 2006 14:22:53 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-04-16T16:53:46.000002-04:00</n0:modified><description>&lt;div&gt;
&lt;div style=&quot;display:none;&quot;&gt;SPARQL End Point Self Description&lt;/div&gt;
&lt;p&gt;
  &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0x18b5cde0&quot;&gt;SPARQL&lt;/a&gt; End Point Self Description&lt;/p&gt;
&lt;p&gt;I was at the ISWC 2006 conference a week back. One of the items discussed there, at least informally, was the topic of SPARQL end point discovery. I have below put together a summary of points that were discussed and of my own views on their possible resolution.&lt;/p&gt;
&lt;p&gt;This is intended as a start for conversation and as a summary of ideas.&lt;/p&gt;
&lt;h4&gt;Use Cases&lt;/h4&gt;
&lt;p&gt;Self-description of end points may serve at least the following purposes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Query composition - A client must know the capabilities of a server in order to compose suitable queries. &lt;a href=&quot;http://dbpedia.org/resource/Open_Database_Connectivity&quot; id=&quot;link-id0x171af558&quot;&gt;ODBC&lt;/a&gt; and &lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id0xdf3ec80&quot;&gt;JDBC&lt;/a&gt; have fairly extensive metadata about each DBMS&amp;#39;s &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x1b5a6ec8&quot;&gt;SQL&lt;/a&gt; dialect and other properties. These may in part serve as a model.&lt;/li&gt;
&lt;li&gt;Content Discovery - What is the &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0xc518ac8&quot;&gt;data&lt;/a&gt; about? What graphs does the end point contain?&lt;/li&gt;
&lt;li&gt;Query planning - When making an execution plan for federated queries, it is almost necessary to know the cardinalities of predicates and other things for evaluating join orders and the like.&lt;/li&gt;
&lt;li&gt;Query targeting - Does it make sense to send a particular query to this end point? The answer may contain things like whether the query could be parsed in the first place, whether it is known to be identically empty, estimated computation time, estimated count of results, optionally a platform dependent query plan.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We will look at each one in turn.&lt;/p&gt;
&lt;h4&gt;End Point Data and Capabilities&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Server software name and version&lt;/li&gt;
&lt;li&gt;Must the predicate be constant? Must an rdf:type be given for a subject? Must a graph be given? Can the graph be a variable known at execution time only?&lt;/li&gt;
&lt;li&gt;List of supported built-in SPARQL functions.&lt;/li&gt;
&lt;li&gt;Language extensions - For example, whether there is a full text match predicate.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Content Discovery&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Name and general description of the purpose of the end point.&lt;/li&gt;
&lt;li&gt;What organization/individual is maintaining the end point?&lt;/li&gt;
&lt;li&gt;Contact for technical support, legal or administrative matters. Support and webmaster.&lt;/li&gt;
&lt;li&gt;Ontologies used. This could be a list of graphs, each with a list of ontologies describing the data therein. Each graph would be listed with a rough estimate of size expressed in triples.&lt;/li&gt;
&lt;li&gt;Topic - Each graph/ontology pair could have a number of identifiers drawn from standard taxonomies. Examples would be the Latin names of genera and species for biology, the HS code for customs, ISO codes for countries, and various industry-specific classifications of goods and services.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Query Planning&lt;/h4&gt;
&lt;p&gt;The end point should give a ballpark cardinality for the following combinations of G, S, P, O.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;G&lt;/li&gt;
&lt;li&gt;G, P&lt;/li&gt;
&lt;li&gt;G, P, O&lt;/li&gt;
&lt;li&gt;G, S&lt;/li&gt;
&lt;li&gt;G, S, P&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Based on our experience, these are the most interesting questions, but for completeness, the end point might offer an API that accepts a constant or a wildcard for each of the four parts of a quad. If the &lt;a href=&quot;http://dbpedia.org/resource/Information&quot; id=&quot;link-id0x1891a2e0&quot;&gt;information&lt;/a&gt; is not readily available, &amp;quot;unknown&amp;quot; could be returned, together with the count of triples in the whole end point or the graph, if the graph is specified. Even if the end point does not support real time sampling of data for cardinality estimates, it would at least have an idea of the count of triples per graph, which is still far better than nothing.&lt;/p&gt;
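&lt;p&gt;As a rough illustration of such an API, the following is a minimal sketch only. The in-memory quad list, the function name &lt;code&gt;estimate&lt;/code&gt; and the shape of the return value are all assumptions made for the example and are not part of any existing end point. A value of &lt;code&gt;None&lt;/code&gt; stands for a wildcard, and combinations outside the five listed above fall back to &amp;quot;unknown&amp;quot; plus the triple count of the graph or of the whole end point.&lt;/p&gt;
&lt;pre&gt;
```python
# Hypothetical in-memory quad store: (graph, subject, predicate, object).
QUADS = [
    ("g1", "s1", "p1", "o1"),
    ("g1", "s1", "p2", "o2"),
    ("g1", "s2", "p1", "o1"),
    ("g2", "s3", "p1", "o3"),
]

# Combinations of bound parts for which this sketch gives a real count:
# G; G,P; G,P,O; G,S; G,S,P.
SUPPORTED = {frozenset(c) for c in ("g", "gp", "gpo", "gs", "gsp")}


def estimate(g=None, s=None, p=None, o=None):
    """Return (status, count); None stands for a wildcard."""
    pattern = (g, s, p, o)
    bound = frozenset(n for n, v in zip("gspo", pattern) if v is not None)
    if bound in SUPPORTED:
        # Count quads matching every bound part of the pattern.
        count = sum(
            all(v is None or v == q for v, q in zip(pattern, quad))
            for quad in QUADS
        )
        return ("estimate", count)
    # Fallback: size of the graph if one is given, else of the whole store.
    total = sum(1 for quad in QUADS if g is None or quad[0] == g)
    return ("unknown", total)
```
&lt;/pre&gt;
&lt;p&gt;A federated query planner could call &lt;code&gt;estimate(g="g1", p="p1")&lt;/code&gt; while comparing join orders and treat an &amp;quot;unknown&amp;quot; answer as an upper bound.&lt;/p&gt;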
&lt;h4&gt;Query Feasibility&lt;/h4&gt;
&lt;p&gt;Given the full SPARQL request, the end point could return the following data, without executing the query itself.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Syntax errors vs. parsed successfully?&lt;/li&gt;
&lt;li&gt;Are there graph, predicate or subject constants in the query which do not exist in this end point? What are these? Do they cause the query result to always be empty?&lt;/li&gt;
&lt;li&gt;How many results are expected, according to the SPARQL compiler cost model? This is a row count. If the query is a construct or describe query, this is the count of rows that will go as input to the construct/describe.&lt;/li&gt;
&lt;li&gt;What is the execution time, as guessed by the SPARQL compiler cost model?&lt;/li&gt;
&lt;li&gt;Execution plan, in whatever implementation-specific but in principle human-readable format the end point uses.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All these elements would be optional.&lt;/p&gt;
&lt;p&gt;This somewhat overlaps with the optimization questions but it may still be the case that it is more efficient to support a special interface for the optimization related questions.&lt;/p&gt;
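&lt;p&gt;To make the idea of an all-optional answer concrete, here is a small sketch, not a proposed format. The class name &lt;code&gt;FeasibilityReport&lt;/code&gt;, its field names, the &lt;code&gt;KNOWN_TERMS&lt;/code&gt; set and the example IRIs are hypothetical. The sketch also assumes that the constants passed in all come from required triple patterns, so that a missing constant does make the result empty. With OPTIONAL or UNION this would not hold.&lt;/p&gt;
&lt;pre&gt;
```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FeasibilityReport:
    """Every field is optional; an end point fills in only what it knows."""
    parsed: Optional[bool] = None
    unknown_terms: List[str] = field(default_factory=list)
    always_empty: Optional[bool] = None
    estimated_rows: Optional[int] = None
    estimated_seconds: Optional[float] = None
    plan: Optional[str] = None


# Hypothetical set of IRIs known to this end point.
KNOWN_TERMS = {"http://example.org/g1", "http://example.org/name"}


def check_constants(required_constants):
    """Flag constants that are absent from the end point.

    Assumption: every constant comes from a required triple pattern,
    so one missing constant is enough to make the result empty.
    """
    missing = [t for t in required_constants if t not in KNOWN_TERMS]
    return FeasibilityReport(unknown_terms=missing, always_empty=bool(missing))
```
&lt;/pre&gt;
&lt;p&gt;Fields that the end point cannot compute, such as the estimated row count here, simply stay unset.&lt;/p&gt;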
&lt;/div&gt;</description></item><item><title>More Thoughts on ORDBMS Clients, .NET and RDF</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2006-07-17#1008</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1008#comments</comments><pubDate>Mon, 17 Jul 2006 12:16:02 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-04-16T16:13:30.000001-04:00</n0:modified><description>&lt;div&gt;
&lt;div style=&quot;display:none;&quot;&gt;More Thoughts on ORDBMS Clients, .NET and RDF&lt;/div&gt;
&lt;p&gt;Continuing on from &lt;a href=&quot;http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1002&quot; id=&quot;link-id1064f0c8&quot;&gt;the previous post&lt;/a&gt;... If Microsoft opens the right interfaces for independent developers, we see many exciting possibilities for using &lt;a href=&quot;http://msdn2.microsoft.com/en-us/data/aa937699.aspx&quot; id=&quot;link-id10f3ab60&quot;&gt;ADO.NET&lt;/a&gt; 3.0 with &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x171ad660&quot;&gt;Virtuoso&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Microsoft quite explicitly states that their thrust is to decouple the client side representation of &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0xdaf01b0&quot;&gt;data&lt;/a&gt; as .NET objects from the relational schema on the database. This is a worthy goal.&lt;/p&gt;
&lt;p&gt;But we can also see other possible applications of the technology when we move away from strictly relational back ends. This can go in two directions: Towards object oriented database (OODBMS) and towards making applications for the &lt;a href=&quot;http://dbpedia.org/resource/Semantic_Web&quot; id=&quot;link-id0x175fa2f0&quot;&gt;semantic web&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In the OODBMS direction, we could equate Virtuoso table hierarchies with .NET classes and create a tighter coupling between client and database, going as it were in the other direction from Microsoft&amp;#39;s intended decoupling. For example, we could do typical OODBMS tricks such as pre-fetch of objects based on storage clustering. The simplest case of this is like virtual memory, where the request for one byte brings in the whole page or group of pages. The basic idea is that what is created together probably gets used together and if all objects are modeled as subclasses (sub-tables) of a common superclass, then, regardless of instance type, what is created together (has consecutive IDs) will indeed tend to cluster on the same page. These tricks can deliver good results in very navigational applications like GIS or CAD. But these are rather specialized things and we do not see OODBMS making any great comeback.&lt;/p&gt;
&lt;p&gt;But what is more interesting and more topical at present is making clients for the &lt;a href=&quot;http://dbpedia.org/resource/Resource_Description_Framework&quot; id=&quot;link-id0xc58f9f8&quot;&gt;RDF&lt;/a&gt; world. There, the OWL ontology could be used to make the .NET classes, and the DBMS could, when returning URIs serving as subjects of triples, include specified predicates on these subjects, enough to allow instantiating .NET instances as &amp;quot;proxies&amp;quot; of these RDF objects. Of course, only predicates for which the client has a representation are relevant, thus some client-server handshake is needed at the start. What data could be pre-fetched is like the intersection of a concise bounded description and what the client has classes for. The rest of the mapping would be very simple, with IRIs becoming pointers, multi-valued predicates becoming lists, and so on. IRIs for which the RDF type is not known or inferable could be left out or represented as a special class with name-value pairs for its attributes. The same goes for blank nodes.&lt;/p&gt;
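&lt;p&gt;The mapping described above might look roughly like the following. This is a language-neutral sketch written in Python rather than .NET, and the classes &lt;code&gt;Person&lt;/code&gt; and &lt;code&gt;Untyped&lt;/code&gt;, the &lt;code&gt;ex:&lt;/code&gt; prefix convention for recognizing IRIs and the sample triples are all invented for illustration. IRIs become object references, multi-valued predicates become lists, untyped IRIs fall back to name-value pairs, and predicates the client has no representation for are skipped.&lt;/p&gt;
&lt;pre&gt;
```python
class Person:
    """Hypothetical client class generated from an ontology."""
    def __init__(self, iri):
        self.iri = iri
        self.name = []   # multi-valued predicates become lists
        self.knows = []  # IRI-valued predicates hold object references


class Untyped:
    """Fallback for IRIs whose type is unknown: plain name-value pairs."""
    def __init__(self, iri):
        self.iri = iri
        self.attributes = {}


CLASSES = {"Person": Person}

TRIPLES = [
    ("ex:alice", "rdf:type", "Person"),
    ("ex:alice", "name", "Alice"),
    ("ex:alice", "knows", "ex:bob"),
    ("ex:alice", "shoeSize", "38"),  # no client representation, skipped
    ("ex:bob", "name", "Bob"),
]


def materialize(triples):
    """Build one proxy object per IRI from a list of triples."""
    types = {s: o for s, p, o in triples if p == "rdf:type"}
    proxies = {}

    def proxy(iri):
        if iri not in proxies:
            cls = CLASSES.get(types.get(iri), Untyped)
            proxies[iri] = cls(iri)
        return proxies[iri]

    for s, p, o in triples:
        subject = proxy(s)
        if p == "rdf:type":
            continue
        # Assumption: anything starting with "ex:" is an IRI, else a literal.
        value = proxy(o) if o.startswith("ex:") else o
        if isinstance(subject, Untyped):
            subject.attributes.setdefault(p, []).append(value)
        elif isinstance(getattr(subject, p, None), list):
            getattr(subject, p).append(value)
    return proxies
```
&lt;/pre&gt;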
&lt;p&gt;In this way, .NET&amp;#39;s considerable UI capabilities could directly be exploited for visualizing RDF data, only given that the data complies reasonably well with a known ontology.&lt;/p&gt;
&lt;p&gt;If a &lt;a href=&quot;http://dbpedia.org/resource/SPARQL&quot; id=&quot;link-id0xc5d8728&quot;&gt;SPARQL&lt;/a&gt; query returned a result-set, IRI type columns would be returned as .NET instances and the server would pre-fetch enough data for filling them in. For a CONSTRUCT, a collection object could be returned with the objects materialized inside. If the interfaces allow passing an &lt;a href=&quot;http://dbpedia.org/resource/Entity&quot; id=&quot;link-id0x19a434e8&quot;&gt;Entity&lt;/a&gt; &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0x1a146d30&quot;&gt;SQL&lt;/a&gt; string, these could possibly be specialized to allow for a SPARQL string instead. LINQ might have to be extended to allow for SPARQL type queries, though.&lt;/p&gt;
&lt;p&gt;Many of these questions will be better answerable as we get more details on Microsoft&amp;#39;s forthcoming &lt;a href=&quot;http://dbpedia.org/resource/ADO.NET&quot; id=&quot;link-id0x985bc50&quot;&gt;ADO&lt;/a&gt; .NET release. We hope that sufficient latitude exists for exploring all these interesting avenues of development.&lt;/p&gt;
&lt;/div&gt;</description></item><item><title>Object Relational Rediscovered?</title><guid>http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2006-07-13#1003</guid><comments>http://virtuoso.openlinksw.com/blog/vdb/blog/?id=1003#comments</comments><pubDate>Thu, 13 Jul 2006 12:33:32 GMT</pubDate><n0:modified xmlns:n0="http://www.openlinksw.com/weblog/">2008-04-16T16:13:26-04:00</n0:modified><description>&lt;div&gt;
&lt;div style=&quot;display:none;&quot;&gt;Object Relational Rediscovered?&lt;/div&gt;
&lt;p&gt;I have recently read some of Microsoft&amp;#39;s &lt;a href=&quot;http://dbpedia.org/resource/ADO.NET&quot; id=&quot;link-id0x173cea20&quot;&gt;ADO&lt;/a&gt; .NET 3 papers. I am reminded of the distant past when I designed Kubl, which later became OpenLink &lt;a href=&quot;http://virtuoso.openlinksw.com&quot; id=&quot;link-id0x18bdfe68&quot;&gt;Virtuoso&lt;/a&gt;. So I will reminisce and speculate a little.&lt;/p&gt;
&lt;p&gt;So now is the time when polymorphic queries and mixing relational style joins and object style navigation become politically acceptable and even recommended and there finally is a workable solution to having a foreign key in the database and a pointer or set of pointers in the client application. Not to mention change tracking so as to be able to update in-memory &lt;a href=&quot;http://dbpedia.org/resource/Data&quot; id=&quot;link-id0xd6f0ae0&quot;&gt;data&lt;/a&gt; structures and commit a delta against the database without explicit update statements.&lt;/p&gt;
&lt;p&gt;All these questions existed already in the mid 90s and earlier. Since I was coming from OO and LISP into the database world, I even felt these questions to be important. The solution in the earliest Kubl was to have inheritance between tables, what became the &lt;a href=&quot;http://dbpedia.org/resource/SQL&quot; id=&quot;link-id0xddcdac0&quot;&gt;SQL&lt;/a&gt; 2K &lt;code&gt;UNDER&lt;/code&gt; clause, and a virtual column called &lt;code&gt;_ROW&lt;/code&gt; that would select a serialization of the primary key entry. Then there was the function &lt;code&gt;row_key()&lt;/code&gt;, which when applied to a &lt;code&gt;_ROW&lt;/code&gt; virtual column would return a database-wide unique identifier of the row, containing the key info and the key part values plus which subtable of the table was at hand. Then there was a function for dereferencing a &lt;code&gt;row_key&lt;/code&gt; for getting the &lt;code&gt;_ROW&lt;/code&gt;. And one could store &lt;code&gt;row_keys&lt;/code&gt; into columns and dereference these in queries. Within SQL, one could use the &lt;code&gt;row_column&lt;/code&gt; function to extract individual column values from a &lt;code&gt;row_key&lt;/code&gt; or &lt;code&gt;_ROW&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;This was all fine server side. But we also had a client for Franz Inc.&amp;#39;s Allegro Common Lisp that talked to Kubl&amp;#39;s &lt;a href=&quot;http://dbpedia.org/resource/Open_Database_Connectivity&quot; id=&quot;link-id0xde2c348&quot;&gt;ODBC&lt;/a&gt; listener. This client had the basic statements and prepared statements and result sets, parameters and array parameters, a little like &lt;a href=&quot;http://dbpedia.org/resource/Java_Database_Connectivity&quot; id=&quot;link-id0x156409f8&quot;&gt;JDBC&lt;/a&gt; does now. But the extra was that we could do a mapping between a Lisp struct or object and a database key, so the &lt;code&gt;_ROW&lt;/code&gt; would automatically materialize into the Lisp struct or class instance. And the mapping between these materializations and the &lt;code&gt;row_keys&lt;/code&gt; identifying them in the database was kept in a thread environment called object space. Updates could be relational-style &lt;code&gt;UPDATEs&lt;/code&gt; or consist of putting a &lt;code&gt;_ROW&lt;/code&gt; serialization in database format back into the Kubl store with a single SQL function.&lt;/p&gt;
&lt;p&gt;This was different from just storing object serializations into LOB columns, as is often done, insofar as the object classes and data members were really database tables and columns, thus native to the DBMS, not just opaque data to be processed client-side only.&lt;/p&gt;
&lt;p&gt;So it was then possible to program a little like what is shown in the ADO .NET 3 demos today, some ten years later.&lt;/p&gt;
&lt;p&gt;Some of these functions still exist in Virtuoso, albeit in a deprecated state, and there is no client that can use these to any advantage. Indeed, we dropped this line of work when Kubl became Virtuoso, mostly because there was no standard and no client applications that would use such features. Instead, we concentrated on virtual &lt;a href=&quot;http://dbpedia.org/resource/Relational_database_management_system&quot; id=&quot;link-id0x175a7b10&quot;&gt;RDBMS&lt;/a&gt;, transparently accessing any third party data via ODBC.&lt;/p&gt;
&lt;p&gt;Now however, as objects, both native SQL and Java and .NET, have become mainstream citizens of relational databases in general, Virtuoso and otherwise, and as Microsoft has legitimized accessing whole objects and not only scalar columns in result sets as part of ADO .NET 3, these things might be worth a second look.&lt;/p&gt;
&lt;/div&gt;</description></item>
</channel>
</rss>
