<?xml version="1.0" encoding="UTF-8" ?>
<!--RDF based XML document generated By OpenLink Virtuoso-->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
 <rss:channel xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/">
  <rss:title>OpenLink Virtuoso (Product Blog)</rss:title>
  <rss:link>http://virtuoso.openlinksw.com/blog/vdb/blog/</rss:link>
  <rss:description>A great place to track Virtuoso&#39;s rapid evolution.</rss:description>
  <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">kidehen@openlinksw.com</dc:creator>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2013-05-20T14:53:04Z</dc:date>
  <rss:items>
   <rdf:Seq>
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2013-05-13#1729" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2012-12-03#1727" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2012-11-28#1725" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2012-11-27#1721" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2012-08-16#1719" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2012-08-07#1717" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2012-04-23#1715" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2012-04-17#1713" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2012-04-17#1712" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2012-04-17#1711" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2012-04-17#1710" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2012-04-17#1709" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2012-04-17#1708" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-09-30#1700" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-07-26#1697" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-07-22#1695" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-03-22#1692" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-03-22#1690" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-03-22#1688" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-03-22#1687" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-03-22#1686" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-03-22#1685" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-03-10#1680" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-03-10#1679" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-03-09#1676" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-03-09#1674" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-03-07#1672" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-03-07#1670" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-03-07#1668" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-03-04#1666" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-03-02#1664" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-02-28#1661" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-02-28#1659" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-01-19#1650" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-09-22#1638" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-09-22#1637" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-09-21#1635" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-09-21#1634" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-09-21#1633" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-09-13#1629" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-09-13#1628" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-04-14#1623" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-04-07#1621" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-04-05#1619" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-04-02#1617" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-03-15#1615" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-02-12#1607" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-02-12#1606" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-11-11#1588" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-10-27#1586" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-09-01#1583" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-09-01#1581" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-09-01#1580" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-09-01#1579" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-09-01#1578" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-09-01#1573" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-08-19#1571" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-08-14#1569" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-06-29#1563" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-05-28#1558" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-04-30#1555" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-04-30#1553" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-04-30#1552" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-04-30#1549" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-04-27#1545" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-04-01#1541" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-03-25#1538" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-03-24#1536" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-03-16#1533" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-03-05#1529" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-02-16#1527" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-01-09#1516" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-01-02#1511" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-12-18#1507" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-12-17#1505" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-12-17#1503" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-12-16#1501" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-12-16#1499" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-12-11#1496" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-12-11#1495" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-12-09#1493" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-11-27#1488" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-11-20#1485" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-11-04#1481" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-11-04#1480" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-11-04#1477" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-11-03#1473" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-11-03#1472" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-10-26#1467" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-10-26#1466" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-10-24#1460" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-10-02#1451" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-10-02#1450" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-09-30#1446" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-09-08#1436" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-09-08#1435" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-09-05#1432" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-08-27#1423" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-08-25#1419" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-08-06#1410" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-08-01#1403" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-07-30#1401" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-07-17#1393" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-06-09#1383" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-06-09#1382" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-06-09#1381" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-06-09#1380" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-06-09#1379" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-05-30#1369" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-05-09#1359" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-04-30#1354" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-04-29#1350" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-04-29#1349" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-04-29#1348" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-04-14#1340" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-04-14#1339" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-04-14#1338" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-03-25#1327" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-03-06#1322" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-02-05#1313" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-02-04#1309" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-02-01#1305" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2008-01-16#1297" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2007-12-18#1287" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2007-12-07#1285" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2007-11-26#1277" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2007-11-21#1275" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2007-11-21#1272" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2007-11-21#1273" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2007-11-21#1271" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2007-09-24#1263" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2007-09-06#1251" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2007-08-28#1248" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2007-08-27#1245" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2007-07-19#1230" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2007-07-12#1226" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2007-06-11#1223" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2007-05-23#1199" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2007-05-23#1200" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2007-05-23#1201" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2007-05-23#1195" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2007-04-12#1184" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2007-03-22#1163" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2007-03-16#1160" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2007-02-05#1132" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2007-01-10#1117" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2007-01-09#1113" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2007-01-09#1110" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2007-01-09#1109" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2006-11-29#1092" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2006-11-21#1086" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2006-11-21#1087" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2006-11-01#1075" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2006-09-28#1059" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2006-09-25#1047" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2006-09-19#1044" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2006-08-10#1025" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2006-07-31#1022" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2006-07-18#1011" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2006-07-17#1008" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2006-07-13#1003" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2006-07-13#1001" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2006-07-11#999" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2006-04-27#964" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2006-04-24#962" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2006-04-11#950" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-05-13#845" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-05-13#844" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-05-01#833" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-04-29#824" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-04-28#821" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-04-28#818" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-04-26#812" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-04-25#809" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-04-19#798" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-04-13#787" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-04-07#786" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-03-28#774" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-03-26#768" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-03-22#762" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-03-20#759" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-03-17#756" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-03-08#750" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-03-08#749" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-03-08#743" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-03-07#735" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-03-07#738" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-03-04#730" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-03-03#727" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-03-03#724" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-03-03#723" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-03-02#717" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-03-01#714" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-03-01#712" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-03-01#709" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-02-28#706" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-02-28#703" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-02-28#700" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-02-25#697" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-02-24#694" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-01-11#665" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2005-01-04#659" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-12-24#654" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-12-09#649" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-12-09#647" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-11-12#639" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-10-18#632" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-10-15#732" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-10-05#625" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-09-19#621" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-08-05#608" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-07-22#599" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-07-10#595" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-07-09#592" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-07-08#589" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-07-07#586" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-07-01#575" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-07-01#572" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-06-30#573" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-06-24#566" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-06-09#558" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-06-04#556" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-05-28#554" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-05-20#549" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-05-14#544" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-04-13#520" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-04-09#517" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-04-07#505" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-04-07#503" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-04-07#510" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-04-07#515" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-04-06#502" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-04-06#509" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-04-06#514" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-03-24#494" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-03-23#492" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-03-23#490" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-03-23#489" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-03-23#486" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-03-23#485" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-03-17#479" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-02-05#465" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-02-02#461" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-02-02#460" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-01-12#453" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-01-09#450" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2004-01-06#443" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2003-12-05#441" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2003-12-03#438" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2003-12-02#435" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2003-11-13#430" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2003-11-11#427" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2003-11-11#426" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2003-11-05#418" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2003-11-05#417" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2003-10-31#409" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2003-10-30#408" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2003-10-28#405" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2003-10-26#403" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2003-10-24#398" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2003-10-23#394" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2003-10-14#391" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2003-10-03#388" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2003-10-02#385" />
      <rdf:li rdf:resource="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2003-09-26#376" />
   </rdf:Seq>
  </rss:items>
 </rss:channel>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2013-05-13#1729">
  <rss:title>Virtuoso 7 Release</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2013-05-13T16:06:02Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">The quest of OpenLink Software is to bring flexibility, efficiency, and expressive power to people working with data. For the past several years, this has been focused on making graph data models viable for the enterprise. Flexibility in schema evolution is a central aspect of this, as is the ability to share identifiers across different information systems, i.e., giving things URIs instead of synthetic keys that are not interpretable outside of a particular application. With Virtuoso 7, we dramatically improve the efficiency of all this. With databases in the billions of relations (also known as triples, or 3-tuples), we can fit about 3x as many relations in the same space (disk and RAM) as with Virtuoso 6. Single-threaded query speed is up to 3x better, plus there is intra-query parallelization even in single-server configurations. Graph data workloads are all about random lookups. With these, having data in RAM is all-important. With 3x space efficiency, you can run with 3x more data in the same space before starting to go to disk. In some benchmarks, this can make a 20x gain. Also the Virtuoso scale-out support is fundamentally reworked, with much more parallelism and better deployment flexibility. So, for graph data, Virtuoso 7 is a major step in the coming of age of the technology. Data keeps growing and time is getting scarcer, so we need more flexibility and more performance at the same time. So, let’s talk about how we accomplish this. Column stores have been the trend in relational data warehousing for over a decade. With column stores comes vectored execution, i.e., running any operation on a large number of values at one time. Instead of running one operation on one value, then the next operation on the result, and so forth, you run the first operation on thousands or hundreds-of-thousands of values, then the next one on the results of this, and so on. 
Column-wise storage brings space efficiency, since values in one column of a table tend to be alike -- whether repeating, sorted, within a specific range, or picked from a particular set of possible values. With graph data, where there are no columns as such, the situation is exactly the same -- just substitute the word predicate for column. Space efficiency brings speed -- first by keeping more of the data in memory; secondly by having less data travel between CPU and memory. Vectoring makes sure that data that are closely located get accessed in close temporal proximity, hence improving cache utilization. When there is no locality, there are a lot of operations pending at the same time, as things always get done on a set of values instead of on a single value. This is the crux of the science of columns and vectoring. Of the prior work in column stores, Virtuoso may most resemble Vertica, well described in Daniel Abadi’s famous PhD thesis. Virtuoso itself is described in IEEE Data Engineering Bulletin, March 2012 (PDF). The first experiments in column store technology with Virtuoso were in 2009, published at the SemData workshop at VLDB 2010 in Singapore. We tried storing TPC-H as graph data and in relational tables, each with both rows and columns, and found that we could get a space utilization of 6 bytes per quad with the RDF-ization of TPC-H, as opposed to 27 bytes with the row-wise compressed RDF storage model. The row-wise compression itself is 3x more compact than a row-wise representation with no compression. Memory is the key to speed, and space efficiency is the key to memory. Performance comes from two factors: locality and parallelism. Both are addressed by column store technology. This made me a convert. At this time, we also started the EU FP7 project, LOD2, most specifically working with Peter Boncz of CWI, the king of the column store, famous for MonetDB and VectorWise. 
This cooperation goes on within LOD2 and has extended to LDBC, an FP7 project for designing benchmarks for graph and RDF databases. Peter has given us a world of valuable insight and experience in all aspects of avant-garde database technology, from adaptive techniques to query optimization and beyond. One thing that was recently published is the results for a Virtuoso cluster at CWI, running analytics on 150 billion relations on CWI’s SciLens cluster. The SQL relational table-oriented databases and property graph-oriented databases (Graph for short) are both rooted in relational database science. Graph management simply introduces extra challenges with regard to scalability. Hence, at OpenLink Software, having a good grounding in the best practices of relational columnar (or column-wise) database management technology is vital. Virtuoso is more prominently known for high-performance RDF-based graph database technology, but the entirety of its SQL relational data management functionality (which is the foundation for the graph store) is vectored, and even allows users to choose between row-wise and column-wise physical layouts, index by index. It has been asked: is this a new NoSQL engine? Well, there isn’t really such a thing. There are of course database engines that do not have SQL support, and it has become trendy to call them &quot;NoSQL.&quot; So, in this space, Virtuoso is an engine that does support SQL, plus SPARQL, and is designed to do big joins and aggregation (i.e., analytics) and fast bulk load, as well as ACID transactions on small updates, all with column store space efficiency. It is not only for big scans, as people tend to think about column stores, since it can also be used in compact embedded form. Virtuoso also delivers great parallelism and throughput in a scale-out setting, with no restrictions on transactions and no limits on joining. 
The base is in relational database science, but all the adaptations that RDF and graph workloads need are built-in, with core level support for run-time data-typing, URIs as native Reference types, user-defined custom data types, etc. Now that the major milestone of releasing Virtuoso 7 (open source and commercial editions) has been reached, the next steps include enabling our current and future customers to attain increased agility from big (linked) open data exploits. Technically, it will also include continued participation in DBMS industry benchmarks, such as those from the TPC, and others under development via the Linked Data Benchmark Council (LDBC), plus other social-media-oriented challenges that arise in this exciting data access, integration, and management innovation continuum. Thus, continue to expect new optimization tricks to be introduced at frequent intervals through the open source development branch at GitHub, between major commercial releases. Related Column Store Tutorial from Peter Boncz and Daniel Abadi NewSQL vs NoSQL session by Prof. Michael Stonebraker Virtuoso 7.0 White Paper by Orri Erling (ACM edition)</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>The quest of <a href="http://www.openlinksw.com/" id="link-id0x7f51106137d8">OpenLink Software</a> is to bring flexibility, efficiency, and expressive power to people working with data. For the past several years, this has been focused on making graph data models viable for the enterprise. Flexibility in schema evolution is a central aspect of this, as is the ability to share identifiers across different information systems, i.e., giving things URIs instead of synthetic keys that are not interpretable outside of a particular application.</p>

<p>With <a href="http://virtuoso.openlinksw.com/news/virtuoso-v7-press-release-20130423" id="link-id0x7f51d7c4e7a8">Virtuoso 7</a>, we dramatically improve the efficiency of all this. With databases in the billions of relations (also known as triples, or 3-tuples), we can fit about 3x as many relations in the same space (disk and RAM) as with Virtuoso 6. Single-threaded query speed is up to 3x better, plus there is intra-query parallelization even in single-server configurations. Graph data workloads are all about random lookups. With these, having data in RAM is all-important. With 3x space efficiency, you can run with 3x more data in the same space before starting to go to disk. In some benchmarks, this can make a 20x gain.</p>
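<p>The step from 3x space efficiency to a much larger speedup can be illustrated with a toy cost model. This is only a sketch: the function name, the uniform random access pattern, and the relative costs of a RAM hit and a disk read are all assumptions, and the disk cost of 30 is picked purely so that the toy lands near the 20x figure mentioned above. It is not a measurement of Virtuoso.</p>

```python
# Toy cost model for random lookups; every number here is an assumption.
def average_lookup_cost(data_size, ram_size, ram_cost=1.0, disk_cost=30.0):
    """Expected cost of one uniformly random lookup.

    The fraction of the data that fits in RAM is served at ram_cost,
    the remainder at disk_cost.
    """
    hit_rate = min(1.0, ram_size / data_size)
    return hit_rate * ram_cost + (1.0 - hit_rate) * disk_cost


ram = 100.0                  # arbitrary units of memory
old_size = 300.0             # hypothetical data set, 3x larger than RAM
new_size = old_size / 3.0    # same data stored 3x more compactly

old_cost = average_lookup_cost(old_size, ram)
new_cost = average_lookup_cost(new_size, ram)
speedup = old_cost / new_cost
print(round(old_cost, 2), round(new_cost, 2), round(speedup, 1))
```

<p>In this model, only a third of the lookups hit RAM before the change and all of them do afterwards, so the gain is dominated by the avoided disk reads rather than by the 3x itself.</p>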

<p>The Virtuoso scale-out support has also been fundamentally reworked, with much more parallelism and better deployment flexibility.</p>

<p>So, for graph data, Virtuoso 7 is a major step in the coming of age of the technology. Data keeps growing and time is getting scarcer, so we need more flexibility and more performance at the same time.</p>

<p>So, let’s talk about how we accomplish this. Column stores have been the trend in relational data warehousing for over a decade. With column stores comes vectored execution, i.e., running any operation on a large number of values at one time. Instead of running one operation on one value, then the next operation on the result, and so forth, you run the first operation on thousands or hundreds-of-thousands of values, then the next one on the results of this, and so on.</p>
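<p>The difference in control flow between the two execution styles can be sketched as follows. The operator names, the vector size, and the use of plain Python lists are illustrative assumptions; this shows the order in which work is done, not how Virtuoso implements it, and Python itself will not exhibit the cache benefits.</p>

```python
# Two ways to evaluate "add 1, then double" over a column of values.

def add_one(x):
    return x + 1


def double(x):
    return x * 2


def tuple_at_a_time(values):
    """Run the whole operator pipeline on one value before touching the next."""
    out = []
    for v in values:
        out.append(double(add_one(v)))
    return out


def vectored(values, vector_size=10000):
    """Run each operator over a whole vector of values before the next operator."""
    out = []
    for start in range(0, len(values), vector_size):
        batch = values[start:start + vector_size]
        batch = [add_one(v) for v in batch]   # first operator, whole batch
        batch = [double(v) for v in batch]    # second operator, on those results
        out.extend(batch)
    return out


values = list(range(25000))
assert tuple_at_a_time(values) == vectored(values)
```

<p>Both functions return the same result; the vectored form simply keeps each operator running over many values in a row, which is what gives a real engine the chance to exploit locality and parallelism.</p>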

<p>Column-wise storage brings space efficiency, since values in one column of a table tend to be alike -- whether repeating, sorted, within a specific range, or picked from a particular set of possible values. With graph data, where there are no columns as such, the situation is exactly the same -- just substitute the word <i>predicate</i> for <i>column.</i> Space efficiency brings speed -- first by keeping more of the data in memory; secondly by having less data travel between CPU and memory. Vectoring makes sure that data that are closely located get accessed in close temporal proximity, hence improving cache utilization. When there is no locality, there are a lot of operations pending at the same time, as things always get done on a set of values instead of on a single value. This is the crux of the science of columns and vectoring.</p>
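<p>To make the point about alike values concrete, here is a minimal sketch of two generic column compression techniques: run-length encoding for repeating values and delta encoding for sorted ones. The sample predicate names and subject IDs are made up for illustration, and the source does not say which encodings Virtuoso actually uses.</p>

```python
from itertools import groupby


def run_length_encode(column):
    """Collapse runs of equal adjacent values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]


def run_length_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out


def delta_encode(sorted_column):
    """Store the first value, then the differences between neighbours."""
    if not sorted_column:
        return []
    deltas = [sorted_column[0]]
    for prev, cur in zip(sorted_column, sorted_column[1:]):
        deltas.append(cur - prev)
    return deltas


# Hypothetical predicate column of a graph table, sorted by predicate.
predicate_column = ["rdf:type"] * 5 + ["foaf:name"] * 3 + ["foaf:knows"] * 4
encoded = run_length_encode(predicate_column)
print(encoded)
# [('rdf:type', 5), ('foaf:name', 3), ('foaf:knows', 4)]

# Hypothetical sorted subject IDs: small deltas need fewer bits than full IDs.
subject_ids = [1000, 1001, 1001, 1004, 1010]
print(delta_encode(subject_ids))
# [1000, 1, 0, 3, 6]
```

<p>Twelve predicate values collapse into three pairs, and the sorted IDs turn into small numbers, which is the kind of regularity a column store can exploit and a row store cannot.</p>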

<p>Of the prior work in column stores, Virtuoso may most resemble <a href="http://www.vertica.com/" id="link-id0x7f51d66bfc88">Vertica</a>, well described in <a href="http://cs-www.cs.yale.edu/homes/dna/" id="link-id0x7f5110b17b68">Daniel Abadi</a>’s famous <a href="http://dspace.mit.edu/handle/1721.1/43043" id="link-id0x7f51d68fdfc8">PhD thesis</a>. Virtuoso itself is described in <a href="http://www.informatik.uni-trier.de/~ley/db/journals/debu/index.html" id="link-id0x7f51d73b8f28">IEEE Data Engineering Bulletin</a>, <a href="http://www.informatik.uni-trier.de/~ley/db/journals/debu/debu35.html" id="link-id0x7f51c6705998">March 2012</a> (<a href="http://bit.ly/166kEnC" id="link-id0x7f51f0862d28">PDF</a>). The first experiments in column store technology with Virtuoso were in 2009, published at the <a href="http://semdata.org/events/2010/vldb" id="link-id0x7f50d10c3508">SemData workshop</a> at <a href="http://www.vldb2010.org/" id="link-id0x7f51d73fbb98">VLDB 2010</a> in Singapore. We tried storing <a href="http://www.tpc.org/tpch/" id="link-id0x7f51d5f43b58">TPC H</a> as graph data and in relational tables, each with both rows and columns, and found that we could get 6 bytes per quad space utilization with the RDF-ization of TPC H, as opposed to 27 bytes with the row-wise compressed RDF storage model. The row-wise compression itself is 3x more compact than a row-wise representation with no compression.</p>
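<p>The figures above can be put side by side with a little arithmetic. The 6 and 27 bytes per quad and the 3x row-wise compression factor come from the text; the uncompressed figure is derived from them, and the 64 GiB memory size and the helper name are assumptions for illustration only.</p>

```python
BYTES_PER_QUAD_COLUMN = 6     # column-wise figure quoted above
BYTES_PER_QUAD_ROW = 27       # row-wise compressed figure quoted above
ROW_COMPRESSION_FACTOR = 3    # row-wise compressed vs. uncompressed

# Implied size of an uncompressed row-wise quad: 27 * 3 = 81 bytes.
bytes_per_quad_uncompressed = BYTES_PER_QUAD_ROW * ROW_COMPRESSION_FACTOR

# Column-wise vs. row-wise compressed: 27 / 6 = 4.5x.
column_vs_row = BYTES_PER_QUAD_ROW / BYTES_PER_QUAD_COLUMN


def quads_that_fit(memory_bytes, bytes_per_quad):
    """Whole quads that fit in the given memory, ignoring all other overhead."""
    return memory_bytes // bytes_per_quad


ram = 64 * 2**30              # hypothetical 64 GiB of memory
print(bytes_per_quad_uncompressed, column_vs_row)
print(quads_that_fit(ram, BYTES_PER_QUAD_COLUMN))
print(quads_that_fit(ram, BYTES_PER_QUAD_ROW))
```

<p>On these numbers, the column-wise layout holds 4.5 times as many quads in the same memory as the compressed row-wise one for this particular data set; other data sets will compress differently.</p>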

<p>Memory is the key to speed, and space efficiency is the key to memory. Performance comes from two factors: locality and parallelism. Both are addressed by column store technology. This made me a convert.</p>

<p>At this time, we also started the EU FP7 project, <a href="http://lod2.eu/" id="link-id0x7f51d53489d8">LOD2</a>, most specifically working with <a href="http://homepages.cwi.nl/~boncz/" id="link-id0x7f51d5ab9f18">Peter Boncz</a> of CWI, the king of the column store, famous for <a href="http://dbpedia.org/page/MonetDB" id="link-id0x7f51d5e97078">MonetDB</a> and <a href="http://dbpedia.org/page/Vectorwise" id="link-id0x7f51d60bfa88">VectorWise</a>. This cooperation goes on within LOD2 and has extended to <a href="http://www.ldbc.eu/" id="link-id0x7f51d5ccaad8">LDBC</a>, an FP7 project for designing benchmarks for graph and RDF databases. Peter has given us a world of valuable insight and experience in all aspects of <i>avant garde</i> database technology, from adaptive techniques to query optimization and beyond. Recently published are the <a href="http://bit.ly/14ULX2F" id="link-id0x7f51d69a15e8">results for Virtuoso cluster at CWI</a>, running analytics on 150 billion relations on CWI’s SciLens cluster.</p>

<p>The SQL relational table-oriented databases and property graph-oriented databases (Graph for short) are both rooted in relational database science. Graph management simply introduces extra challenges with regard to scalability. Hence, at OpenLink Software, having a good grounding in the best practices of relational columnar (or column-wise) database management technology is vital.</p>

<p>Virtuoso is more prominently known for high-performance RDF-based graph database technology, but the entirety of its SQL relational data management functionality (which is the foundation for graph store) is vectored, and even allows users to choose between row-wise and column-wise physical layouts, index by index.</p>

<p>It has been asked: is this a new NoSQL engine? Well, there isn’t really such a thing. There are of course database engines that do not have SQL support and it has become trendy to call them &quot;NoSQL.&quot; So, in this space, Virtuoso is an engine that <i>does</i> support SQL, plus SPARQL, and is designed to do big joins and aggregation (i.e., analytics) and fast bulk load, as well as ACID transactions on small updates, all with column store space efficiency. It is not only for big scans, as people tend to think about column stores, since it can also be used in compact embedded form.</p>

<p>Virtuoso also delivers great parallelism and throughput in a scale-out setting, with no restrictions on transactions and no limits on joining. The base is in relational database science, but all the adaptations that RDF and graph workloads need are built-in, with core level support for run-time data-typing, URIs as native Reference types, user-defined custom data types, etc.</p>

<p>Now that the major milestone of releasing Virtuoso 7 (<a href="http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VOSIndex" id="link-id0x7f51d6f78e88">open source</a> and <a href="http://virtuoso.openlinksw.com/download/" id="link-id0x7f5110cf58f8">commercial editions</a>) has been reached, the next steps include enabling our current and future customers to attain increased agility from big (linked) open data exploits. Technically, it will also include continued participation in DBMS industry benchmarks, such as those from the <a href="http://www.tpc.org/" id="link-id0x7f51d5088008">TPC</a>, and others under development via the Linked Data Benchmark Council (LDBC), plus other social-media-oriented challenges that arise in this exciting data access, integration, and management innovation continuum. Thus, continue to expect new optimization tricks to be introduced at frequent intervals through the open source development branch at <a href="https://github.com/openlink/" id="link-id0x7f51c4addd18">GitHub</a>, between major commercial releases.</p>

<p>
 <i><b>Related</b>
 </i>
</p>
<ul>
 <li>
  <a href="http://bit.ly/17oSWk9" id="link-id0x7f51f0c46f18">Column Store Tutorial from Peter Boncz and Daniel Abadi</a>
 </li>
<li>
  <a href="http://slidesha.re/13RzSfq" id="link-id0x7f511a9f1b98">NewSQL vs NoSQL session by Prof. Michael Stonebraker</a>
</li>
 <li>
  <a href="http://bit.ly/166kEnC" id="link-id0x7f51e1d85de8">Virtuoso 7.0 White Paper by Orri Erling (ACM edition)</a>
 </li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2012-12-03#1727">
  <rss:title>LDBC: A Socio-technical Perspective</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2012-12-03T15:24:40Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">(Originally posted to the LDBC blog.) In recent days, cyberspace has seen some discussion concerning the relationship of the EU FP7 project LDBC (Linked Data Benchmark Council) and sociotechnical considerations. It has been suggested that LDBC, to its own and the community’s detriment, ignores sociotechnical aspects. LDBC, as research projects go, actually has an unusually large, and as of this early date, successful and thriving sociotechnical aspect, i.e., involvement of users and vendors alike. Here I will discuss why, insofar as the technical output of the project goes, sociotechnical metrics are in fact out of scope. Then again, the degree to which the benefits potentially obtained from the use of LDBC outcomes are in fact realized does depend strongly on community building, a social process. One criticism of big data projects we sometimes encounter is the point that data without context is not useful. Further, one cannot just assume that one can throw several data sets together and get meaning from this, as there may be different semantics for similar-looking things; just think of seven different definitions of blood pressure. In its initial user community meeting, LDBC was, according to its charter, focusing mostly on cases where the data is already in existence and of sufficient quality for the application at hand. Michael Brodie, Chief Scientist at Verizon, is a well-known advocate of focusing on the meaning of data, not only on processing performance. There is a piece on this matter by him, Peter Boncz, Chris Bizer, and myself in the SIGMOD Record: &quot;The Meaningful Use of Big Data: Four Perspectives – Four Challenges&quot;. 
I had a conversation with Michael at a DERI meeting a couple of years ago about measuring the total cost of technology adoption, thus including socio-technical aspects such as acceptance by users, learning curves of various stakeholders, whether in fact one could demonstrate an overall gain in productivity arising from semantic technologies. [in my words, paraphrased] &quot;Can one measure the effectiveness of different approaches to data integration?&quot; asked I. &quot;Of course one can,&quot; answered Michael, &quot;this only involves carrying out the same task with two different technologies, two different teams and then doing a double blind test with users. However, this never happens. Nobody does this because doing the task even once in a large organization is enormously costly and nobody will even seriously consider doubling the expense.&quot; LDBC does in fact intend to address technical aspects of data integration, i.e., schema conversion, entity resolution, and the like. Addressing the sociotechnical aspects of this (whether one should integrate in the first place, whether the integration result adds value, whether it violates privacy or security concerns, whether users will understand the result, what the learning curves are, etc.) is simply too diverse and so totally domain dependent that a general purpose metric cannot be developed, at least not in the time and budget constraints of the project. Further, adding a large human element in the experimental setting (e.g., how skilled the developers are, how well the stakeholders can explain their needs, how often these needs change, etc.) will lead to experiments that are so expensive to carry out and whose results will have so many unquantifiable factors that these will constitute an insuperable barrier to adoption. Experience demonstrates that even agreeing on the relative importance of quantifiable metrics of database performance is hard enough. 
Overreaching would compromise the project&#39;s ability to deliver its core value. Let us next talk about this. It is only a natural part of the political landscape that the EC&#39;s research funding choices are criticized by some members of the public. Some criticism is about the emphasis on big data. Big data is a fact on the ground, and research and industry need to deal with it. Of course, there have been and will be critics of technology in general on moral or philosophical grounds. Instead of opening this topic, I will refer you to an article by Michael Brodie. In a world where big data is a given, lowering the entry threshold for big data applications, thus making them available not only to government agencies and the largest businesses, seems ethical to me, as per Brodie&#39;s checklist. LDBC will contribute to this by driving greater availability, better performance, and lower cost for these technologies. Once we accept that big data is there and is important, we arrive at the issue of deriving actionable meaning from it. A prerequisite of deriving actionable meaning from big data is the ability to flexibly process this data. LDBC is about creating metrics for this. The prerequisites for flexibly working with data are fairly independent of the specific use case, while the criteria of meaning, let alone actionable analysis, are very domain specific. Therefore, in order to provide the greatest service to the broadest constituency, LDBC focuses on measuring that which is most generic, yet will underlie any decision support or other data processing deployment that involves RDF or graph data. I would say that LDBC is an exceptionally effective use of taxpayer money. LDBC will produce metrics that will drive technological innovation for years to come. The total money spent towards pursuing goals set forth by LDBC is likely to vastly exceed the budget of LDBC. Only think of the person-centuries or even millennia that have gone into optimizing for TPC-C and TPC-H. 
The vast majority of the money spent for these pursuits is paid by industry, not by research funding. It is spent worldwide, not in Europe alone. Thus, if LDBC is successful, a limited amount of EC research money will influence how much larger product development budgets are spent in the future. This multiplier effect applies of course to highly successful research outcomes in general but is especially clear with LDBC. European research funding has played a significant role in creating the foundations of the RDF/Linked Data scene. LDBC is a continuation of this policy; however, the focus has now shifted to reflect the greater maturity of the technology. LDBC is now about making the RDF and graph database sectors into mature industries whose products can predictably tackle the challenges out there.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>
<i>(Originally posted to <a href="http://www.ldbc.eu/blog/ldbc-socio-technical-perspective" id="link-id0x7f4529b06088">the LDBC blog</a>.)</i>
</p>

<p>In recent days, cyberspace has seen some discussion concerning the relationship of the EU FP7 project <a href="http://www.ldbc.eu/" id="link-id0x7f41908e6af8">LDBC (Linked Data Benchmark Council)</a> and sociotechnical considerations. It has been suggested that LDBC, to its own and the community’s detriment, ignores sociotechnical aspects.</p>

<p>LDBC, as research projects go, actually has an unusually large, and as of this early date, successful and thriving sociotechnical aspect, i.e., involvement of users and vendors alike. Here I will discuss why, insofar as the technical output of the project goes, sociotechnical metrics are in fact out of scope. Then again, the degree to which the benefits potentially obtained from the use of LDBC outcomes are in fact realized does depend strongly on community building, a social process.</p>

<p>One criticism of big data projects we sometimes encounter is the point that data without context is not useful. Further, one cannot just assume that one can throw several data sets together and get meaning from this, as there may be different semantics for similar-looking things; just think of seven different definitions of blood pressure.</p>

<p>In its <a href="http://www.ldbc.eu/events/1st-ldbc-technical-user-community-meeting" id="link-id0x7f477e386598">initial user community meeting</a>, LDBC was, according to its charter, focusing mostly on cases where the data is already in existence and of sufficient quality for the application at hand.</p>

<p>Michael Brodie, Chief Scientist at Verizon, is a well-known advocate of focusing on the meaning of data, not only on processing performance. There is a piece on this matter by him, Peter Boncz, Chris Bizer, and myself in the SIGMOD Record: &quot;<a href="http://www.sigmod.org/publications/sigmod-record/1112/pdfs/10.report.bizer.pdf" id="link-id0x7f43522f8398">The Meaningful Use of Big Data: Four Perspectives – Four Challenges</a>&quot;.</p>

<p>I had a conversation with Michael at a DERI meeting a couple of years ago about measuring the total cost of technology adoption, thus including socio-technical aspects such as acceptance by users, learning curves of various stakeholders, whether in fact one could demonstrate an overall gain in productivity arising from semantic technologies.  [in my words, paraphrased] </p>
<blockquote>
<p>
  <i>&quot;Can one measure the effectiveness of different approaches to data integration?&quot;</i> asked I. </p>
 <p>
  <i>&quot;Of course one can,&quot;</i> answered Michael, <i>&quot;this only involves carrying out the same task with two different technologies, two different teams and then doing a double-blind test with users.  However, this never happens. Nobody does this because doing the task even once in a large organization is enormously costly and nobody will even seriously consider doubling the expense.&quot;</i>
 </p>
</blockquote>

<p>LDBC does in fact intend to address technical aspects of data integration, i.e., schema conversion, entity resolution, and the like. Addressing the sociotechnical aspects of this (whether one should integrate in the first place, whether the integration result adds value, whether it violates privacy or security concerns, whether users will understand the result, what the learning curves are, etc.) is simply too diverse and so totally domain dependent that a general purpose metric cannot be developed, at least not in the time and budget constraints of the project.  Further, adding a large human element in the experimental setting (e.g., how skilled the developers are, how well the stakeholders can explain their needs, how often these needs change, etc.) will lead to experiments that are so expensive to carry out and whose results will have so many unquantifiable factors that these will constitute an insuperable barrier to adoption.</p>

<p>Experience demonstrates that even agreeing on the relative importance of quantifiable metrics of database performance is hard enough. Overreaching would compromise the project&#39;s ability to deliver its core value. Let us next talk about this.</p>

<p>It is only a natural part of the political landscape that the EC&#39;s research funding choices are criticized by some members of the public. Some criticism is about the emphasis on big data.  Big data is a fact on the ground, and research and industry need to deal with it. Of course, there have been and will be critics of technology in general on moral or philosophical grounds. Instead of opening this topic, I will refer you to <a href="http://www.michaelbrodie.com/michael_brodie_statement.asp" id="link-id0x7f43528f65c8">an article by Michael Brodie</a>.  In a world where big data is a given, lowering the entry threshold for big data applications, thus making them available not only to government agencies and the largest businesses, seems ethical to me, as per Brodie&#39;s checklist. LDBC will contribute to this by driving greater availability, better performance, and lower cost for these technologies.</p>

<p>Once we accept that big data is there and is important, we arrive at the issue of deriving actionable meaning from it. A prerequisite of deriving actionable meaning from big data is the ability to flexibly process this data. LDBC is about creating metrics for this. The prerequisites for flexibly working with data are fairly independent of the specific use case, while the criteria of meaning, let alone actionable analysis, are very domain specific. Therefore, in order to provide the greatest service to the broadest constituency, LDBC focuses on measuring that which is most generic, yet will underlie any decision support or other data processing deployment that involves RDF or graph data.</p>

<p>I would say that LDBC is an exceptionally effective use of taxpayer money.  LDBC will produce metrics that will drive technological innovation for years to come.  The total money spent towards pursuing goals set forth by LDBC is likely to vastly exceed the budget of LDBC. Only think of the person-centuries or even millennia that have gone into optimizing for TPC-C and TPC-H. The vast majority of the money spent for these pursuits is paid by industry, not by research funding. It is spent worldwide, not in Europe alone.</p>

<p>Thus, if LDBC is successful, a limited amount of EC research money will influence how much larger product development budgets are spent in the future.  This multiplier effect applies of course to highly successful research outcomes in general but is especially clear with LDBC.</p>

<p>European research funding has played a significant role in creating the foundations of the RDF/Linked Data scene.  LDBC is a continuation of this policy; however, the focus has now shifted to reflect the greater maturity of the technology.  LDBC is now about making the RDF and graph database sectors into mature industries whose products can predictably tackle the challenges out there.</p>

]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2012-11-28#1725">
  <rss:title>LDBC - the Linked Data Benchmark Council</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2012-11-28T17:08:37Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">(This posting was inadvertently delayed from the time of its writing, 2012-11-21.) The Linked Data Benchmark Council (LDBC) project is officially starting now. This represents a serious effort towards making relevant and well-thought-out metrics for RDF and graph databases and defining protocols for measurement and publishing of well-documented and reproducible results. This also entails the creation of a TPC-analog for the graph and RDF domains. The project brings together leading vendors, with OpenLink and Ontotext representing the RDF side and Neo Technology and Sparsity Technologies representing the graph database side. Peter Boncz of MonetDB and Vectorwise fame is the technical director, with participation from the Technical University of Munich with Thomas Neumann, known for RDF3X and HyPer. La Universitat Politècnica de Catalunya coordinates the project and brings strong academic expertise in graph databasing, also representing their Sparsity Technologies spinoff. FORTH (Foundation for Research and Technology - Hellas) of Crete contributes expertise in data integration and provenance. STI Innsbruck participates in community building and outreach. The consortium has a second-to-none understanding of benchmarking and has sufficient time allotted to the task for producing world-class work, comparable to the TPC benchmarks. This has to date never been realized in the RDF or graph space. History demonstrates that whenever something that is sufficiently important starts getting systematically measured, there is an improvement in the metric. The early days of the TPC saw a 40-fold increase in transaction processing speed. TPC-H continues to be, after 18 years, well used as a basis for quantifying advances in analytics databases. 
A serious initiative for well-thought-out benchmarks for guiding the emerging RDF and graph database markets is nothing short of a necessary precondition for the emergence of a serious market with several vendors offering mutually comparable products. Benchmarks are only as good as their credibility and adoption. For this reason, LDBC has been in touch with all graph and RDF vendors we could find, and has received a positive statement of intent from most, indicating that they would participate in an LDBC organization and contribute to shaping benchmarks. There is further a Technical User Community, with its initial meeting this week, where present-day end users of RDF and graph databases will voice their wishes for benchmark development. Thus benchmarks will be grounded in use cases contributed by real users. With these elements in place we have every reason to expect relevant benchmarks with broad adoption, with all the benefits this entails.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>
<i>(This posting was inadvertently delayed from the time of its writing, 2012-11-21.)</i>
</p>

<p>The <a href="http://www.ldbc.eu/" id="link-id0x7f4780561218">Linked Data Benchmark Council (LDBC)</a> project is officially starting now.</p>

<p>This represents a serious effort towards making relevant and well-thought-out <a href="http://dbpedia.org/page/Software_metric" id="link-id0x7f46086bb908">metrics</a> for <a href="http://dbpedia.org/page/Resource_Description_Framework" id="link-id0x7f4750f49728">RDF</a> and graph databases and defining protocols for measurement and publishing of well-documented and reproducible results.  This also entails the creation of a <a href="http://www.tpc.org/" id="link-id0x7f47805a1838">TPC</a>-analog for the graph and RDF domains.</p>


<p>The project brings together leading vendors, with <a href="http://dbpedia.org/page/OpenLink_Software" id="link-id0x7f47801c9ce8">OpenLink</a> and <a href="http://dbpedia.org/page/Ontotext" id="link-id0x7f478059b2f8">Ontotext</a> representing the RDF side and <a href="http://dbpedia.org/resource/Neo_Technology" id="link-id0x7f47813a0968">Neo Technology</a> and <a href="http://www.sparsity-technologies.com/" id="link-id0x7f4780128c48">Sparsity Technologies</a> representing the graph database side.   <a href="http://homepages.cwi.nl/~boncz/" id="link-id0x7f47808001c8">Peter Boncz</a> of <a href="http://dbpedia.org/page/MonetDB" id="link-id0x7f4629543fc8">MonetDB</a> and <a href="http://dbpedia.org/page/Vectorwise" id="link-id0x7f47800f84d8">Vectorwise</a> fame is the technical director, with participation from the <a href="http://dbpedia.org/page/Technical_University_Munich" id="link-id0x7f47812e0108">Technical University of Munich</a> with <a href="http://www-db.in.tum.de/~neumann/" id="link-id0x7f4780428808">Thomas Neumann</a>, known for <a href="http://code.google.com/p/rdf3x/" id="link-id0x7f478218ae38">RDF3X</a> and <a href="http://www-db.in.tum.de/research/projects/HyPer/" id="link-id0x7f47805eceb8">HyPer</a>.  <a href="http://dbpedia.org/page/Polytechnic_University_of_Catalonia" id="link-id0x7f46295754c8">La Universitat Politècnica de Catalunya</a> coordinates the project and brings strong academic expertise in graph databasing, also representing their Sparsity Technologies spinoff.  <a href="http://www.forth.gr/" id="link-id0x7f47805857e8">FORTH (Foundation for Research and Technology - Hellas) of Crete</a> contributes expertise in data integration and provenance. <a href="http://www.sti-innsbruck.at/" id="link-id0x7f47804aee08">STI Innsbruck</a> participates in community building and outreach.</p>

<p>The consortium has a second-to-none understanding of benchmarking and has sufficient time allotted to the task for producing world-class work, comparable to the TPC benchmarks.  This has to date never been realized in the RDF or graph space.</p>

<p>History demonstrates that whenever something that is sufficiently important starts getting systematically measured, there is an improvement in the metric.  The early days of the TPC saw a 40-fold increase in transaction processing speed.  TPC-H continues to be, after 18 years, well used as a basis for quantifying advances in analytics databases.</p>

<p>A serious initiative for well-thought-out benchmarks for guiding the emerging RDF and graph database markets is nothing short of a necessary precondition for the emergence of a serious market with several vendors offering mutually comparable products.</p>


<p>Benchmarks are only as good as their credibility and adoption.  For this reason, LDBC has been in touch with all graph and RDF vendors we could find, and has received a positive statement of intent from most, indicating that they would participate in an LDBC organization and contribute to shaping benchmarks.</p>


<p>There is further a Technical User Community, with its initial meeting this week, where present-day end users of RDF and graph databases will voice their wishes for benchmark development.  Thus benchmarks will be grounded in use cases contributed by real users.</p>


<p>With these elements in place we have every reason to expect relevant benchmarks with broad adoption, with all the benefits this entails.</p>
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2012-11-27#1721">
  <rss:title>LDBC Technical User Community Meeting</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2012-11-27T22:18:24Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">The LDBC Technical User Community (TUC) had its initial meeting in Barcelona last week. First we wish to thank the many end user organizations that were present. This clearly validates the project&#39;s mission and demonstrates that there is acute awareness of the need for better metrics in the field. In the following, I will summarize the requirements that were brought forth. Scale out - There was near unanimity among users that even if present workloads could be handled on single servers, a scale-out growth path was highly desirable. On the other hand, some applications were scale-out based from the get go. Even when not actually used, a scale-out capability is felt to be an insurance against future need. Making limits explicit - How far can this technology go? Benchmarks need to demonstrate at what scales the products being considered work best, and where they will grind to a halt. Also, the impact of scale-out on performance needs to be made clear. The cost of solutions at different scales must be made explicit. Many of these requirements will be met by simply following TPC practices. Now, vendors cannot be expected to publish numbers for cases where their products fail, but they do have incentives for publishing numbers on large data, and at least giving a price/performance point that exceeds most user needs. Fault tolerance and operational characteristics - Present day benchmarks (e.g., the TPC ones) hardly address operational aspects that most enterprise deployments will encounter. This was already stated by Michael Stonebraker at the first TPC performance evaluation workshop some years back at VLDB in Lyon. Users want to know the price/performance impact of making fault-tolerant systems and wish to have metrics for things like backup and bulk load under online conditions. 
A need to operate across multiple geographies was present in more than one use case, thus requiring a degree of asynchronous replication such as log shipping. Update-intensive workloads - Unlike one might think, RDF uses are not primarily load-once-plus-lookup. Freshness of data creates value, and databases, even if they are warehouses in character, need to be kept up to date much better than just by periodic reload. Online updates may be small, as for example refreshing news feeds or web crawls, where the unit of update is small but updates are many, but also replacing reference data sets of hundreds of millions of triples. The latter requirement exceeds what is practical in a single transaction. ACID was generally desired, with some interest also in eventual consistency. We did not get use cases with much repeatable read (e.g., updating account balances), but rather atomic and durable replacement of sets of statements. Inference - Class and property hierarchies were common, followed by use of transitivity. owl:sameAs was not in much use, being too dangerous, i.e., a single statement may potentially have huge effect and produce unpredictable sets of properties for instances, for which applications are not prepared. Beyond these, the wishes for inference, with use cases ranging from medicine to forensics, were outside of the OWL domain. These typically involved probability scores adding up the joint occurrence of complex criteria with some numeric computation (e.g. time intervals, geography, etc.). As materialization of forward closure is the prevalent mode of implementing inference in RDF, users wished to have a measure of its cost in space and time, especially under online-update loads. Text, XML, and Geospatial - There is no online application that does not have text search. In publishing, this is hardly ever provided by an RDF store, even if there is one in the mix. 
Even so, there is an understandable desire to consolidate systems, i.e., to not have an XML database for content and a separate RDF database for metadata. Also, many applications have a geospatial element. One wish was to combine XPATH/XQuery with SPARQL, and it was implied that query optimization should create good plans under these conditions. There was extensive discussion especially on benchmarking full-text. Such a benchmark would need to address the quality of relevance ranking. Doing new work in this space is clearly out of scope for LDBC, but an IR benchmark could be reused as an add-on to provide a quality score. The performance score would come from the LDBC side of the benchmark. Now, many of the applications of text (e.g., news) might not even sort on text match score, but rather by time. Also if the text search is applied to metadata like labels or URI strings, the quality of a match is a non-issue, as there is no document context. Data integration - Almost all applications had some element of data integration. Indeed, if one uses RDF in the first place, the motivation usually has to do with schema flexibility. Having a relational schema for everything is often seen to be too hard to maintain and to lead to too much development time before an initial version of an application or answer of a business question. Data integration is everywhere but stays elusive for benchmarking. Every time it is different and most vendors present do not offer products for this specific need. Many ideas were presented, including using SPARQL for entity resolution, and for checking consistency of an integration result. A central issue of benchmark design is having an understandable metric. People cannot make sense of more than a few figures. The TPC practice of throughput at scale and price per unit of throughput at scale is a successful example. However, it may be difficult to agree on relative weights of components if a metric is an aggregate of too many things. 
Also, if a benchmark has too many optional parts, metrics easily become too complicated. On the other hand, requiring too many features (e.g. XML, full text, geospatial) restricts the number of possible participants. To stimulate innovation, a benchmark needs to be difficult but restricted to a specific domain. TPC-H is a good example, favoring specialized systems built for analytics alone. To be a predictor of total cost and performance in a complex application, a benchmark must include much more functionality, and will favor general purpose systems that do many things but are not necessarily outstanding in any single aspect. After 1-1/2 days with users, the project team met to discuss actual benchmark task forces to be started. The conclusion was that work would initially proceed around two use cases: publishing, and social networks. The present use of RDF by the BBC and the Press Association provides the background scenario for the publishing benchmark, and the work carried out around the Social Intelligence Benchmark (SIB) in LOD2 will provide a starting point for the social network benchmark. Additionally, user scenarios from the DEX graph database user base will help shape the SN workload. A data integration task force needs more clarification, but work in this direction is in progress. In practice, driving progress needs well-focused benchmarks with special trick questions intended to stress specific aspects of a database engine. Providing an overall perspective on cost and online operations needs a broad mix of features to be covered. These needs will be reconciled by having many metrics inside a single use case, i.e., a social network data set can be used for transactional updates, for lookup queries, for graph analytics, and for TPC-H style business intelligence questions, especially if integrated with another more-relational dataset. Thus there will be a mix of metrics, from transactions to analytics, with single and multiuser workloads. 
Whether these are packaged as separate benchmarks, or as optional sections of one, remains to be seen.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>The <a href="http://www.ldbc.eu/" id="link-id0x7f4740087de8">LDBC</a> Technical User Community (TUC) had <a href="http://www.ldbc.eu/events/1st-ldbc-technical-user-community-meeting" id="link-id0x7f47800d4b48">its initial meeting</a> in Barcelona last week.</p>

<p>First, we wish to thank the many end-user organizations that were present. Their participation clearly validates the project&#39;s mission and demonstrates that there is acute awareness of the need for better metrics in the field. In the following, I will summarize the requirements that were brought forth.</p>

<ul>
 <li>
  <p>
    <b>Scale out</b> - There was near unanimity among users that even if present workloads could be handled on single servers, a scale-out growth path was highly desirable.  On the other hand, some applications were scale-out based from the get-go.  Even when not actually used, a scale-out capability is seen as insurance against future need.</p>
 </li>

<li>
  <p>
    <b>Making limits explicit</b> - How far can this technology go?  <a href="http://dbpedia.org/page/Benchmark_%28computing%29" id="link-id0x7f4748067b58">Benchmarks</a> need to demonstrate at what scales the products being considered work best, and where they will grind to a halt.  Also, the impact of scale-out on performance needs to be made clear.  The cost of solutions at different scales must be made explicit.</p>

<p>Many of these requirements will be met by simply following <a href="http://www.tpc.org/" id="link-id0x7f4751c0fa08">TPC</a> practices.  Now, vendors cannot be expected to publish numbers for cases where their products fail, but they do have incentives for publishing numbers on large data, and at least giving a price/performance point that exceeds most user needs.</p>
</li>

<li>
  <p>
    <b>Fault tolerance and operational characteristics</b> - Present-day benchmarks (e.g., the TPC ones) hardly address the operational aspects that most enterprise deployments will encounter.  This was already stated by <a href="http://dbpedia.org/page/Michael_Stonebraker" id="link-id0x7f47800d5a98">Michael Stonebraker</a> at the first TPC performance evaluation workshop some years back at <a href="http://www.vldb.org/archives/website/2009/" id="link-id0x7f4748066bd8">VLDB in Lyon</a>.  Users want to know the price/performance impact of making systems fault-tolerant and wish to have metrics for things like backup and bulk load under online conditions.  A need to operate across multiple geographies was present in more than one use case, thus requiring a degree of asynchronous replication such as log shipping.</p>
</li>

<li>
  <p>
    <b>Update-intensive workloads</b> - Contrary to what one might think, <a href="http://dbpedia.org/page/Resource_Description_Framework" id="link-id0x7f4782546868">RDF</a> uses are not primarily load-once-plus-lookup.  Freshness of data creates value, and databases, even if they are warehouses in character, need to be kept up to date much better than just by periodic reload. Online updates may be small, as when refreshing news feeds or web crawls, where the unit of update is small but updates are many. They may also be large, as when replacing reference data sets of hundreds of millions of triples.  The latter requirement exceeds what is practical in a single transaction.  <a href="http://dbpedia.org/page/ACID" id="link-id0x7f4740070db8">ACID</a> was generally desired, with some interest also in eventual consistency.   We did not get use cases needing much repeatable read (e.g., updating account balances), but rather atomic and durable replacement of sets of statements.</p>
</li>


<li>
  <p>
    <b>Inference</b> - Class and property hierarchies were common, followed by use of transitivity.  <code><a href="http://www.w3.org/TR/owl2-overview/" id="link-id0x7f47480c5b48">owl:sameAs</a></code> was not in much use, being <a href="http://events.linkeddata.org/ldow2010/papers/ldow2010_paper09.pdf" id="link-id0x7f474007a198">too dangerous</a>, i.e., a single statement may have a huge effect and produce unpredictable sets of properties for instances, for which applications are not prepared.  Beyond these, the wishes for inference, with use cases ranging from medicine to forensics, were outside of the OWL domain.  These typically involved probability scores adding up the joint occurrence of complex criteria with some numeric computation (e.g., time intervals, geography).</p>

<p>As materialization of forward closure is the prevalent mode of implementing inference in RDF, users wished to have a measure of its cost in space and time, especially under online-update loads.</p>
</li>


<li>
  <p>
    <b>Text, XML, and Geospatial</b> - There is no online application that does not have text search.  In publishing, this is hardly ever provided by an RDF store, even if there is one in the mix.  Even so, there is an understandable desire to consolidate systems, i.e., to not have an XML database for content and a separate RDF database for metadata.  Also, many applications have a geospatial element.  One wish was to combine XPath/XQuery with SPARQL, and it was implied that query optimization should create good plans under these conditions.</p>

<p>There was extensive discussion, especially on benchmarking full-text search. Such a benchmark would need to address the quality of relevance ranking.  Doing new work in this space is clearly out of scope for LDBC, but an IR benchmark could be reused as an add-on to provide a quality score.  The performance score would come from the LDBC side of the benchmark.  Now, many of the applications of text (e.g., news) might not even sort on text match score, but rather by time. Also, if the text search is applied to metadata like labels or URI strings, the quality of a match is a non-issue, as there is no document context.</p>
</li>

<li>
  <p>
    <b>Data integration</b> - Almost all applications had some element of data integration.  Indeed, if one uses RDF in the first place, the motivation usually has to do with schema flexibility.  Having a relational schema for everything is often seen as too hard to maintain and as requiring too much development time before an initial version of an application, or an answer to a business question, can be delivered.  Data integration is everywhere but stays elusive for benchmarking.  It is different every time, and most vendors present do not offer <a href="http://virtuoso.openlinksw.com/middleware/" id="link-id0x7f4728113608">products for this specific need</a>.   Many ideas were presented, including using SPARQL for entity resolution and for checking the consistency of an integration result.</p>
</li>
</ul>

<p>A central issue of benchmark design is having an understandable <a href="http://dbpedia.org/page/Software_metric" id="link-id0x7f47280fac58">metric</a>.  People cannot make sense of more than a few figures.  The TPC practice of throughput at scale and price per unit of throughput at scale is a successful example.  However, it may be difficult to agree on relative weights of components if a metric is an aggregate of too many things.  Also, if a benchmark has too many optional parts, metrics easily become too complicated.  On the other hand, requiring too many features (e.g. XML, full text, geospatial) restricts the number of possible participants.</p>

<p>To stimulate innovation, a benchmark needs to be difficult but restricted to a specific domain.  <a href="http://www.tpc.org/tpch/" id="link-id0x7f47400b3ad8">TPC-H</a> is a good example, favoring specialized systems built for analytics alone.  To be a predictor of total cost and performance in a complex application, a benchmark must include much more functionality, and will favor general purpose systems that do many things but are not necessarily outstanding in any single aspect.</p>


<p>After 1-1/2 days with users, the project team met to discuss actual benchmark task forces to be started.  The conclusion was that work would initially proceed around two use cases: publishing, and social networks.  The present use of RDF by the <a href="http://dbpedia.org/page/BBC" id="link-id0x7f478009ee38">BBC</a> and the <a href="http://dbpedia.org/page/Press_Association" id="link-id0x7f478014c998">Press Association</a> provides the background scenario for the publishing benchmark, and the work carried out around the <a href="http://sourceforge.net/projects/sibenchmark/" id="link-id0x7f47480ae1d8">Social Intelligence Benchmark (SIB)</a> in <a href="http://lod2.eu/" id="link-id0x7f4780f71458">LOD2</a> will provide a starting point for the social network benchmark. Additionally, user scenarios from the <a href="http://dbpedia.org/page/DEX_%28Graph_database%29" id="link-id0x7f47400704e8">DEX graph database</a> user base will help shape the SN workload.</p>

<p>A data integration task force needs more clarification, but work in this direction is in progress.</p>


<p>In practice, driving progress needs well-focused benchmarks with special trick questions intended to stress specific aspects of a database engine.   Providing an overall perspective on cost and online operations needs a broad mix of features to be covered.</p>

<p>These needs will be reconciled by having many metrics inside a single use case, e.g., a social network data set can be used for transactional updates, for lookup queries, for graph analytics, and for TPC-H style business intelligence questions, especially if integrated with another more-relational dataset.  Thus there will be a mix of metrics, from transactions to analytics, with single and multiuser workloads. Whether these are packaged as separate benchmarks, or as optional sections of one, remains to be seen.</p>
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2012-08-16#1719">
  <rss:title>Developer Recruitment Exercise</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2012-08-16T19:28:03Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">The specification of the exercise referred to in the previous post may be found below. Questions on the exercise can be sent to the email specified in the previous post. I may schedule a phone call to answer questions based on the initial email contact. We seek to have all applicants complete the exercise before October 1. General The exercise consists of implementing a part of the TPC-C workload in memory, in C or C++. TPC-C is the long-time industry standard benchmark for transaction processing performance. We use this as a starting point for an exercise for assessing developer skill level in writing heavily multithreaded, performance-critical code. The application performs a series of transactions against an in-memory database, encountering lock contention and occasional deadlocks. The application needs to provide atomicity, consistency, and isolation for transactions. The task consists of writing the low-level data structures for storing the memory-resident database and for managing concurrency, including lock queueing, deadlock detection, and commit/rollback. The solutions are evaluated based on their actual measured multithreaded performance on commodity servers, e.g., 8- or 12-cores of Intel Xeon. OpenLink provides the code for data generation and driving the test. This is part of the TPC-C kit in Virtuoso Open Source. The task is to replace the SQL API calls with equivalent in-process function calls against the in-memory database developed as part of the exercise. Rules We are aware that the best solution to the problem may be running transactions single-threaded against in-memory hash tables without any concurrency control. The application data may be partitioned so that a single transaction can be in most cases assigned to a partition, which it will get for itself for the few microseconds it takes to do its job. For this exercise, this solution is explicitly ruled out. 
The application must demonstrate shared access to data, with a transaction holding multiple concurrent locks and being liable to deadlock. TPC-C can be written so as to avoid deadlocks by always locking in a certain order. This is also expressly prohibited; in specific, the stock rows of a new order transaction must be locked in the order they are specified in the invocation. In application terms this makes no sense, but for purposes of the exercise this will serve as a natural source of deadlocks. Parameters The application needs to offer an interactive or scripted interface (command line is OK) which provides the following operations: Clear and initialize a database of n warehouses. Run n threads, each doing m new order transactions. Each thread has a home warehouse and occasionally accesses other warehouse&#39;s data. This reports the real time elapsed and the number of retries arising from deadlocks. Check the consistency between the stock, orders, and order_line data structures. Report system status such as clocks spent waiting for specific mutexes. This is supplied as part of the OpenLink library used by the data generator. Data Structures The transactions are written as C functions. The data is represented as C structs, and tree indices or hash tables are used for value-based access to the structures by key. The application has no persistent storage. The structures reference each other by the key values as in the database, so no direct pointers. The key values are to be translated into pointers with a hash table or other index-like structure. The application must be thread-safe, and transactions must be able to roll back. Transactions will sometimes wait for each other in updating shared resources such as stock or district or warehouse balances. The application must be written so as to implement fine-grained locking, and each transaction must be able to hold multiple locks. The application must be able to detect deadlocks. 
For deadlock recovery, it is acceptable to abort the transaction that detects the deadlock. C++ template libraries may be used but one must pay attention to their efficiency. The new order transaction is the only required transaction. All numbers can be represented as integers. This holds equally for key columns as for monetary amounts. All index structures (e.g., hash tables) in the application must be thread safe, so that an insert would be safe with concurrent access or concurrent inserts. This holds also for index structures for tables which do not get inserts in the test (e.g. item, customer, stock, etc.). A sequence object must not be used for assigning new values to the O_ID column of ORDERS. These values must come from the D_NEXT_O_ID column of the DISTRICT table. If a new order transaction rolls back, its update of D_NEXT_O_ID is also rolled back. This causes O_ID values to always be consecutive within a district. TPC-C Functionality The application must implement the TPC-C new order transaction in full. This must not avoid deadlocks by ordering locking on stock rows. See the rules section. The transaction must have the semantics specified in TPC-C, except for durability. Supporting Files The test driver calling the transaction procedures is in tpccodbc.c. This can be reused so as to call the transaction procedure in process instead of the ODBC exec. The user interface may be a command line menu with run options for different numbers of transactions with different thread counts and an option for integrity check. The integrity check consists of verifying s_cnt_order against the orders and checking that max (O_ID) and D_NEXT_O_ID match within each district. Running the application should give different statistics such as CPU%, cumulative time spent waiting for locks, etc. The rdtsc instruction can be used for getting clock counts for timing. 
Points to Note This section summarizes some of the design patterns and coding tricks we expect to see in a solution to the exercise. These may seem self-evident to some, but experience indicates that this is not universally so. The TPC-C transaction profile for new order specifies a semantics for the operation. The order of locking is left to the implementation as long as the semantics are in effect. The application will be tested with many clients on the same warehouse, running as fast as they can. So lock contention is expected. Therefore, the transaction should be written so as to acquire the locks with the greatest contention as late as possible. No locks need be acquired for the item table since none of the transactions will update it. For implementing locks, using a mutex to serialize access to application resources is not enough. Many locks will be acquired by each transaction, in an unpredictable order. Unless explicit queueing for locks is implemented with deadlock detection, the application will not work. If waiting for a mutex causes the operating system to stop a thread, even when there are cores free, the latency is multiple microseconds, even if the mutex is released by its owner on the next cycle after the waiting thread is suspended. This will destroy any benefit from parallelism unless one is very careful. Programmers do not seem to instinctively know this. Therefore any structure to which access must be serialized (e.g. hash tables, locks, etc.) needs to be protected by a mutex but must be partitioned so that there are tens or hundreds of mutexes depending on which section of the structure one is accessing. Submissions that protect a hash table or other index-like structure for a whole application table with a single mutex or rw lock will be discarded off the bat. Even while using many mutexes, one must hold them for a minimum of time. When accessing a hash table, do the invariant parts first; acquire the mutex after that. 
For example, if you calculate the hash number after acquiring the mutex for the hash table, the submission will be rejected. The TPC-C application has some local and some scattered access. Orders are local, and stock and item lines are scattered. When doing scattered memory accesses, the program should be written so that the CPU will, from a single thread, have multiple concurrent cache misses in flight at all times. So, when accessing 10 stock lines, calculate the hash numbers first; then access the memory, deferring any branches based on the accessed values. In this way, out of order execution will miss the CPU cache for many independent addresses in parallel. One can use the gcc __builtin_prefetch primitive, or simply write the program so as to have mutually data-independent memory accesses in close proximity. For detecting deadlocks, a global transaction wait graph may have to be maintained. This will need to be maintained in a serialized manner. If many threads access this, the accesses must be serialized on a global mutex. This may be very bad if the deadlock detection takes a long time. Alternately, the wait graph may be maintained on another thread. The thread will get notices of waits and transacts from worker threads with some delay. Having spotted a cycle, it may kill one or another party. This will require some inter-thread communication. The submission may address this matter in any number of ways. However, just acquiring a lock without wait must not involve getting a global mutex. Going to wait will have to do so, were it only for queueing a notice to a monitor thread. Using a socket-to-self might appear to circumvent this, but the communication stack will have mutexes inside so this is no better. Evaluation Criteria The exercise will be evaluated based on the run time performance, especially multicore scalability of the result. Extra points are not given for implementing interfaces or for being object oriented. 
Interfaces, templates, and objects are not forbidden as such, but their cost must not exceed the difference between getting an address from a virtual table and calling a function directly. The locking implementation must be correct. It can be limited to exclusive locks and need not support isolation other than repeatable read. Running the application must demonstrate deadlocks and working recovery from these. Code and Libraries To Be Used The TPC-C data generator and test driver are in the Virtuoso Open Source distribution, in the files binsrc/tests/tpcc*.c and files included from these. You can make the exercise in the same directory and just alter the files or make script. The application is standalone and has no other relation to the Virtuoso code. The libsrc/Thread threading wrappers may be used. If not using these, make a wrapper similar to mutex_enter when MTX_METER is defined so that it counts the waits and clocks spent during wait. Also have a report like that in mutex_stat() for the mutex wait frequency and duration.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>The specification of the exercise referred to in <a href="http://www.openlinksw.com/weblog/oerling/?id=1717" id="link-id0x1bc4d790">the previous post</a> may be found below.</p>

<p>Questions on the exercise can be sent to <a href="mailto:hwilliams@openlinksw.com?subject=2012-08%20Virtuoso%20Developer%20Exercise" id="link-id0x1cb00cc0">the email specified in the previous post</a>.  I may schedule a phone call to answer questions based on the initial email contact.</p>

<p>We seek to have all applicants complete the exercise before October 1.</p>

<h2>General</h2>

<p>The exercise consists of implementing a part of the <a href="http://dbpedia.org/resource/TPC-C" id="link-id0x1c158760">TPC-C</a> workload in memory, in <a href="http://dbpedia.org/resource/C_(programming_language)" id="link-id0x1b38bec0"><code>C</code></a> or <a href="http://dbpedia.org/page/C++" id="link-id0x1119e720"><code>C++</code></a>.  TPC-C is the long-time industry standard benchmark for transaction processing performance.  We use this as a starting point for an exercise for assessing developer skill level in writing heavily multithreaded, performance-critical code.</p>

<p>The application performs a series of transactions against an in-memory database, encountering lock contention and occasional deadlocks.  The application needs to provide atomicity, consistency, and isolation for transactions.  The task consists of writing the low-level data structures for storing the memory-resident database and for managing concurrency, including lock queueing, deadlock detection, and commit/rollback.  The solutions are evaluated based on their actual measured multithreaded performance on commodity servers, e.g., 8 or 12 cores of Intel Xeon.</p>

<p>OpenLink provides the code for data generation and driving the test. This is part of the TPC-C kit in <a href="http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VOSIndex" id="link-id0x1be9fab8">Virtuoso Open Source</a>.  The task is to replace the SQL API calls with equivalent in-process function calls against the in-memory database developed as part of the exercise.</p>

<h2>Rules</h2>

<p>We are aware that the best solution to the problem may be running transactions single-threaded against in-memory hash tables without any concurrency control.  The application data may be partitioned so that a single transaction can be in most cases assigned to a partition, which it will get for itself for the few microseconds it takes to do its job. <b>For this exercise, this solution is explicitly ruled out.</b>  The application must demonstrate shared access to data, with a transaction holding multiple concurrent locks and being liable to deadlock.</p>

<p>TPC-C can be written so as to avoid deadlocks by always locking in a certain order.  <b>This is also expressly prohibited;</b> specifically, the stock rows of a new order transaction must be locked in the order they are specified in the invocation.  In application terms this makes no sense, but for purposes of the exercise this will serve as a natural source of deadlocks.</p>

<h2>Parameters</h2>

<p>The application needs to offer an interactive or scripted interface (command line is OK) which provides the following operations:</p>

<ul>
 <li>
  <p>Clear and initialize a database of <i>n</i> warehouses.</p>
 </li>

<li>
  <p>Run <i>n</i> threads, each doing <i>m</i> new order transactions.  Each thread has a home warehouse and occasionally accesses other warehouses&#39; data. This operation reports the real time elapsed and the number of retries arising from deadlocks. </p>
</li>

<li>
  <p>Check the consistency between the <code>stock</code>, <code>orders</code>, and <code>order_line</code> data structures.</p>
</li>

<li>
  <p>Report system status such as clocks spent waiting for specific mutexes.  This is supplied as part of the OpenLink library used by the data generator.</p>
</li>
</ul>

<h2>Data Structures</h2>

<p>The transactions are written as <code>C</code> functions.  The data is represented as <code>C</code> structs, and tree indices or hash tables are used for value-based access to the structures by key.  The application has no persistent storage.  The structures reference each other by key values, as in the database, so there are no direct pointers between them.  The key values are to be translated into pointers with a hash table or other index-like structure.</p>
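<p>As a rough illustration of key-based access, the following sketch stores stock rows as <code>C</code> structs and translates a (warehouse, item) key into a row pointer through a chained hash table. The names <code>stock_row_t</code>, <code>stock_index_t</code>, <code>stock_insert</code>, and <code>stock_lookup</code>, the bucket count, and the hash function are all assumptions made for the example, and the sketch is deliberately single-threaded; it is not part of the supplied kit.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stock row. It names its warehouse and item by key value,
   never by pointer, as in the database schema. */
typedef struct stock_row {
    int32_t s_w_id;          /* warehouse key */
    int32_t s_i_id;          /* item key */
    int32_t s_quantity;
    struct stock_row *next;  /* next row in the same hash bucket */
} stock_row_t;

#define STOCK_BUCKETS 1024u  /* power of two, so a mask replaces modulo */

typedef struct {
    stock_row_t *bucket[STOCK_BUCKETS];
} stock_index_t;

/* Mix the two key parts into a bucket number. */
static uint32_t stock_hash(int32_t w_id, int32_t i_id)
{
    uint32_t h = ((uint32_t)w_id * 2654435761u) ^ ((uint32_t)i_id * 40503u);
    return h & (STOCK_BUCKETS - 1u);
}

/* Link a caller-allocated row into the index (no locking in this sketch). */
static void stock_insert(stock_index_t *ix, stock_row_t *row)
{
    uint32_t b;
    assert(ix != NULL && row != NULL);
    b = stock_hash(row->s_w_id, row->s_i_id);
    row->next = ix->bucket[b];
    ix->bucket[b] = row;
}

/* Translate a (warehouse, item) key into a row pointer, or NULL. */
static stock_row_t *stock_lookup(const stock_index_t *ix,
                                 int32_t w_id, int32_t i_id)
{
    stock_row_t *r = ix->bucket[stock_hash(w_id, i_id)];
    while (r != NULL && (r->s_w_id != w_id || r->s_i_id != i_id))
        r = r->next;
    return r;
}
```

<p>An order line would then hold only the key pair and call <code>stock_lookup</code> when it needs the row itself.</p>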

<p>The application must be thread-safe, and transactions must be able to roll back.  Transactions will sometimes wait for each other in updating shared resources such as stock or district or warehouse balances.  The application must be written so as to implement fine-grained locking, and each transaction must be able to hold multiple locks. The application must be able to detect deadlocks.  For deadlock recovery, it is acceptable to abort the transaction that detects the deadlock.</p>
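<p>One possible shape for the deadlock check is sketched below. It assumes exclusive locks only, so that a waiting transaction is blocked on exactly one lock holder and the wait graph reduces to one outgoing edge per transaction. The names <code>would_deadlock</code>, <code>MAX_TXN</code>, and <code>NO_WAIT</code> are hypothetical, and the sketch leaves out the serialization that a shared wait graph would need in a multithreaded program.</p>

```c
#include <assert.h>

#define MAX_TXN 64      /* assumed upper bound on concurrent transactions */
#define NO_WAIT (-1)    /* marks a transaction that is not blocked */

/* waits_for[t] is the transaction that t is blocked on, or NO_WAIT.
   Returns 1 if letting `waiter` wait for `holder` would close a cycle,
   in which case the waiter can abort itself instead of waiting. */
static int would_deadlock(const int waits_for[MAX_TXN], int waiter, int holder)
{
    int t = holder;
    int steps = 0;

    assert(waiter >= 0 && waiter < MAX_TXN);
    assert(holder >= 0 && holder < MAX_TXN);

    /* Follow the chain of waits starting at the lock holder. The step
       limit guards against a pre-existing cycle that excludes `waiter`. */
    while (t != NO_WAIT && steps++ < MAX_TXN) {
        if (t == waiter)
            return 1;
        t = waits_for[t];
    }
    return 0;
}
```

<p>A transaction that gets 1 back would roll back and retry, which matches the rule that the detecting transaction may be the one aborted.</p>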

<p>
<code>C++</code> template libraries may be used but one must pay attention to their efficiency.</p>

<p>The new order transaction is the only required transaction.</p>

<p>All numbers can be represented as integers.  This holds for key columns and monetary amounts alike.</p>

<p>All index structures (e.g., hash tables) in the application must be thread safe, so that an insert would be safe with concurrent access or concurrent inserts. This holds also for index structures for tables which do not get inserts in the test (e.g. item, customer, stock, etc.).</p>

<p>A sequence object must not be used for assigning new values to the <code>O_ID</code> column of <code>ORDERS</code>.  These values must come from the <code>D_NEXT_O_ID</code> column of the <code>DISTRICT</code> table.  If a new order transaction rolls back, its update of <code>D_NEXT_O_ID</code> is also rolled back.  This causes <code>O_ID</code> values to always be consecutive within a district.</p>
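
<p>One way to get this behavior, sketched here under the assumption of a small per-transaction undo log (every name below is hypothetical), is to record the old value of <code>D_NEXT_O_ID</code> before overwriting it and to restore it on rollback.  The caller is assumed to hold the district&#39;s exclusive lock until commit or rollback.</p>

```c
#include <stdint.h>

typedef struct district_s {
  int32_t d_w_id, d_id;
  int32_t d_next_o_id;
} district_t;

#define MAX_UNDO 64 /* illustrative bound on writes per transaction */

typedef struct undo_rec_s {
  int32_t *addr;      /* location that was overwritten */
  int32_t old_value;  /* value to restore on rollback */
} undo_rec_t;

typedef struct txn_s {
  undo_rec_t undo[MAX_UNDO];
  int n_undo;
} txn_t;

/* Log the old value, then write.  Returns 0 on success, -1 if the log is full. */
int txn_write_i32(txn_t *t, int32_t *addr, int32_t new_value) {
  if (t->n_undo == MAX_UNDO)
    return -1;
  t->undo[t->n_undo].addr = addr;
  t->undo[t->n_undo].old_value = *addr;
  t->n_undo++;
  *addr = new_value;
  return 0;
}

/* Undo the writes in reverse order. */
void txn_rollback(txn_t *t) {
  while (t->n_undo > 0) {
    t->n_undo--;
    *t->undo[t->n_undo].addr = t->undo[t->n_undo].old_value;
  }
}

void txn_commit(txn_t *t) { t->n_undo = 0; }

/* Take the next order id for the district; the increment is undone on rollback.
   Returns -1 if the undo log is full. */
int32_t district_take_o_id(txn_t *t, district_t *d) {
  int32_t o_id = d->d_next_o_id;
  if (txn_write_i32(t, &d->d_next_o_id, o_id + 1) != 0)
    return -1;
  return o_id;
}
```

<p>Because the lock is held until the transaction ends, a rolled-back transaction hands the same <code>O_ID</code> to the next one, which keeps the values consecutive.</p>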

<h2>TPC-C Functionality</h2>

<p>The application must implement the TPC-C new order transaction in full.  The implementation must not avoid deadlocks by acquiring locks on stock rows in a fixed order.  See the rules section.</p>

<p>The transaction must have the semantics specified in TPC-C, except for durability.</p>


<h2>Supporting Files</h2>

<p>The test driver calling the transaction procedures is in <code>tpccodbc.c</code>.  This can be reused so as to call the transaction procedure in process instead of the ODBC exec.</p>

<p>The user interface may be a command line menu with run options for different numbers of transactions with different thread counts and an option for integrity check.</p>

<p>The integrity check consists of verifying <code>s_cnt_order</code> against the orders and checking that <code>max (O_ID)</code> and <code>D_NEXT_O_ID</code> match within each district.</p>
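
<p>The district half of this check could look like the following sketch.  It assumes that "match" means <code>D_NEXT_O_ID = max (O_ID) + 1</code>, as in the TPC-C consistency conditions, and it scans a flat array of orders for simplicity; the struct layouts and the function name are hypothetical.</p>

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { int32_t o_w_id, o_d_id, o_id; } order_t;
typedef struct { int32_t d_w_id, d_id, d_next_o_id; } district_t;

/* Returns 1 if the district's next order id is one past the highest
   order id found for that district, 0 otherwise.
   Assumption: order ids start at 1, so an empty district has d_next_o_id == 1. */
int district_o_id_consistent(const district_t *d,
                             const order_t *orders, size_t n_orders) {
  int32_t max_o_id = 0;
  for (size_t i = 0; i < n_orders; i++) {
    if (orders[i].o_w_id != d->d_w_id || orders[i].o_d_id != d->d_id)
      continue; /* order belongs to another district */
    if (orders[i].o_id > max_o_id)
      max_o_id = orders[i].o_id;
  }
  return d->d_next_o_id == max_o_id + 1;
}
```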

<p>Running the application should give different statistics such as CPU%, cumulative time spent waiting for locks, etc.  The <code>rdtsc</code> instruction can be used for getting clock counts for timing.</p>

<h2>Points to Note</h2>

<p>This section summarizes some of the design patterns and coding tricks we expect to see in a solution to the exercise.  These may seem self-evident to some, but experience indicates that this is not universally so.</p>

<ul>
 <li>
  <p>The TPC-C transaction profile for new order specifies a semantics for the operation.  The order of locking is left to the implementation as long as the semantics are in effect.  The application will be tested with many clients on the same warehouse, running as fast as they can.  So lock contention is expected. Therefore, the transaction should be written so as to acquire the locks with the greatest contention as late as possible.  No locks need be acquired for the item table since none of the transactions will update it.</p>
 </li>

<li>
  <p>For implementing locks, using a mutex to serialize access to application resources is not enough.  Many locks will be acquired by each transaction, in an unpredictable order.  Unless explicit queueing for locks is implemented with deadlock detection, the application will not work.</p>
</li>

<li>
  <p>If waiting for a mutex causes the operating system to stop a thread, even when there are cores free, the latency is multiple microseconds, even if the mutex is released by its owner on the next cycle after the waiting thread is suspended.  This will destroy any benefit from parallelism unless one is very careful.  Programmers do not seem to instinctively know this.</p>
</li>
</ul>

<p>Therefore any structure to which access must be serialized (e.g. hash tables, locks, etc.) needs to be protected by a mutex but must be partitioned so that there are tens or hundreds of mutexes depending on which section of the structure one is accessing.</p>

<p>Submissions that protect a hash table or other index-like structure for a whole application table with a single mutex or <code>rw</code> lock will be discarded off the bat.</p>

<p>Even while using many mutexes, one must hold them for a minimum of time. When accessing a hash table, do the invariant parts first; acquire the mutex after that.  For example, if you calculate the hash number after acquiring the mutex for the hash table, the submission will be rejected.</p>
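
<p>To make the last three paragraphs concrete, here is a minimal sketch of a hash table whose buckets are partitioned over many mutexes, with the hash number computed before any mutex is taken.  The names, the bucket and stripe counts, and the hash function are assumptions chosen for illustration.</p>

```c
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

#define N_BUCKETS 4096 /* power of two */
#define N_STRIPES 64   /* power of two dividing N_BUCKETS; one mutex per stripe */

typedef struct entry_s {
  uint64_t key;
  void *value;
  struct entry_s *next;
} entry_t;

typedef struct {
  entry_t *buckets[N_BUCKETS];
  pthread_mutex_t stripes[N_STRIPES];
} striped_hash_t;

static inline uint32_t hash_u64(uint64_t k) {
  k ^= k >> 33;
  k *= 0xff51afd7ed558ccdULL;
  k ^= k >> 33;
  return (uint32_t)k;
}

void sh_init(striped_hash_t *h) {
  for (int i = 0; i < N_BUCKETS; i++)
    h->buckets[i] = NULL;
  for (int i = 0; i < N_STRIPES; i++)
    pthread_mutex_init(&h->stripes[i], NULL);
}

/* The entry is allocated and filled in by the caller before the call. */
void sh_insert(striped_hash_t *h, entry_t *e) {
  /* Invariant work first: no mutex is held while hashing. */
  uint32_t hv = hash_u64(e->key);
  entry_t **bucket = &h->buckets[hv & (N_BUCKETS - 1)];
  pthread_mutex_t *m = &h->stripes[hv & (N_STRIPES - 1)];

  pthread_mutex_lock(m);
  e->next = *bucket;
  *bucket = e;
  pthread_mutex_unlock(m);
}

void *sh_lookup(striped_hash_t *h, uint64_t key) {
  uint32_t hv = hash_u64(key);
  entry_t **bucket = &h->buckets[hv & (N_BUCKETS - 1)];
  pthread_mutex_t *m = &h->stripes[hv & (N_STRIPES - 1)];
  void *value = NULL;

  pthread_mutex_lock(m);
  for (entry_t *e = *bucket; e; e = e->next)
    if (e->key == key) {
      value = e->value;
      break;
    }
  pthread_mutex_unlock(m);
  return value;
}
```

<p>Two keys contend only if they fall in the same stripe.  The mutex protects the index structure alone; the row reached through the returned pointer is protected by the transaction&#39;s own lock, which this sketch does not show.</p>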

<p>The TPC-C application has some local and some scattered access. Orders are local, and stock and item lines are scattered.  When doing scattered memory accesses, the program should be written so that the CPU will, from a single thread, have multiple concurrent cache misses in flight at all times.  So, when accessing 10 stock lines, calculate the hash numbers first; then access the memory, deferring any branches based on the accessed values.  In this way, out of order execution will miss the CPU cache for many independent addresses in parallel. One can use the gcc <code>__builtin_prefetch</code> primitive, or simply write the program so as to have mutually data-independent memory accesses in close proximity.</p>
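
<p>A sketch of this access pattern follows, assuming a chained hash index over stock rows and a batch of at most 16 keys.  All names are hypothetical and locking is omitted to keep the memory access pattern visible.  The three passes keep the loads of different rows independent of each other, so several cache misses can be outstanding at once.</p>

```c
#include <stdint.h>
#include <stddef.h>

#define N_BUCKETS 4096 /* power of two */
#define MAX_BATCH 16

typedef struct stock_s {
  int32_t s_w_id, s_i_id, s_quantity;
  struct stock_s *next;
} stock_t;

static stock_t *stock_index[N_BUCKETS];

static inline uint32_t stock_hash(int32_t w_id, int32_t i_id) {
  uint32_t h = ((uint32_t)w_id * 2654435761u) ^ ((uint32_t)i_id * 40503u);
  return h & (N_BUCKETS - 1);
}

void stock_insert(stock_t *s) {
  uint32_t b = stock_hash(s->s_w_id, s->s_i_id);
  s->next = stock_index[b];
  stock_index[b] = s;
}

/* Look up n stock rows at once; out[i] is the row or NULL.
   Returns 0 on success, -1 if n exceeds MAX_BATCH. */
int stock_lookup_batch(const int32_t *w_ids, const int32_t *i_ids,
                       size_t n, stock_t **out) {
  uint32_t b[MAX_BATCH];
  if (n > MAX_BATCH)
    return -1;

  /* Pass 1: all hash numbers first, prefetching each bucket head. */
  for (size_t i = 0; i < n; i++) {
    b[i] = stock_hash(w_ids[i], i_ids[i]);
    __builtin_prefetch(&stock_index[b[i]], 0, 3);
  }
  /* Pass 2: load the bucket heads (mutually independent loads) and
     prefetch the first row of each chain.  Prefetching NULL is harmless. */
  for (size_t i = 0; i < n; i++) {
    out[i] = stock_index[b[i]];
    __builtin_prefetch(out[i], 0, 3);
  }
  /* Pass 3: only now branch on the fetched values. */
  for (size_t i = 0; i < n; i++) {
    stock_t *s = out[i];
    while (s && (s->s_w_id != w_ids[i] || s->s_i_id != i_ids[i]))
      s = s->next;
    out[i] = s;
  }
  return 0;
}
```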

<p>For detecting deadlocks, a global transaction wait graph may have to be maintained.  This will need to be maintained in a serialized manner.  If many threads access this, the accesses must be serialized on a global mutex.  This may be very bad if the deadlock detection takes a long time.  Alternatively, the wait graph may be maintained on another thread.  The thread will get notices of waits and transacts from worker threads with some delay.  Having spotted a cycle, it may kill one or another party.  This will require some inter-thread communication.  The submission may address this matter in any number of ways.</p>

<p>However, just acquiring a lock without wait must not involve getting a global mutex.  Going to wait will have to do so, if only to queue a notice to a monitor thread.  Using a socket-to-self might appear to circumvent this, but the communication stack will have mutexes inside so this is no better.</p>
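
<p>With exclusive locks only, a waiting transaction waits for exactly one lock owner, so the wait graph can be as simple as one outgoing edge per transaction.  The sketch below makes that assumption, uses small integer transaction ids, and is entirely hypothetical in its names.  It is meant to be called only on the slow path, either under the global wait-graph mutex or from the monitor thread.</p>

```c
#include <stddef.h>

#define MAX_TXN 256
#define NO_TXN (-1)

typedef struct {
  int waits_for[MAX_TXN]; /* waits_for[t] = owner that t waits on, or NO_TXN */
} wait_graph_t;

void wg_init(wait_graph_t *g) {
  for (int i = 0; i < MAX_TXN; i++)
    g->waits_for[i] = NO_TXN;
}

/* Would the edge waiter -> owner close a cycle?  Follow the chain of
   waits starting at the owner; reaching the waiter means deadlock.
   The step bound is a safety net, since the graph is kept acyclic. */
int wg_would_deadlock(const wait_graph_t *g, int waiter, int owner) {
  int t = owner;
  for (int steps = 0; t != NO_TXN && steps < MAX_TXN; steps++) {
    if (t == waiter)
      return 1;
    t = g->waits_for[t];
  }
  return 0;
}

/* Register a wait.  Returns 0 if the caller may go to wait, -1 if waiting
   would deadlock, in which case the caller aborts its own transaction. */
int wg_add_wait(wait_graph_t *g, int waiter, int owner) {
  if (wg_would_deadlock(g, waiter, owner))
    return -1;
  g->waits_for[waiter] = owner;
  return 0;
}

/* Call when the waiter gets its lock or is rolled back. */
void wg_clear_wait(wait_graph_t *g, int waiter) {
  g->waits_for[waiter] = NO_TXN;
}
```

<p>Aborting the transaction that detects the cycle is the simplest recovery and is allowed by the rules above.</p>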

<h2>Evaluation Criteria</h2>

<p>The exercise will be evaluated based on the run time performance, especially multicore scalability of the result.</p>

<p>Extra points are not given for implementing interfaces or for being object oriented.  Interfaces, templates, and objects are not forbidden as such, but their cost must not exceed the difference between getting an address from a virtual table and calling a function directly.</p>

<p>The locking implementation must be correct.  It can be limited to exclusive locks and need not support isolation other than <code>repeatable read</code>. Running the application must demonstrate deadlocks and working recovery from these.</p>

<h2>Code and Libraries To Be Used</h2>

<p>The TPC-C data generator and test driver are in the <a href="http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VOSGIT" id="link-id0x1be07088">Virtuoso Open Source distribution</a>, in the files <code>binsrc/tests/tpcc*.c</code> and files included from these.  You can do the exercise in the same directory and just alter the files or the make script.  The application is standalone and has no other relation to the Virtuoso code.  The <code>libsrc/Thread</code> threading wrappers may be used.  If not using these, make a wrapper similar to <code>mutex_enter</code> when <code>MTX_METER</code> is defined so that it counts the waits and clocks spent during wait.  Also have a report like that in <code>mutex_stat()</code> for the mutex wait frequency and duration.</p>
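
<p>If you write your own wrapper, it could look roughly like the sketch below.  This is not the Virtuoso code: the <code>mm_*</code> names and the struct are hypothetical, and the approach of trying the mutex first and timing only the blocking path is an assumption about what such a wrapper needs to do.  On x86 it reads the time stamp counter through the <code>__rdtsc()</code> intrinsic and elsewhere falls back to <code>clock_gettime</code>.</p>

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
static inline uint64_t clock_ticks(void) { return __rdtsc(); }
#else
#include <time.h>
/* Fallback for other CPUs: nanoseconds instead of clocks. */
static inline uint64_t clock_ticks(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#endif

typedef struct {
  pthread_mutex_t mtx;
  const char *name;
  uint64_t enters;      /* total acquisitions */
  uint64_t waits;       /* acquisitions that had to block */
  uint64_t wait_clocks; /* ticks spent blocked */
} metered_mutex_t;

void mm_init(metered_mutex_t *m, const char *name) {
  pthread_mutex_init(&m->mtx, NULL);
  m->name = name;
  m->enters = m->waits = m->wait_clocks = 0;
}

void mm_enter(metered_mutex_t *m) {
  /* Fast path: no timing cost when the mutex is free. */
  if (pthread_mutex_trylock(&m->mtx) != 0) {
    uint64_t start = clock_ticks();
    pthread_mutex_lock(&m->mtx);
    m->wait_clocks += clock_ticks() - start;
    m->waits++;
  }
  m->enters++; /* counters are updated while holding the mutex */
}

void mm_leave(metered_mutex_t *m) { pthread_mutex_unlock(&m->mtx); }

/* One report line per mutex: wait frequency and duration. */
void mm_report(FILE *out, const metered_mutex_t *m) {
  fprintf(out, "%s: %llu enters, %llu waits, %llu clocks waiting\n",
          m->name, (unsigned long long)m->enters,
          (unsigned long long)m->waits,
          (unsigned long long)m->wait_clocks);
}
```

<p>The report is best read after the worker threads have finished, since the counters are only consistent under the mutex.</p>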
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2012-08-07#1717">
  <rss:title>Developer Opportunities at OpenLink Software</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2012-08-07T17:21:52Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">If it is advanced database technology, you will get to do it with us. We are looking for exceptional talent to implement some of the hardest stuff in the industry. This ranges from new approaches to query optimization; to parallel execution (both scale up and scale out); to elastic cloud deployments and self-managing, self-tuning, fault-tolerant databases. We are most familiar to the RDF world, but also have full SQL support, and the present work will serve both use cases equally. We are best known in the realms of high-performance database connectivity middleware and massively-scalable Linked-Data-oriented graph-model DBMS technology. We have the basics -- SQL and SPARQL, column store, vectored execution, cost based optimization, parallel execution (local and cluster), and so forth. In short, we have everything you would expect from a DBMS. We do transactions as well as analytics, but the greater challenges at present are on the analytics side. You will be working with my team covering: Adaptive query optimization -- interleaving execution and optimization, so as to always make the correct plan choices based on actual data characteristics Self-managing cloud deployments for elastic big data -- clusters that can grow themselves and redistribute load, recover from failures, etc. Developing and analyzing new benchmarks for RDF and graph databases Embedding complex geospatial reasoning inside the database engine. We have the basic R-tree and the OGC geometry data types; now we need to go beyond this Every type of SQL optimizer and execution engine trick that serves to optimize for TPC-H and DS. What do I mean by really good? It boils down to being a smart and fast programmer. We have over the years talked to people, including many who have worked on DBMS programming, and found that they actually know next to nothing of database science. For example, they might not know what a hash join is. 
Or they might not know that interprocess latency is in the tens of microseconds even within one box, and that in that time one can do tens of index lookups. Or they might not know that blocking on a mutex kills. If you do core database work, we want you to know how many CPU cache misses you will have in flight at any point of the algorithm, and how many clocks will be spent waiting for them at what points. Same for distributed execution: The only way a cluster can perform is having max messages with max payload per message in flight at all times. These are things that can be learned. So I do not necessarily expect that you have in-depth experience of these, especially since most developer jobs are concerned with something else. You may have to unlearn the bad habit of putting interfaces where they do not belong, for example. Or to learn that if there is an interface, then it must pass as much data as possible in one go. Talent is the key. You need to be a self-starter with a passion for technology and have competitive drive. These can be found in many guises, so we place very few limits on the rest. If you show you can learn and code fast, we don&#39;t necessarily care about academic or career histories. You can be located anywhere in the world, and you can work from home. There may be some travel but not very much. In the context of EU FP7 projects, we are working with some of the best minds in database, including Peter Boncz of CWI and VU Amsterdam (MonetDB, VectorWise) and Thomas Neumann of Technical University of Munich (RDF3X, HYPER). This is an extra guarantee that you will be working on the most relevant problems in database, informed by the results of the very best work to date. For more background, please see the IEEE Computer Society Bulletin of the Technical Committee on Data Engineering, Special Issue on Column Store Systems. All articles and references therein are relevant for the job. 
Be sure to read the CWI work on run time optimization (ROX), cracking, and recycling. Do not miss the many papers on architecture-conscious, cache-optimized algorithms; see the VectorWise and MonetDB articles in the bulletin for extensive references. If you are interested in an opportunity with us, we will ask you to do a little exercise in multithreaded, performance-critical coding, to be detailed in a blog post in a few days. If you have done similar work in research or industry, we can substitute the exercise with a suitable sample of this, but only if this is core database code. There is a dual message: The challenges will be the toughest a very tough race can offer. On the other hand, I do not want to scare you away prematurely. Nobody knows this stuff, except for the handful of people who actually do core database work. So we are not limiting this call to this small crowd and will teach you on the job if you just come with an aptitude to think in algorithms and code fast. Experience has pros and cons so we do not put formal bounds on this. &quot;Just out of high school&quot; may be good enough, if you are otherwise exceptional. Prior work in RDF or semantic web is not a factor. Sponsorship of your M.Sc. or Ph.D. thesis, if the topic is in our line of work and implementation can be done in our environment, is a further possibility. Seasoned pros are also welcome and will know the nature of the gig from the reading list. We are aiming to fill the position(s) between now and October. Resumes and inquiries can be sent to Hugh Williams, hwilliams@openlinksw.com. We will contact applicants for interviews.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>If it is advanced database technology, you will get to do it with us.</p>

<p>We are looking for exceptional talent to implement some of the hardest stuff in the industry.  This ranges from new approaches to query optimization; to parallel execution (both scale up and scale out); to elastic cloud deployments and self-managing, self-tuning, fault-tolerant databases.  We are most familiar to the RDF world, but also have full SQL support, and the present work will serve both use cases equally.</p>

<p>We are best known in the realms of high-performance database connectivity middleware and massively-scalable Linked-Data-oriented graph-model DBMS technology.</p>

<p>We have the basics -- SQL and SPARQL, column store, vectored execution, cost based optimization, parallel execution (local and cluster), and so forth.  In short, we have everything you would expect from a DBMS.  We do transactions as well as analytics, but the greater challenges at present are on the analytics side.</p>

<p>You will be working with my team covering:</p>

<ul>
 <li>
  <p>Adaptive query optimization -- interleaving execution and optimization, so as to always make the correct plan choices based on actual data characteristics</p>
 </li>

<li>
  <p>Self-managing cloud deployments for elastic big data -- clusters that can grow themselves and redistribute load, recover from failures, etc.</p>
</li>

<li>
  <p>Developing and analyzing new benchmarks for RDF and graph databases</p>
</li>

<li>
  <p>Embedding complex geospatial reasoning inside the database engine.  We have the basic R-tree and the OGC geometry data types; now we need to go beyond this</p>
</li>

<li>
  <p>Every type of SQL optimizer and execution engine trick that serves to optimize for TPC-H and DS.</p>
</li>
</ul>

<p>What do I mean by really good?  It boils down to being a smart and fast programmer.  We have over the years talked to people, including many who have worked on DBMS programming, and found that they actually know next to nothing of database science.  For example, they might not know what a hash join is.  Or they might not know that interprocess latency is in the tens of microseconds even within one box, and that in that time one can do tens of index lookups.  Or they might not know that blocking on a mutex kills.</p>

<p>If you do core database work, we want you to know how many CPU cache misses you will have in flight at any point of the algorithm, and how many clocks will be spent waiting for them at what points.  Same for distributed execution: The only way a cluster can perform is having max messages with max payload per message in flight at all times.</p>

<p>These are things that can be learned.  So I do not necessarily expect that you have in-depth experience of these, especially since most developer jobs are concerned with something else.  You may have to unlearn the bad habit of putting interfaces where they do not belong, for example.  Or to learn that if there is an interface, then it must pass as much data as possible in one go.</p>

<p>Talent is the key.  You need to be a self-starter with a passion for technology and have competitive drive.  These can be found in many guises, so we place very few limits on the rest.  If you show you can learn and code fast, we don&#39;t necessarily care about academic or career histories.  You can be located anywhere in the world, and you can work from home.  There may be some travel but not very much.</p>

<p>In the context of <a href="http://lod2.eu/" id="link-id0x7719ea0">EU FP7 projects</a>, we are working with some of the best minds in database, including <a href="http://homepages.cwi.nl/~boncz/" id="link-id0x21d67d80">Peter Boncz</a> of CWI and VU Amsterdam (MonetDB, VectorWise) and <a href="http://www.mpi-inf.mpg.de/~neumann/" id="link-id0x2192d900">Thomas Neumann</a> of Technical University of Munich (RDF3X, HYPER).  This is an extra guarantee that you will be working on the most relevant problems in database, informed by the results of the very best work to date.</p>

<p>For more background, please see the IEEE Computer Society <i>Bulletin of the Technical Committee on Data Engineering,</i> <a href="http://sites.computer.org/debull/A12mar/issue1.htm" id="link-id0x7ca1d20">Special Issue on Column Store Systems</a>.</p>

<p>All articles and references therein are relevant for the job.  Be sure to read the CWI work on run time optimization (ROX), cracking, and recycling.   Do not miss the many papers on architecture-conscious, cache-optimized  algorithms; see the VectorWise and MonetDB articles in the bulletin for extensive references.</p>

<p>If you are interested in an opportunity with us, we will ask you to do a little exercise in multithreaded, performance-critical coding, to be detailed in a blog post in a few days.  If you have done similar work in research or industry, we can substitute the exercise with a suitable sample of this, but only if this is core database code.</p>

<p>There is a dual message:  The challenges will be the toughest a very tough race can offer.  On the other hand, I do not want to scare you away prematurely.  Nobody knows this stuff, except for the handful of people who actually do core database work.  So we are  not limiting this call to this small crowd and will teach you on the job if you just come with an aptitude to think in algorithms and code fast.  Experience has pros and cons so we do not put formal bounds on this.  &quot;Just out of high school&quot; may be good enough, if you are otherwise exceptional.  Prior work in RDF or semantic web is not a factor.  Sponsorship of your M.Sc. or Ph.D. thesis, if the topic is in our line of work and implementation can be done in our environment, is a further possibility.  Seasoned pros are also welcome and will know the nature of the gig from the reading list.</p>

<p>We are aiming to fill the position(s) between now and October.</p>

<p>Resumes and inquiries can be sent to Hugh Williams, <a href="mailto:hwilliams@openlinksw.com?subject=2012-08%20Virtuoso%20Developer%20Application" id="link-id0x5dc1ac0">hwilliams@openlinksw.com</a>.  We will contact applicants for interviews.</p>
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2012-04-23#1715">
  <rss:title>IEEE publication of &quot;Virtuoso, a Hybrid RDBMS/Graph Column Store&quot;</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2012-04-23T14:55:31Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">My article, Virtuoso, a Hybrid RDBMS/Graph Column Store (PDF), can be found in Volume 35, Number 1, March 2012 (PDF) of the Bulletin of the IEEE Computer Society Technical Committee on Data Engineering (also known as the IEEE Data Engineering Bulletin). Abstract: We discuss applying column store techniques to both graph (RDF) and relational data for mixed workloads ranging from lookup to analytics in the context of the OpenLink Virtuoso DBMS. In so doing, we need to obtain the excellent memory efficiency, locality and bulk read throughput that are the hallmark of column stores while retaining low-latency random reads and updates, under serializable isolation. DBLP BibTeX Record &#39;journals/debu/Erling12&#39; (XML) @article{DBLP:journals/debu/Erling12, author = {Orri Erling}, title = {Virtuoso, a Hybrid RDBMS/Graph Column Store}, journal = {IEEE Data Eng. Bull.}, volume = {35}, number = {1}, year = {2012}, pages = {3-8}, ee = {http://sites.computer.org/debull/A12mar/vicol.pdf}, bibsource = {DBLP, http://dblp.uni-trier.de} }</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>My article, <b><i>Virtuoso, a Hybrid RDBMS/Graph Column Store</i></b> (<a href="http://sites.computer.org/debull/A12mar/vicol.pdf" id="link-id0x1f456350">PDF</a>), can be found in <a href="http://www.informatik.uni-trier.de/~ley/db/journals/debu/debu35.html" id="link-id0x8bf15d8">Volume 35, Number 1, March 2012</a> (<a href="http://sites.computer.org/debull/A12mar/A12MAR-CD.pdf" id="link-id0x20adffb8">PDF</a>) of the <i><a href="http://www.informatik.uni-trier.de/~ley/db/journals/debu/index.html" id="link-id0x1e133b50">Bulletin</a> of the <a href="http://dbpedia.org/resource/IEEE_Computer_Society" id="link-id0x1f46c998">IEEE Computer Society</a> <a href="http://tab.computer.org/tcde/" id="link-id0x1c3fce40">Technical Committee on Data Engineering</a></i> (also known as the <i><a href="http://www.informatik.uni-trier.de/~ley/db/journals/debu/index.html" id="link-id0x17d03718">IEEE Data Engineering Bulletin</a>)</i>.



</p>
<p>
<b>Abstract:</b>
</p>
<blockquote>
<i>We discuss applying column store techniques to both graph (RDF) and relational data for mixed workloads ranging from lookup to analytics in the context of the OpenLink Virtuoso DBMS. In so doing, we need to obtain the excellent memory efficiency, locality and bulk read throughput that are the hallmark of column stores while retaining low-latency random reads and updates, under serializable isolation.</i>
</blockquote>

<p>
<b>DBLP BibTeX Record &#39;journals/debu/Erling12&#39;</b> (<a href="http://dblp.uni-trier.de/rec/bibtex/journals/debu/Erling12.xml" id="link-id0x20356268">XML</a>)</p>
<blockquote>
 <pre><tt>@article{DBLP:journals/debu/Erling12,
  author    = {Orri Erling},
  title     = {Virtuoso, a Hybrid RDBMS/Graph Column Store},
  journal   = {IEEE Data Eng. Bull.},
  volume    = {35},
  number    = {1},
  year      = {2012},
  pages     = {3-8},
  ee        = {http://sites.computer.org/debull/A12mar/vicol.pdf},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}
</tt>
 </pre></blockquote>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2012-04-17#1713">
  <rss:title>ICDE 2012 (post 6 of 6) - Science Data Panel</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2012-04-17T19:38:20Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Michael Stonebraker chaired a panel on the future of science data at ICDE 2012 last week. Other participants were Jeremy Kepner from MIT Lincoln Labs, Anastasia Ailamaki from EPFL, and Alex Szalay from Johns Hopkins University. This is the thrust of what was said, noted from memory. My comments follow after the synopsis. Jeremy Kepner: When Java was new we saw it as the coming thing and figured that in HPC we should find space for this. When MapReduce and Hadoop came along, we saw this as a sea change in parallel programming models. This was so simple literally anybody could make parallel algorithms whereas this was not so with MPI. Even parallel distributed arrays are harder. So MapReduce was a game changer, together with the cloud where anybody can get a cluster. Hardly a week passes without me having to explain to somebody in government what MapReduce and Hadoop are about. We have a lot of arrays and a custom database for them. But the arrays are sparse so this is in fact a triple store. Our users like to work in MATLAB, and any data management must run together with that. Of course, MapReduce is not a real scheduler, and Hadoop is not a real file system. For deployment, we must integrate real schedulers and make HDFS look like a file system to applications. The abstraction of a file system is something people like. Being able to skip a time-consuming data-ingestion process with a database is an advantage with file-based paradigms like Hadoop. If this is enhanced with the right scheduling features, this can be a good component in the HPC toolbox. Michael Stonebraker: Users of the data use math packages like R, MATLAB, SAS, SPSS, or similar. If business intelligence is about AVG, MIN, MAX, COUNT, and GROUP BY, science applications are much more diverse in their analytics. All science algorithms have an inner loop that resembles linear algebra operations like matrix multiplication. 
Data is more often than not a large array. There are some graphs in biology and chemistry, but the world is primarily rectangular. Relational databases can emulate sparse arrays but are 20x slower than a custom-made array database for dense arrays. And I will not finish without picking on MapReduce: I know of 2000-node MapReduce clusters. The work they do is maybe that of a 100-node parallel database. So if 2000 nodes is what you want to operate, be my guest. Science database is a zero billion dollar business. We do not expect to make money from the science market with SciDB, which by now works and has commercial services supplied by Paradigm4, while the code itself is open source, which is a must for the science community. The real business opportunity is in the analytics needed by insurance and financial services in general, which are next to identical with the science use cases SciDB tackles. This makes the vendors pay attention. Alex Szalay: The way astronomy is done today is through surveys: a telescope scans through the sky and produces data. We have now for 10 years operated the Sloan Digital Sky Survey and kept the data online. We have all the data, and complete query logs, available for anyone interested. When we set out to do this with Jim Gray, everybody found this a crazy idea, but it has worked out. Anastasia Ailamaki: We do not use SciDB. We find a lot of spatial use cases. Researchers need access to simulation results which are usually over a spatial model, like in earthquake simulations and the brain. Off-the-shelf techniques like R trees do not work -- the objects overlap too much -- so we have made our own spatial indexing. We make custom software when it is necessary, and are not tied to vendors. In geospatial applications, we can create meshes of different shapes -- like tetrahedra or cubes for earthquakes, and cylinders for the brain -- and index these in a geospatial index. 
But since an R tree is inefficient when objects overlap too much, as these do, we just find one; and then because there is reachability from an object to neighboring ones, we use this to get all the objects in the area of interest. * * * This is obviously a diverse field. Probably the message that we can synthesize out of this is that flexibility and parallel programming models are what we need to pay attention to. There is a need to go beyond what one can do in SQL while continuing to stay close to the data. Also, allowing for plug-in data types and index structures may be useful; we sometimes get requests for such anyway. The continuing argument around MapReduce and Hadoop is a lasting feature of the landscape. A parallel DB will beat MapReduce any day at joining across partitions; the problem is to overcome the mindset that sees Hadoop as the always-first answer to anything parallel. People will likely have to fail with this before they do anything else. For us, the matter is about having database-resident logic for extract-transform-load (ETL) that can do data-integration type-transformations and maybe iterative graph algorithms that constantly join across partitions, better than a MapReduce job, while still allowing application logic to be written in Java. Teaching sem-web-heads to write SQL procedures and to know about join order, join type, and partition locality, has proven to be difficult. People do not understand latency, whether in client-server or cluster settings. This is why they do not see the point of stored procedures or of shipping functions to data. This sounds like a terrible indictment, like saying that people do not understand why rivers flow downhill. Yet, it is true. 
This is also why MapReduce is maybe the only parallel programming paradigm that can be successfully deployed in the absence of this understanding, since it is actually quite latency-tolerant, not having any synchronous cross-partition operations except for the succession of the map and reduce steps themselves. Maybe it is so that the database guys see MapReduce as an insult to their intelligence and the rest of the world sees it as the only understandable way of running grep and sed (Unix commands for string search/replace) in parallel, with the super bonus of letting you reshuffle the outputs so that you can compare everything to everything else, which grep alone never let you do. * * * Making a database that does not need data loading seems a nice idea, and CWI has actually done something in this direction in &quot;Here are my Data Files. Here are my Queries. Where are my Results?&quot; However, there is another product called Algebra Data that claims to take in data without loading and to optimize storage based on access. We do not have immediate plans in this direction. Bulk load is already quite fast (take 100G TPC-H in 70 minutes or so), but faster is always possible.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>
<a href="http://dbpedia.org/resource/Michael_Stonebraker" class="absuri" id="link-id0x26c05b28">Michael Stonebraker</a> chaired a panel on the future of science data at <a href="http://www.icde12.org/Site/" class="absuri" id="link-id0x26d5d658">ICDE 2012</a> last week. Other participants were <a href="http://www.mit.edu/~kepner/" class="absuri" id="link-id0x26c05360">Jeremy Kepner</a> from <a href="http://www.mit.edu/" class="absuri" id="link-id0x26b12608">MIT</a> <a href="http://dbpedia.org/resource/Lincoln_Laboratory" class="absuri" id="link-id0x27d4f370">Lincoln Labs</a>, <a href="http://people.epfl.ch/cgi-bin/people?id=177957" class="absuri" id="link-id0x25259440">Anastasia Ailamaki</a> from <a href="http://dbpedia.org/resource/%C3%89cole_Polytechnique_F%C3%A9d%C3%A9rale_de_Lausanne" class="absuri" id="link-id0x252593b8">EPFL</a>, and <a href="http://www.sdss.jhu.edu/~szalay/" class="absuri" id="link-id0x26c067e8">Alex Szalay</a> from <a href="http://dbpedia.org/resource/Johns_Hopkins_University" class="absuri" id="link-id0x26c06678">Johns Hopkins University</a>.</p>
<p> This is the thrust of what was said, noted from memory. My comments follow after the synopsis.</p>
<p> <b>Jeremy Kepner:</b> When <a href="http://dbpedia.org/resource/Java_(programming_language)" class="absuri" id="link-id0x26c052d8">Java</a> was new we saw it as the coming thing and figured that in <a href="http://dbpedia.org/resource/High-performance_computing" class="absuri" id="link-id0x26c06900">HPC</a> we should find space for this. When <a href="http://dbpedia.org/resource/MapReduce" class="absuri" id="link-id0x26d5d518">MapReduce</a> and <a href="http://dbpedia.org/resource/Apache_Hadoop" class="absuri" id="link-id0x27d6a330">Hadoop</a> came along, we saw this as a sea change in parallel programming models. This was so simple literally anybody could make parallel algorithms whereas this was not so with <a href="http://dbpedia.org/resource/Message_Passing_Interface" class="absuri" id="link-id0x25290350">MPI</a>. Even parallel distributed arrays are harder. So MapReduce was a game changer, together with the cloud where anybody can get a cluster. Hardly a week passes without me having to explain to somebody in government what MapReduce and Hadoop are about.</p>
<p>We have a lot of arrays and a custom database for them. But the arrays are sparse so this is in fact a triple store. Our users like to work in <a href="http://dbpedia.org/resource/MATLAB" class="absuri" id="link-id0x26b11980">MATLAB</a>, and any data management must run together with that.</p>
<p>Of course, MapReduce is not a real scheduler, and Hadoop is not a real file system. For deployment, we must integrate real schedulers and make <a href="http://dbpedia.org/resource/Hadoop_Distributed_File_System" class="absuri" id="link-id0x27d50258">HDFS</a> look like a file system to applications. The abstraction of a file system is something people like. Being able to skip a time-consuming data-ingestion process with a database is an advantage with file-based paradigms like Hadoop. If this is enhanced with the right scheduling features, this can be a good component in the HPC toolbox.</p>
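<p>Kepner's remark that a sparse array is in effect a triple store can be made concrete with a small sketch. This is an illustration only and says nothing about how the Lincoln Labs database is actually built; the function names <tt>to_triples</tt> and <tt>matvec</tt> are made up for the example.</p>

```python
# Illustrative only: a sparse 2-D array held as (row, column, value) triples.
from collections import defaultdict

def to_triples(dense):
    """Keep only the non-zero cells of a dense list-of-lists matrix."""
    return [(r, c, v)
            for r, row in enumerate(dense)
            for c, v in enumerate(row)
            if v != 0]

def matvec(triples, vector):
    """Multiply the sparse matrix by a vector, touching only stored cells."""
    result = defaultdict(float)
    for r, c, v in triples:
        result[r] += v * vector[c]
    return dict(result)

dense = [
    [0, 0, 3],
    [0, 0, 0],
    [2, 0, 0],
]
triples = to_triples(dense)
product = matvec(triples, [1, 1, 1])
```

<p>Rows with no stored cells simply do not appear in the result, which is the same behavior one gets from a join against a triple table.</p>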
<p> <b>Michael Stonebraker:</b> Users of the data use math packages like R, MATLAB, SAS, SPSS, or similar. If business intelligence is about <tt>AVG</tt>, <tt>MIN</tt>, <tt>MAX</tt>, <tt>COUNT</tt>, and <tt>GROUP BY</tt>, science applications are much more diverse in their analytics. All science algorithms have an inner loop that resembles linear algebra operations like matrix multiplication. Data is more often than not a large array. There are some graphs in biology and chemistry, but the world is primarily rectangular. Relational databases can emulate sparse arrays but are 20x slower than a custom-made array database for dense arrays. And I will not finish without picking on MapReduce: I know of 2000-node MapReduce clusters. The work they do is maybe that of a 100-node parallel database. So if 2000 nodes is what you want to operate, be my guest.</p>
<p> Science database is a zero billion dollar business. We do not expect to make money from the science market with <a href="http://www.scidb.org/" class="absuri" id="link-id0x25269080">SciDB</a>, which by now works and has commercial services supplied by Paradigm 4, while the code itself is open source, which is a must for the science community. The real business opportunity is in the analytics needed by insurance and financial services in general, which are next to identical with the science use cases SciDB tackles. This makes the vendors pay attention.</p>
<p> <b>Alex Szalay:</b> The way astronomy is done today is through surveys: a telescope scans through the sky and produces data. We have now operated the Sloan Digital Sky Survey for 10 years and kept the data online. We have all the data, and complete query logs, available for anyone interested. When we set out to do this with Jim Gray, everybody found this a crazy idea, but it has worked out.</p>
<p> <b>Anastasia Ailamaki:</b> We do not use SciDB. We find a lot of spatial use cases. Researchers need access to simulation results which are usually over a spatial model, like in earthquake simulations and the brain. Off-the-shelf techniques like R trees do not work -- the objects overlap too much -- so we have made our own spatial indexing. We make custom software when it is necessary, and are not tied to vendors. In geospatial applications, we can create meshes of different shapes -- like tetrahedra or cubes for earthquakes, and cylinders for the brain -- and index these in a geospatial index. But since an R tree is inefficient when objects overlap too much, as these do, we just find one object in the area of interest; and then because there is reachability from an object to neighboring ones, we use this to get all the other objects in that area.</p>
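<p>The find-one-then-follow-neighbors idea can be sketched as a breadth-first crawl over mesh adjacency. This is a minimal sketch under my own assumptions, not the EPFL implementation: elements are reduced to centroids, the query region is an axis-aligned box, and the names <tt>in_box</tt> and <tt>crawl</tt> are hypothetical.</p>

```python
from collections import deque

def in_box(point, box):
    """True if a 3-D point lies inside an axis-aligned box (lo, hi)."""
    lo, hi = box
    return all(lo[i] <= point[i] <= hi[i] for i in range(3))

def crawl(seed, centroids, neighbors, box):
    """Collect every mesh element reachable from `seed` without leaving the box."""
    if not in_box(centroids[seed], box):
        return set()
    found = {seed}
    queue = deque([seed])
    while queue:
        current = queue.popleft()
        for nxt in neighbors[current]:
            if nxt not in found and in_box(centroids[nxt], box):
                found.add(nxt)
                queue.append(nxt)
    return found

# Five elements in a row along the x axis; each touches the next one.
centroids = {i: (float(i), 0.0, 0.0) for i in range(5)}
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 4] for i in range(5)}
box = ((0.5, -1.0, -1.0), (3.5, 1.0, 1.0))
hits = crawl(2, centroids, neighbors, box)
```

<p>The crawl only finds everything if the part of the mesh inside the region is connected, so the one index lookup for the seed does the locating and adjacency does the rest.</p>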
<p align="center">*                     *                     *</p>
<p>This is obviously a diverse field. Probably the message that we can synthesize out of this is that <b><i>flexibility and parallel programming models are what we need to pay attention to.</i></b> There is a need to go beyond what one can do in SQL while continuing to stay close to the data. Also, allowing for plug-in data types and index structures may be useful; we sometimes get requests for such anyway.</p>
<p>The continuing argument around MapReduce and Hadoop is a lasting feature of the landscape. A parallel DB will beat MapReduce any day at joining across partitions; the problem is to overcome the mindset that sees Hadoop as the always-first answer to anything parallel. People will likely have to fail with this before they do anything else. For us, the matter is about having database-resident logic for <a href="http://dbpedia.org/resource/Extract,_transform,_load" class="absuri" id="link-id0x26c06558">extract-transform-load (ETL)</a> that can do data-integration type-transformations and maybe iterative graph algorithms that constantly join across partitions, better than a MapReduce job, while still allowing application logic to be written in Java. Teaching sem-web-heads to write SQL procedures and to know about join order, join type, and partition locality, has proven to be difficult. People do not understand latency, whether in client-server or cluster settings. This is why they do not see the point of stored procedures or of shipping functions to data. This sounds like a terrible indictment, like saying that people do not understand why rivers flow downhill. Yet, it is true. This is also why MapReduce is maybe the only parallel programming paradigm that can be successfully deployed in the absence of this understanding, since it is actually quite latency-tolerant, not having any synchronous cross-partition operations except for the succession of the map and reduce steps themselves.</p>
<p>Maybe it is so that the database guys see MapReduce as an insult to their intelligence and the rest of the world sees it as the only understandable way of running <tt><a href="http://dbpedia.org/resource/Grep" class="absuri" id="link-id0x2507a000">grep</a></tt> and <tt><a href="http://dbpedia.org/resource/Sed" class="absuri" id="link-id0x25079ec0">sed</a></tt> (Unix commands for string search/replace) in parallel, with the super bonus of letting you reshuffle the outputs so that you can compare everything to everything else, which <tt>grep</tt> alone never let you do.</p>
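<p>For readers who have not seen it spelled out, the grep-plus-reshuffle view of MapReduce fits in a few lines. This is a single-process sketch of the three phases, not Hadoop code; the function names and the sample lines are invented for the illustration.</p>

```python
from collections import defaultdict
import re

def map_phase(lines, pattern):
    """The grep part: emit (matched word, line number) for every match."""
    for lineno, line in enumerate(lines, start=1):
        for match in re.findall(pattern, line):
            yield match, lineno

def shuffle(pairs):
    """The reshuffle: bring all values for the same key together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Work on everything under one key; here, just count the occurrences."""
    return {key: len(values) for key, values in groups.items()}

lines = ["hadoop and mpi", "mpi only", "sql and hadoop and mpi"]
counts = reduce_phase(shuffle(map_phase(lines, r"hadoop|mpi")))
```

<p>The only synchronization point is between the phases, which is why the model tolerates latency so well.</p>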
<p align="center">*                     *                     *</p>
<p>Making a database that does not need data loading seems a nice idea, and CWI has actually done something in this direction in &quot;<a href="http://infoscience.epfl.ch/record/161489" class="absuri" id="link-id0x250787f0">Here are my Data Files. Here are my Queries. Where are my Results?</a>&quot;. There is also another product called Algebra Data that claims to take in data without loading and to optimize storage based on access. We do not have immediate plans in this direction. Bulk load is already quite fast (take 100G TPC-H in 70 minutes or so), but faster is always possible.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2012-04-17#1712">
  <rss:title>ICDE 2012 (post 5 of 6) - Graphs</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2012-04-17T19:38:15Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">There were quite a few talks about graphs at ICDE 2012. Neither the representations of graphs, nor the differences between RDF and generic graph models, entered much into the discussion. On the other hand, graph similarity searches and related were addressed a fair bit. Graph DB and RDF/Linked Data are distinct, if neighboring disciplines. On one hand, graph problems predate Linked Data, and the RDF/Linked Data world is a web artifact, which graphs are not as such, so a slightly different cultural derivation also makes these disjoint. Besides, graphs may imply schema first whereas linked data basically cannot. Then another differentiation might be derived from edges not really being first class citizens in RDF, except for reification, at which the RDF reification vocabulary is miserably inadequate, as pointed out before. RDF is being driven by the web-style publishing of Linked Open Data (LOD), with some standardization and uptake by publishers; Graph DB is not standardized but driven by diverse graph-analytics use cases. There is no necessary reason why these could not converge, but it will be indefinitely long before any standards come to cover this, so best not hold one&#39;s breath. Communities are jealous of their borders, so if the neighbor does something similar one tends to emphasize the differences and not the commonalities. So for some things, one could warehouse the original RDF of the web microformats and LOD, and then ETL into some other graph model for specific tasks, or just do these in RDF. Of course, then RDF systems need to offer suitable capabilities. These seem to be about very fast edge traversal within a rather local working set, and about accommodating large, iteratively-updated intermediate results, e.g., edge weights. 
Judging by the benchmarks paper (Benchmarking traversal operations over graph databases (Slidedeck (ppt), paper (pdf)); Marek Ciglan, Alex Averbuch, and Ladislav Hluchý.) at the GDM workshop, the state of benchmarking in graph databases is even worse than in RDF, where the state is bad enough. The paper&#39;s premise was flawed from the start, using application logic to do JOINs instead of doing them in the DBMS. In this way, latency comes to dominate, and only the most blatant differences are seen. There is nothing like this style of benchmarking to make an industry look bad. The supercomputer Graph 500 benchmark, on the other hand, lets the contestants make their own implementations on a diversity of architectures with random traversal as well as loading and generating large intermediate results. It is somewhat limited, but still broader than the graph database benchmarks paper at the GDM workshop. Returning to graphs, there were some papers on similarity search and clique detection. As players in this space, beyond just RDF, we might as well consider implementing necessary features for efficient expression of such problems. The algorithms discussed were expressed in procedural code against memory-based data structures; there is usually no query language or parallel/distributed processing involved. MapReduce has become the default way in which people would tackle such problems at scale; in fact, people do not consider anything else, as far as I can tell. Well, they certainly do not consider MPI for example as a first choice. The parallel array things in Fortran do not at first sight seem very graphy, so this is likely not something that crosses one&#39;s mind either. We should try some of the similarity search and clustering in SQL with a parallel programming model. We have excellent expression-evaluation speed from vectoring and unrestricted recursion between partitions, and no file system latencies like MapReduce. 
The initial test case will be some of the linking/data-integration/mapping workloads in LOD2. Having some sort-of-agreed-upon benchmark for these workloads would make this more worthwhile. Again, we will see what emerges.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>There were quite a few talks about graphs at <a href="http://www.icde12.org/Site/" class="absuri" id="link-id0x26d23c70">ICDE 2012</a>. Neither the representations of graphs, nor the differences between RDF and generic graph models, entered much into the discussion. On the other hand, graph similarity searches and related were addressed a fair bit.</p>
<p>Graph DB and RDF/Linked Data are distinct, if neighboring, disciplines. On one hand, graph problems predate Linked Data, and the RDF/Linked Data world is a web artifact, which graphs are not as such, so a slightly different cultural derivation also makes these disjoint. Besides, graphs may imply schema first whereas linked data basically cannot. Then another differentiation might be derived from edges not really being first class citizens in RDF, except for reification, at which the RDF reification vocabulary is miserably inadequate, as pointed out before.</p>
<p>RDF is being driven by the web-style publishing of Linked Open Data (LOD), with some standardization and uptake by publishers; Graph DB is not standardized but driven by diverse graph-analytics use cases.</p>
<p>There is no necessary reason why these could not converge, but it will be indefinitely long before any standards come to cover this, so best not hold one&#39;s breath. Communities are jealous of their borders, so if the neighbor does something similar one tends to emphasize the differences and not the commonalities.</p>
<p>So for some things, one could warehouse the original RDF of the web microformats and LOD, and then ETL into some other graph model for specific tasks, or just do these in RDF. Of course, then RDF systems need to offer suitable capabilities. These seem to be about very fast edge traversal within a rather local working set, and about accommodating large, iteratively-updated intermediate results, e.g., edge weights.</p>
<p>Judging by the benchmarks paper (<i>Benchmarking traversal operations over graph databases (<a href="http://www.cse.unsw.edu.au/~iwgdm/2012/Slides/Ciglan.pptx" class="absuri" id="link-id0x26e6cc68">Slidedeck (ppt)</a>, <a href="http://ups.savba.sk/~marek/papers/gdm12-ciglan.pdf" class="absuri" id="link-id0x28545128">paper (pdf)</a>);</i> Marek Ciglan, Alex Averbuch, and Ladislav Hluchý.) at the <a href="http://www.cse.unsw.edu.au/~iwgdm/2012/index.html" class="absuri" id="link-id0x26b17d70">GDM workshop</a>, the state of benchmarking in graph databases is even worse than in RDF, where the state is bad enough. The paper&#39;s premise was flawed from the start, using application logic to do <tt>JOIN</tt>s instead of doing them in the DBMS. In this way, latency comes to dominate, and only the most blatant differences are seen. There is nothing like this style of benchmarking to make an industry look bad. The supercomputer Graph 500 benchmark, on the other hand, lets the contestants make their own implementations on a diversity of architectures with random traversal as well as loading and generating large intermediate results. It is somewhat limited, but still broader than the graph database benchmarks paper at the GDM workshop.</p>
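<p>To make the point about application-side joins concrete, here is a small sketch of a two-hop traversal done both ways. It uses an in-memory SQLite table purely for illustration; the table name <tt>edge</tt> and the function names are assumptions of the example, and nothing here comes from the paper. In-process SQLite has no network round trip, so the sketch counts statements instead; in a client-server setting each statement would cost one round trip.</p>

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE edge (src INTEGER, dst INTEGER)")
db.executemany("INSERT INTO edge VALUES (?, ?)",
               [(1, 2), (1, 3), (2, 4), (3, 4), (3, 5)])

def two_hop_in_application(start):
    """Join in application code: one statement per intermediate vertex."""
    statements = 1
    first = [row[0] for row in
             db.execute("SELECT dst FROM edge WHERE src = ?", (start,))]
    result = set()
    for vertex in first:
        statements += 1
        result.update(row[0] for row in
                      db.execute("SELECT dst FROM edge WHERE src = ?", (vertex,)))
    return result, statements

def two_hop_in_dbms(start):
    """The same traversal as a single self-join evaluated by the DBMS."""
    rows = db.execute(
        "SELECT DISTINCT b.dst FROM edge a JOIN edge b ON b.src = a.dst "
        "WHERE a.src = ?", (start,))
    return {row[0] for row in rows}, 1

app_result, app_statements = two_hop_in_application(1)
db_result, db_statements = two_hop_in_dbms(1)
```

<p>Both give the same vertices, but the application-side version issues one statement per vertex on the frontier, so its cost grows with fan-out while the join stays at one.</p>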
<p> Returning to graphs, there were some papers on similarity search and clique detection. As players in this space, beyond just RDF, we might as well consider implementing necessary features for efficient expression of such problems. The algorithms discussed were expressed in procedural code against memory-based data structures; there is usually no query language or parallel/distributed processing involved.</p>
<p>
<a href="http://dbpedia.org/resource/MapReduce" class="absuri" id="link-id0x27d27f40">MapReduce</a> has become the default way in which people would tackle such problems at scale; in fact, people do not consider anything else, as far as I can tell. Well, they certainly do not consider MPI for example as a first choice. The parallel array things in Fortran do not at first sight seem very graphy, so this is likely not something that crosses one&#39;s mind either.</p>
<p>We should try some of the similarity search and clustering in SQL with a parallel programming model. We have excellent expression-evaluation speed from vectoring and unrestricted recursion between partitions, and no file system latencies like MapReduce. The initial test case will be some of the linking/data-integration/mapping workloads in LOD2.</p>
<p> Having some sort-of-agreed-upon benchmark for these workloads would make this more worthwhile. Again, we will see what emerges.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2012-04-17#1711">
  <rss:title>ICDE 2012 (post 4 of 6) - Graph Data Management Workshop</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2012-04-17T19:38:09Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">I gave an invited talk (&quot;Virtuoso 7 - Column Store and Adaptive Techniques for Graph&quot; (Slides (ppt))) at the Graph Data Management Workshop at ICDE 2012. Bryan Thompson of Systap (Bigdata® RDF store) was also invited, so we got to talk about our common interests. He told me about two cool things they have recently done, namely introducing tables to SPARQL, and adding a way of reifying statements that does not rely on extra columns. The table business is just about being able to store a multicolumn result set into a named persistent entity for subsequent processing. But this amounts to a SQL table, so the relational model has been re-arrived at, once more, from practical considerations. The reification just packs all the fields of a triple (or quad) into a single string and this string is then used as an RDF S or O (Subject or Object), less frequently a P or G (Predicate or Graph). This works because Bigdata® has variable length fields in all columns of the triple/quad table. The query notation then accepts a function-looking thing in a triple pattern to mark reification. Nice. Virtuoso has a variable length column in only the O but could of course also have one in S and even in P and G. The column store would still compress the same as long as reified values did not occur. These values on the other hand would be unlikely to compress very well but run-length and dictionary compression would always work. So, we could do it like Bigdata®, or we could add a &quot;quad ID&quot; column to one of the indices, to give a reification ID to quads. Again no penalty in a column store, if you do not access the column. Or we could make an extra table of PSOG-&gt;R. Yet another variation would be to make the SPOG concatenation a literal that is interned in the RDF literal table, and then used as any literal would be in the O, and as an IRI in a special range when occurring as S. 
The relative merits depend on how often something will be reified and on whether one wishes to SELECT based on parts of reification. Whichever the case may be, the idea of a function-looking placeholder for a reification is a nice one and we should make a compatible syntax if we do special provenance/reification support. The model in the RDF reification vocabulary is a non-starter and a thing to discredit the sem web for anyone from database. I heard from Bryan that the new W3 RDF WG had declared provenance out of scope, unfortunately. The word on the street on the other hand is that provenance is increasingly found to be an issue. This is confirmed by the active work of the W3 Provenance Working Group.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>I gave an invited talk (&quot;Virtuoso 7 - Column Store and Adaptive Techniques for Graph&quot; (<a href="http://www.cse.unsw.edu.au/~iwgdm/2012/Slides/Virtuoso.ppt" class="absuri" id="link-id0x27fce2f8">Slides (ppt)</a>)) at the <a href="http://www.cse.unsw.edu.au/~iwgdm/2012/" class="absuri" id="link-id0x27eb6cd8">Graph Data Management Workshop</a> at <a href="http://www.icde12.org/Site/" class="absuri" id="link-id0x26d89980">ICDE 2012</a>.</p>
<p>Bryan Thompson of <a href="http://www.systap.com/" class="absuri" id="link-id0x2a893000">Systap</a> (<a href="http://www.systap.com/bigdata.htm" class="absuri" id="link-id0x26c36d70">Bigdata®</a> RDF store) was also invited, so we got to talk about our common interests. He told me about two cool things they have recently done, namely introducing tables to <a href="http://dbpedia.org/resource/SPARQL" class="absuri" id="link-id0x2a74b040">SPARQL</a>, and adding a way of <a href="http://dbpedia.org/resource/Reification_%28computer_science%29" class="absuri" id="link-id0x26c90790">reifying statements</a> that does not rely on extra columns. The table business is just about being able to store a multicolumn result set into a named persistent entity for subsequent processing. But this amounts to a SQL table, so the relational model has been re-arrived at, once more, from practical considerations. The reification just packs all the fields of a triple (or quad) into a single string and this string is then used as an RDF <tt>S</tt> or <tt>O</tt> (Subject or Object), less frequently a <tt>P</tt> or <tt>G</tt> (Predicate or Graph). This works because Bigdata® has variable length fields in all columns of the triple/quad table. The query notation then accepts a function-looking thing in a triple pattern to mark reification. Nice. Virtuoso has a variable length column in only the <tt>O</tt> but could of course also have one in <tt>S</tt> and even in <tt>P</tt> and <tt>G</tt>. The column store would still compress the same as long as reified values did not occur. These values on the other hand would be unlikely to compress very well but run-length and dictionary compression would always work.</p>
<p>So, we could do it like Bigdata®, or we could add a &quot;quad ID&quot; column to one of the indices, to give a reification ID to quads. Again no penalty in a column store, if you do not access the column. Or we could make an extra table of <tt>PSOG-&gt;R</tt>.</p>
<p>Yet another variation would be to make the <tt>SPOG</tt> concatenation a literal that is interned in the RDF literal table, and then used as any literal would be in the <tt>O</tt>, and as an IRI in a special range when occurring as <tt>S</tt>. The relative merits depend on how often something will be reified and on whether one wishes to <tt>SELECT</tt> based on parts of reification. Whichever the case may be, the idea of a function-looking placeholder for a reification is a nice one and we should make a compatible syntax if we do special provenance/reification support. The model in the RDF reification vocabulary is a non-starter and a thing to discredit the sem web for anyone from database.</p>
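<p>The pack-the-statement-into-a-string approach can be illustrated with a toy encoding. The length-prefixed format below is my own assumption for the sake of a runnable example; it is not Bigdata®'s actual encoding, and the prefixed names such as <tt>ex:crawl-2012-04</tt> are placeholders.</p>

```python
def pack(s, p, o, g=None):
    """Length-prefix each field so the packed string can be split unambiguously."""
    fields = [s, p, o] + ([g] if g is not None else [])
    return "".join(f"{len(f)}:{f}" for f in fields)

def unpack(packed):
    """Inverse of pack(): recover the original fields."""
    fields, pos = [], 0
    while pos != len(packed):
        colon = packed.index(":", pos)
        length = int(packed[pos:colon])
        fields.append(packed[colon + 1:colon + 1 + length])
        pos = colon + 1 + length
    return tuple(fields)

statement = ("ex:alice", "foaf:knows", "ex:bob")
reified = pack(*statement)
# The packed statement is used as the subject of a provenance triple.
store = {statement, (reified, "dc:source", "ex:crawl-2012-04")}
```

<p>Selecting on parts of a reified statement then means unpacking the string, which is the cost to weigh against a separate quad ID column or a <tt>PSOG-&gt;R</tt> table.</p>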
<p>I heard from Bryan that the new <a href="http://www.w3.org/2011/rdf-wg/" class="absuri" id="link-id0x28a628b0">W3 RDF WG</a> had declared provenance out of scope, unfortunately. The word on the street on the other hand is that provenance is increasingly found to be an issue. This is confirmed by the active work of the <a href="http://www.w3.org/2011/prov/" class="absuri" id="link-id0x2540a098">W3 Provenance Working Group</a>.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2012-04-17#1710">
  <rss:title>ICDE 2012 (post 3 of 6) - What Is Timely LOD Search Worth?</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2012-04-17T19:38:03Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">There was a talk (Linked Data and Live Querying for Enabling Support Platforms for Web Dataspaces (Slides (PDF)); Jürgen Umbrich, Marcel Karnstedt, Josiane Xavier Parreira, Axel Polleres and Manfred Hauswirth) at the Data Engineering Meets the Semantic Web (DESWEB) workshop at ICDE last week about the problems of caching LOD, whether attempted by Sindice or OpenLink&#39;s LOD Cloud Cache. The conclusion was that OpenLink covered a bit more of the test data sets and that Sindice was maybe better up to date on the ones that it covered but that neither did it very well. The data sets were random graphs of user FOAF profiles and such collected from some Billion Triples Data set, thus not data that is likely to have commercial value, except in huge quantities maybe for some advertising, except that click streams and the like are much more valuable. Being involved with at least one of these, and being in the audience, I felt obligated to comment. The fact is, neither OpenLink&#39;s LOD Cloud Cache nor Sindice is a business, and there is not a business model which could justify keeping them timely on the web crawls they contain. Doing so is easy enough, if there is a good enough reason. The talk did make a couple of worthwhile points: The data does change; and if one queries entities, one encounters large variation in change-frequency across entities and their attributes. The authors suggested to have a piece of middleware decide what things can be safely retrieved from a copy and what have to be retrieved from the source. Not too much is in fact known about the change frequency of the data, except that it changes, as the authors pointed out. The crux of the matter is that the thing that ought to know this best is the query processor at the LOD warehouse. For client-side middleware to split the query, it needs access to statistics that it must get from the warehouse or keep by itself. 
Of course, in concrete application scenarios, you go to the source if you ask about the weather or traffic jams, and otherwise go to the warehouse based on application-level knowledge. But for actual business intelligence, one needs histories, so a search engine with only the present is not so interesting. At any rate, refreshing the data should leave a trail of past states. Exposing this for online query would just triple the price, so we forget about that for now. Just keeping an append-only table of history is not too much of a problem. One may make extracts from this table into a relational form for specific business questions. There is no point doing such analytics in RDF itself. One would have to just try to see if there is anything remotely exploitable in such histories. Making a history table is easy enough. Maybe I will add one. Let us now see what it would take to operate a web crawl cache that would be properly provisioned, kept fresh, and managed. We base this on the Sindice crawl sizes and our experiments on these; the non-web-crawl LOD Cloud Cache is not included. From previous experience we know the sizing: 5Gt/144GB RAM. Today&#39;s best price point is on 24-DIMM E5 boards, so 192GB RAM, or 6.67Gt. A unit like that (8TB HDD, 0.5TB SSD, 192GB RAM, 12 core E5, InfiniBand) costs about $6800. The Sindice crawl is now about 20Gt, so $28K of gear (768GB RAM) is enough. Let us count this 4 times: 2x for anticipated growth; and 2x for running two copies -- one for online, and one for batch jobs. This is 3TB RAM. Power is 16x500W = 8KW, which we could round to 80A at 110V. Colocation comes to $500 for the space, and $1200 per month for power; make it $2500 per month with traffic included. At this rate, 3 year TCO is $120K + ( 36 * $2.5K ) = $210K. This takes one person half time to operate, so this is another $50K per year. We do not count software development in this, except some scripting that should be included in the yearly $50K DBA bill. 
Under what circumstances is such a thing profitable? Or can such a thing be seen as a marketing demo, to be paid for by license or service sales? A third party can operate a system of this sort, but then the cost will be dominated by software licenses if running on Virtuoso cluster. For comparison, the TB at EC2 costs ((( 16 * $2 ) * 24 ) * 31 ) = $23,808 per month. With reserved instances, it is ( 16 * ( $2192 + ((( 0.7 * 24 ) * 365 ) * 3 ))) / 36 = $8938 per month for a 3 year term. Counting at 3TB, the 3 year TCO is $965K at EC2. AWS has volume discounts but they start higher than this; ( 3 * ( 16 * $2K )) = $96K reserved host premium is under $250K. So if you do not even exceed their first volume discount threshold, it does not look likely you can cut a special deal with AWS. (The AWS prices are calculated with the high memory instances, approximately 64GB usable RAM each. The slightly better CC2 instance is a bit more expensive.) Yet another experiment to make is whether a system as outlined will even run at anywhere close to the performance of physical equipment. This is uncertain; clouds are not for speed, based on what we have seen. They make the most sense when the monthly bill is negligible in relation to the cost of a couple of days of human time.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>There was a talk (Linked Data and Live Querying for Enabling Support Platforms for Web Dataspaces (<a href="https://sites.google.com/site/desweb2012/parreira.pdf?attredirects=0" class="absuri" id="link-id0x27f2bbe0">Slides (PDF)</a>); Jürgen Umbrich, Marcel Karnstedt, Josiane Xavier Parreira, Axel Polleres and Manfred Hauswirth) at the <a href="https://sites.google.com/site/desweb2012/" class="absuri" id="link-id0x299b5230">Data Engineering Meets the Semantic Web (DESWEB)</a> workshop at <a href="http://www.icde12.org/Site/" class="absuri" id="link-id0x2928bfd8">ICDE</a> last week about the problems of caching LOD, whether attempted by <a href="http://sindice.com/" class="absuri" id="link-id0x278876b8">Sindice</a> or <a href="http://www.openlinksw.com" class="absuri" id="link-id0x2a4ba140">OpenLink</a>&#39;s <a href="http://lod.openlinksw.com/" class="absuri" id="link-id0x299ed1f0">LOD Cloud Cache</a>. The conclusion was that OpenLink covered a bit more of the test data sets and that Sindice was maybe better up to date on the ones that it covered but that neither did it very well. The data sets were random graphs of user FOAF profiles and such collected from some Billion Triples Data set, thus not data that is likely to have commercial value, except in huge quantities maybe for some advertising, except that click streams and the like are much more valuable.</p>
<p>Being involved with at least one of these, and being in the audience, I felt obligated to comment. The fact is, neither OpenLink&#39;s LOD Cloud Cache nor Sindice is a business, and there is not a business model which could justify keeping them timely on the web crawls they contain. Doing so is easy enough, if there is a good enough reason.</p>
<p>The talk did make a couple of worthwhile points: The data does change; and if one queries entities, one encounters large variation in change-frequency across entities and their attributes.</p>
<p>The authors suggested having a piece of middleware decide which things can be safely retrieved from a copy and which have to be retrieved from the source. Not too much is in fact known about the change frequency of the data, except that it changes, as the authors pointed out.</p>
<p>The crux of the matter is that the thing that ought to know this best is the query processor at the LOD warehouse. For client-side middleware to split the query, it needs access to statistics that it must get from the warehouse or keep by itself. Of course, in concrete application scenarios, you go to the source if you ask about the weather or traffic jams, and otherwise go to the warehouse based on application-level knowledge.</p>
<p>But for actual business intelligence, one needs histories, so a search engine with only the present is not so interesting. At any rate, refreshing the data should leave a trail of past states. Exposing this for online query would just triple the price, so we forget about that for now. Just keeping an append-only table of history is not too much of a problem. One may make extracts from this table into a relational form for specific business questions. There is no point doing such analytics in RDF itself. One would have to just try to see if there is anything remotely exploitable in such histories. Making a history table is easy enough. Maybe I will add one.</p>
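<p>As a rough sketch of what such an append-only history could look like, the following keeps one row per triple that appears or disappears at each refresh. The table layout and the names <tt>quad_history</tt>, <tt>refresh</tt>, and <tt>state_as_of</tt> are assumptions made for the illustration and are not an existing Virtuoso feature; SQLite stands in for the database.</p>

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE quad_history "
           "(g TEXT, s TEXT, p TEXT, o TEXT, op TEXT, ts INTEGER)")

def refresh(graph, old, new, ts):
    """Append one row per triple that appeared or disappeared in this crawl."""
    rows = [(graph, *t, "add", ts) for t in sorted(new - old)]
    rows += [(graph, *t, "del", ts) for t in sorted(old - new)]
    db.executemany("INSERT INTO quad_history VALUES (?, ?, ?, ?, ?, ?)", rows)

def state_as_of(graph, ts):
    """Replay the trail up to `ts` to rebuild a past state of the graph."""
    state = set()
    for s, p, o, op in db.execute(
            "SELECT s, p, o, op FROM quad_history "
            "WHERE g = ? AND ts <= ? ORDER BY ts, rowid", (graph, ts)):
        if op == "add":
            state.add((s, p, o))
        else:
            state.discard((s, p, o))
    return state

v1 = {("ex:alice", "foaf:nick", "al")}
v2 = {("ex:alice", "foaf:nick", "ally"), ("ex:alice", "foaf:knows", "ex:bob")}
refresh("ex:profile", set(), v1, ts=1)
refresh("ex:profile", v1, v2, ts=2)
```

<p>Nothing is ever updated in place, so the table only grows with the amount of change, and relational extracts for a specific business question are plain <tt>SELECT</tt>s over it.</p>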
<p>Let us now see what it would take to operate a web crawl cache that would be properly provisioned, kept fresh, and managed. We base this on the Sindice crawl sizes and our experiments on these; the non-web-crawl LOD Cloud Cache is not included.</p>
<p>From previous experience we know the sizing: 5Gt/144GB RAM. Today&#39;s best price point is on 24-DIMM E5 boards, so 192GB RAM, or 6.67Gt. A unit like that (8TB HDD, 0.5TB SSD, 192GB RAM, 12 core E5, <a href="http://dbpedia.org/page/InfiniBand" class="absuri" id="link-id0x27eb8158">InfiniBand</a>) costs about $6800.</p>
<p>The Sindice crawl is now about 20Gt, so $28K of gear (768GB RAM) is enough. Let us count this 4 times: 2x for anticipated growth; and 2x for running two copies -- one for online, and one for batch jobs. This is 3TB RAM. Power is 16x500W = 8KW, which we could round to 80A at 110V. Colocation comes to $500 for the space, and $1200 per month for power; make it $2500 per month with traffic included.</p>
<p>At this rate, 3 year TCO is <tt>$120K + ( 36 * $2.5K ) = $210K</tt>. This takes one person half time to operate, so this is another $50K per year.</p>
<p>We do not count software development in this, except some scripting that should be included in the yearly $50K DBA bill.</p>
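<p>The arithmetic of the last few paragraphs can be replayed as a short script. This is only a sketch: the function and key names are made up for illustration, the base count of 4 boxes and the rounded $120K hardware budget are taken from the text rather than derived, and the three-year figure including the DBA is simply our own sum of the numbers quoted above.</p>

```python
def colo_plan(base_units=4, growth=2, copies=2):
    """Replay the colocation sizing and 3 year TCO quoted in the text."""
    units = base_units * growth * copies      # 4 boxes, counted 4 times
    ram_gb = units * 192                      # 192GB RAM per box
    power_w = units * 500                     # 500W per box
    amps_110v = power_w / 110                 # about 73A; the text rounds up to 80A
    hardware_list_usd = units * 6800          # list price before rounding
    hardware_budget_usd = 120_000             # rounded budget used in the text
    colo_usd = 36 * 2500                      # 36 months of space, power, traffic
    tco_usd = hardware_budget_usd + colo_usd
    # Assumption: the half-time operator is simply added on top for 3 years.
    tco_with_dba_usd = tco_usd + 3 * 50_000
    return {
        "units": units,
        "ram_gb": ram_gb,
        "power_w": power_w,
        "amps_110v": amps_110v,
        "hardware_list_usd": hardware_list_usd,
        "tco_usd": tco_usd,
        "tco_with_dba_usd": tco_with_dba_usd,
    }


if __name__ == "__main__":
    for key, value in colo_plan().items():
        print(key, value)
```

<p>Changing <tt>growth</tt> or <tt>copies</tt> shows how quickly the RAM and power figures move; the hardware budget stays a fixed input here because the text rounds it by hand.</p>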
<p>Under what circumstances is such a thing profitable? Or can such a thing be seen as a marketing demo, to be paid for by license or service sales?</p>
<p>A third party can operate a system of this sort, but then the cost will be dominated by software licenses if running on Virtuoso cluster.</p>
<p>For comparison, one TB of RAM at EC2 costs <tt>((( 16 * $2 ) * 24 ) * 31 ) = $23,808</tt> per month. With reserved instances, it is <tt>( 16 * ( $2192 + ((( 0.7 * 24 ) * 365 ) * 3 ))) / 36 = $8938</tt> per month for a 3 year term. Counting at 3TB, the 3 year TCO is $965K at EC2. AWS has volume discounts but they start higher than this; the <tt>( 3 * ( 16 * $2K )) = $96K</tt> reserved host premium is under $250K. So if you do not even exceed their first volume discount threshold, it does not look likely you can cut a special deal with AWS.</p>
<p>(The AWS prices are calculated with the high memory instances, approximately 64GB usable RAM each. The slightly better CC2 instance is a bit more expensive.)</p>
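<p>For readers who want to vary the inputs, here is a minimal sketch of the EC2 side of the comparison. The function names are hypothetical. The on-demand figure is evaluated from the expression above; the reserved monthly price of $8938 per TB is taken as stated in the text and not recomputed.</p>

```python
def on_demand_monthly_usd(instances=16, hourly_usd=2, days=31):
    """On-demand cost of one TB of RAM: 16 instances of about 64GB each."""
    return instances * hourly_usd * 24 * days


def three_year_tco_usd(monthly_per_tb_usd, terabytes=3, months=36):
    """Total cost over the term for the given amount of RAM."""
    return monthly_per_tb_usd * terabytes * months


def reserved_premium_usd(terabytes=3, instances_per_tb=16, premium_usd=2000):
    """Up-front reserved instance premium, using the rounded $2K from the text."""
    return terabytes * instances_per_tb * premium_usd


if __name__ == "__main__":
    print(on_demand_monthly_usd())        # 23808
    print(three_year_tco_usd(8938))       # 965304
    print(reserved_premium_usd())         # 96000
```

<p>Set against the $210K colocation figure above, the reserved EC2 total is roughly 4.6 times higher, before counting anyone to operate either setup.</p>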
<p>Yet another experiment to make is whether a system as outlined will even run at anywhere close to the performance of physical equipment. This is uncertain; clouds are not for speed, based on what we have seen. They make the most sense when the monthly bill is negligible in relation to the cost of a couple of days of human time.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2012-04-17#1709">
  <rss:title>ICDE 2012 (post 2 of 6) - LOD Column Store Experiences and Sizing</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2012-04-17T19:37:58Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We have played around with LOD data sets and Virtuoso Column Store for the past several months. I will here give a few numbers and comment on some different platform comparisons that we have made. The answer at the end of this is how to size a system for often-changing web-style data. The conclusion is a data-to-RAM ratio that gives an acceptable working set without driving the price up by forcing 100% RAM residence. The experiment is loading Sindice web crawls. The platform is 2 x Xeon 5520 and 144G RAM. The initial load rate is 180-200Kt/s, and drops to 100Kt/s at 5Gt because of I/O. The system is Virtuoso Column Store configured to run as 4 processes and 32 partitions, all on the same box. After 5Gt, we see just more I/O and going further is not relevant; one runs CPU-bound or not at all. We use 4 Crucial SSDs in the setup. The hot structures like the RDF quad indices are on SSD, and the cold ones are on hard disk. A cold structure is a write-only index like the dictionary of literals (id to lit). For bulk load, SSDs turn out not to be particularly useful. For a cold start on the other hand, SSDs cut warmup time of 144G RAM from over half an hour to a couple of minutes. It is possible that Intel SSDs would also help with bulk load, but this has not been tried. The SSD problem during bulk load is that these do not write very fast, and while there are writes in queue, read latency goes up; so under a constant write load, the SSD&#39;s famous instantaneous random read no longer works. The fragment considered in the example is 4.95Gt: 8.1M pages worth of quads; 12.7M of literals and IRIs; and 4.71M of full text index. A page is 8KB. The files on disk contain empty pages, but these do not matter since they do not take up RAM. The quad indices take 13.4 bytes/quad. The row-wise equivalent used to be 38 or so bytes/quad with similar data. 
Two-thirds of the IRI and literal string data can benefit from column-wise stream compression. (This was not used but if it were, we could count on a 50% drop in size for the data affected, so instead of 12.7M pages, we could maybe get 8.5M on a good day. This could be worth doing but is not a priority.) The system was configured to have 12M database pages in RAM, so a little under half the database pages of the set fit in RAM at one time; thus one cannot call this a memory-only setup. Due to the locality in the unusually non-local data, this is as far as secondary storage can reach without becoming an over-2x slowdown. In practice, we are talking about under 1% of rows accessed coming from secondary storage, but that alone means half throughput. We note that this data set represents the worst that we have seen. It has 129M distinct graphs, 38 t/g. Regular data like the synthetic benchmark sets take half the space per quad. This is about a third of a Sindice crawl; the other two-thirds look the same as far as we looked. So if you are interested in hosting data like this, you can budget 144GB RAM for every 5Gt. Do not try it with anything less. Budgeting double this is wise, so that you have space to cook the data; this is important since in order to do things with it, one needs to at least copy things for materializing transformations. If you are budget-constrained and hosting very regular content like UniProt, you can budget maybe 144GB RAM for every 10Gt. As for CPU, this does not matter as much as long as you do not go to disk. Just for load speed, DBpedia is loaded in 300s on a cluster of eight (8) dual AMD 2378 boxes at 2.6GHz (total 8 cores per host, so 64 cores in the cluster), and in 945s on one (1) dual Xeon 5520 box at 2.26GHz (total 8 cores in the host). Intel makes much better CPUs, as we see. Both scenarios are 100% in RAM. For even more regular data, the load rates are a bit higher: 1.3Mt/s for the AMD cluster, and 300Kt/s for the Xeon host. 
The interconnect for the AMD cluster is 1 x gigE but this does not matter for load. For CPU-bound cross-partition JOINs, 1 or 2 x gigE is insufficient; 4 x gigE might barely make it; InfiniBand should be safe. When running cross-partition JOINs, a single 8-core Xeon box generates about 300MB/s of interconnect traffic; a gigE connection can maybe take 50MB/s with some luck. Intel E5 does not seem dramatically better than Nehalem, but we will know more once we make measurements with real equipment. Prior to the E5 release, we tried Amazon EC2 CC2 (&quot;Cluster Compute Eight Extra Large Instance&quot; -- 2x8 core E5, 2.66GHz). The results were inconclusive; it never did more than 1.9x better than Xeon 5520 even when running an empty loop (i.e., recursive Fibonacci function in SQL, no cache misses, no I/O). With a database JOIN, 1.3x better is the best we saw. But this must be the fault of Amazon and not of E5. We also tried AMD &quot;Magny-Cours&quot;, but for 32 cores against 8 it never did over 2x better, more like 1.4x often enough, and single thread speed was 50% worse, so not a good buy. We did not find a Bulldozer to try, and did not feel like buying one since the reviews did not promise more core speed over the Magny-Cours. It seems that especially with Column Store, we are truly CPU-bound and not memory-latency- or bandwidth-bound. This is based on the observation that a Xeon 5620 with 2 of 3 memory channels populated loads BSBM data only 10% faster than the same with 1 of 3 channels populated, with CPU affinity set on a dual socket system. So, if you have a choice between a $2K processor (E5-2690) and a $600 processor (E5-2630), buy the cheaper one and get RAM with the money saved. $1440 buys 128G in $90 8G DIMMs. Then buy E5 boards with 24 DIMMs -- one for every 7Gt of web crawl data. If your software licenses are priced per core, getting higher-clock 4-core E5s might make sense. 
While on the subject of bytes and quads/triples, we note that Bigdata®&#39;s recent announcement says up to 50 billion triples per single server. Franz loaded at a good 800+ Kt/s rate up to a trillion triples. One is led to think from the spec that this was with less than full CPU but still with highly local data, considering 1.5 bytes a triple would hit very heavy I/O otherwise. Their statement to the effect of LUBM-like data corroborates this, so we are not talking about exactly the same thing. So if you compare the claims, I am talking about running CPU-bound on the worst data there is. Franz and Bigdata® do not specify, so it is hard to compare. LOD2 should in principle publish actual metrics with at least Bigdata®; Franz is not participating in these races. We may publish some more detailed measurements with more varied configurations later. The thing to remember is minimum 144GB RAM for every 5Gt of web crawls, if you want to load and refresh in RAM.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We have played around with <a href="http://lod.openlinksw.com/" class="absuri" id="link-id0x2a8a8df8">LOD data sets and Virtuoso Column Store</a> for the past several months. I will here give a few numbers and comment on some different platform comparisons that we have made. The answer at the end of this is how to size a system for often-changing web-style data. The conclusion is a data-to-RAM ratio that gives an acceptable working set without driving the price up by forcing 100% RAM residence.</p>
<p>The experiment is loading <a href="http://sindice.com/" class="absuri" id="link-id0x28e1fb38">Sindice</a> web crawls. The platform is 2 x Xeon 5520 and 144G RAM. The initial load rate is 180-200Kt/s, and drops to 100Kt/s at 5Gt because of I/O. The system is Virtuoso Column Store configured to run as 4 processes and 32 partitions, all on the same box. After 5Gt, we see just more I/O and going further is not relevant; one runs CPU-bound or not at all.</p>
<p>We use 4 Crucial SSDs in the setup. The hot structures like the RDF quad indices are on SSD, and the cold ones are on hard disk. A cold structure is a write-only index like the dictionary of literals (id to lit).</p>
<p>For bulk load, SSDs turn out not to be particularly useful. For a cold start on the other hand, SSDs cut warmup time of 144G RAM from over half an hour to a couple of minutes. It is possible that Intel SSDs would also help with bulk load, but this has not been tried. The SSD problem during bulk load is that these do not write very fast, and while there are writes in queue, read latency goes up; so under a constant write load, the SSD&#39;s famous instantaneous random read no longer works.</p>
<p>The fragment considered in the example is 4.95Gt: 8.1M pages worth of quads; 12.7M of literals and IRIs; and 4.71M of full text index. A page is 8KB. The files on disk contain empty pages, but these do not matter since they do not take up RAM. The quad indices take 13.4 bytes/quad. The row-wise equivalent used to be 38 or so bytes/quad with similar data. Two-thirds of the IRI and literal string data can benefit from column-wise stream compression. (This was not used but if it were, we could count on a 50% drop in size for the data affected, so instead of 12.7M pages, we could maybe get 8.5M on a good day. This could be worth doing but is not a priority.) The system was configured to have 12M database pages in RAM, so a little under half the database pages of the set fit in RAM at one time; thus one cannot call this a memory-only setup. Due to the locality in the unusually non-local data, this is as far as secondary storage can reach without becoming an over-2x slowdown. In practice, we are talking about under 1% of rows accessed coming from secondary storage, but that alone means half throughput.</p>
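<p>The page counts above are enough to check the per-quad figure and the RAM residency by hand. The following sketch does so; the variable names are ours, and it assumes an 8KB page means 8192 bytes, which is the reading that reproduces the 13.4 bytes/quad.</p>

```python
PAGE_BYTES = 8 * 1024            # assumption: 8KB page = 8192 bytes

quads = 4.95e9                   # triples (quads) in the fragment
quad_pages = 8.1e6               # pages of quad indices
string_pages = 12.7e6            # pages of literals and IRIs
text_pages = 4.71e6              # pages of full text index
ram_pages = 12e6                 # database pages configured in RAM

# Space taken by the quad indices per quad: about 13.4 bytes.
bytes_per_quad = quad_pages * PAGE_BYTES / quads

# Share of all database pages that fit in RAM: a little under half.
total_pages = quad_pages + string_pages + text_pages
resident_fraction = ram_pages / total_pages

# Two-thirds of the string pages shrinking by 50% leaves about 8.5M pages.
compressed_string_pages = string_pages - (2 / 3) * string_pages * 0.5

if __name__ == "__main__":
    print(round(bytes_per_quad, 1))
    print(round(resident_fraction, 2))
    print(round(compressed_string_pages / 1e6, 1))
```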
<p>We note that this data set represents the worst that we have seen. It has 129M distinct graphs, 38 t/g. Regular data like the synthetic benchmark sets take half the space per quad. This is about a third of a Sindice crawl; the other two-thirds look the same as far as we looked.</p>
<p>So if you are interested in hosting data like this, you can budget 144GB RAM for every 5Gt. Do not try it with anything less. Budgeting double this is wise, so that you have space to cook the data; this is important since in order to do things with it, one needs to at least copy things for materializing transformations.</p>
<p>If you are budget-constrained and hosting very regular content like <a href="http://dbpedia.org/page/UniProt" class="absuri" id="link-id0x29b465b8">UniProt</a>, you can budget maybe 144GB RAM for every 10Gt.</p>
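<p>The rule of thumb in the last two paragraphs fits in a few lines. This is a sketch of the rule as stated, not a planning tool: the function and parameter names are invented here, and the 192GB board size is the 24-DIMM configuration with 8G DIMMs mentioned elsewhere in this post.</p>

```python
import math


def ram_budget_gb(gigatriples, regular=False, cooking_space=False):
    """RAM to budget: 144GB per 5Gt of web crawl, or per 10Gt of regular data."""
    gt_per_144gb = 10 if regular else 5
    ram_gb = 144 * gigatriples / gt_per_144gb
    if cooking_space:
        ram_gb *= 2              # double it to leave room for transformations
    return ram_gb


def boards_needed(ram_gb, board_gb=192):
    """Number of boards of the given RAM size, rounded up."""
    return math.ceil(ram_gb / board_gb)


if __name__ == "__main__":
    crawl = ram_budget_gb(20)
    print(crawl, boards_needed(crawl))                       # 576.0 3
    cooked = ram_budget_gb(20, cooking_space=True)
    print(cooked, boards_needed(cooked))                     # 1152.0 6
    print(ram_budget_gb(20, regular=True))                   # 288.0
```

<p>For a 20Gt crawl this gives 576GB as the floor and 1152GB with room to work, which is why anything planned around the bare minimum leaves no margin for growth.</p>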
<p>As for CPU, this does not matter as much as long as you do not go to disk. Just for load speed, DBpedia is loaded in 300s on a cluster of eight (8) dual AMD 2378 boxes at 2.6GHz (total 8 cores per host, so 64 cores in the cluster), and in 945s on one (1) dual Xeon 5520 box at 2.26GHz (total 8 cores in the host). Intel makes much better CPUs, as we see. Both scenarios are 100% in RAM. For even more regular data, the load rates are a bit higher: 1.3Mt/s for the AMD cluster, and 300Kt/s for the Xeon host.</p>
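<p>To put a number on the comparison, one can normalize the load figures above by core count. This is a rough sketch with names of our own choosing; it ignores the clock difference and whatever the cluster loses to its interconnect, so it overstates the pure CPU gap somewhat.</p>

```python
def core_seconds(seconds, cores):
    """Total core time spent on a load."""
    return seconds * cores


# DBpedia load: 300s on 64 AMD cores, 945s on 8 Xeon cores.
amd_core_seconds = core_seconds(300, 64)
xeon_core_seconds = core_seconds(945, 8)
dbpedia_per_core_ratio = amd_core_seconds / xeon_core_seconds

# More regular data: 1.3Mt/s on the AMD cluster, 300Kt/s on the Xeon host.
amd_kt_per_core = 1300 / 64
xeon_kt_per_core = 300 / 8
regular_per_core_ratio = xeon_kt_per_core / amd_kt_per_core

if __name__ == "__main__":
    print(amd_core_seconds, xeon_core_seconds)       # 19200 7560
    print(round(dbpedia_per_core_ratio, 2))          # 2.54
    print(round(regular_per_core_ratio, 2))          # 1.85
```

<p>So with eight times the cores the cluster loads DBpedia only about 3.2 times faster, and each Xeon core does the work of roughly two to two and a half of the AMD cores.</p>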
<p>The interconnect for the AMD cluster is 1 x gigE but this does not matter for load. For CPU-bound cross-partition <tt>JOIN</tt>s, 1 or 2 x gigE is insufficient; 4 x gigE might barely make it; <a href="http://dbpedia.org/page/InfiniBand" class="absuri" id="link-id0x26d4cec8">InfiniBand</a> should be safe. When running cross-partition <tt>JOIN</tt>s, a single 8-core Xeon box generates about 300MB/s of interconnect traffic; a gigE connection can maybe take 50MB/s with some luck.</p>

<p>
<a href="http://en.wikipedia.org/wiki/Xeon#E5-16xx.2F26xx-series_.22Sandy_Bridge-EP.22" class="absuri" id="link-id0x27df5b40">Intel E5</a> does not seem dramatically better than <a href="http://en.wikipedia.org/wiki/Xeon#Nehalem-based_Xeon" class="absuri" id="link-id0x293e9d50">Nehalem</a>, but we will know more once we make measurements with real equipment. Prior to the E5 release, we tried Amazon EC2 CC2 (&quot;Cluster Compute Eight Extra Large Instance&quot; -- 2x8 core E5, 2.66GHz). The results were inconclusive; it never did more than 1.9x better than Xeon 5520 even when running an empty loop (i.e., recursive <a href="http://dbpedia.org/resource/Fibonacci_function" class="absuri" id="link-id0x2a480aa0">Fibonacci function</a> in SQL, no cache misses, no I/O). With a database <tt>JOIN</tt>, 1.3x better is the best we saw. But this must be the fault of Amazon and not of E5.</p>
<p>We also tried <a href="http://en.wikipedia.org/wiki/List_of_AMD_Opteron_microprocessors#Opteron_6100-series_.22Magny-Cours.22_.2845_nm.29" class="absuri" id="link-id0x27e262a0">AMD &quot;Magny-Cours&quot;</a>, but for 32 cores against 8 it never did over 2x better, more like 1.4x often enough, and single thread speed was 50% worse, so not a good buy. We did not find a <a href="http://en.wikipedia.org/wiki/List_of_AMD_Opteron_microprocessors#Bulldozer_based_Opterons" class="absuri" id="link-id0x29936f10">Bulldozer</a> to try, and did not feel like buying one since the reviews did not promise more core speed over the Magny-Cours.</p>
<p>It seems that especially with Column Store, we are truly CPU-bound and not memory-latency- or bandwidth-bound. This is based on the observation that a Xeon 5620 with 2 of 3 memory channels populated loads BSBM data only 10% faster than the same with 1 of 3 channels populated, with CPU affinity set on a dual socket system.</p>
<p>So, if you have a choice between a $2K processor (E5-2690) and a $600 processor (E5-2630), buy the cheaper one and get RAM with the money saved.
 $1440 buys 128G in $90 8G DIMMs. Then buy E5 boards with 24 DIMMs -- one for every 7Gt of web crawl data. If your software licenses are priced per core, getting higher-clock 4-core E5s might make sense.</p>
<p>While on the subject of bytes and quads/triples, we note that <a href="http://www.systap.com/bigdata.htm" class="absuri" id="link-id0x2784a1d8">Bigdata®</a>&#39;s <a href="http://www.bigdata.com/bigdata/blog/?p=423" class="absuri" id="link-id0x2ac80e68">recent announcement</a> says up to 50 billion triples per single server. Franz loaded at a good 800+ Kt/s rate up to <a href="http://franz.com/agraph/allegrograph/agraph_benchmarks.lhtml" class="absuri" id="link-id0x26c3bea0">a trillion triples</a>. One is led to think from the spec that this was with less than full CPU but still with highly local data, considering 1.5 bytes a triple would hit very heavy I/O otherwise. Their statement to the effect of <a href="http://swat.cse.lehigh.edu/projects/lubm/" class="absuri" id="link-id0x251d4180">LUBM</a>-like data corroborates this, so we are not talking about exactly the same thing.</p>
<p>So if you compare the claims, I am talking about running CPU-bound on the worst data there is. Franz and Bigdata® do not specify, so it is hard to compare. LOD2 should in principle publish actual metrics with at least Bigdata®; Franz is not participating in these races.</p>
<p>We may publish some more detailed measurements with more varied configurations later. The thing to remember is minimum 144GB RAM for every 5Gt of web crawls, if you want to load and refresh in RAM.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2012-04-17#1708">
  <rss:title>ICDE 2012 (post 1 of 6) - LOD2 Plenary</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2012-04-17T19:37:50Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">LOD2&#39;s database contributions are, on one hand, Virtuoso Column Store and Elastic Cluster, and on the other, the demonstration and proof from CWI that indeed all of the relational innovations for which CWI is well known apply to graph/RDF data as well. The value is unquestionable both to Virtuoso users in the short-term, and to the state of science and to all RDF users and vendors in the mid-term. The LOD2 claim of &quot;linking the universe&quot; (my words) will be tested soon enough, after we first put the universe in a bucket. This refers to a real-time quad store of Sindice crawls, plus a warehouse of the LOD data sets. This effort raises a few questions that I will treat in a number of posts to follow, such as -- How do you size a real-time copy of LOD/web data? What does it cost to operate a properly provisioned warehouse of all RDF web crawls? What is done now is under-provisioned and not kept up to date. We are talking about all the RDF on the web in near real time with arbitrary queries. This is very far from the &quot;billion triples&quot; data sets or vertical portals, which are both easy by comparison.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>
<a href="http://lod2.eu/" class="absuri" id="link-id0x1d3c5620">LOD2</a>&#39;s database contributions are, on one hand, Virtuoso Column Store and Elastic Cluster, and on the other, the demonstration and proof from <a href="http://dbpedia.org/resource/Centrum_Wiskunde_&amp;_Informatica" class="absuri" id="link-id0x11356250">CWI</a> that indeed all of the <a href="http://dbpedia.org/resource/Relational_database" class="absuri" id="link-id0x1d6296d8">relational</a> innovations for which CWI is well known apply to <a href="http://dbpedia.org/resource/Graph_%28data_structure%29" class="absuri" id="link-id0x1bcd83d8">graph</a>/<a href="http://dbpedia.org/resource/Resource_Description_Framework" class="absuri" id="link-id0x1befd4a0">RDF</a> data as well.</p>

<p>The value is unquestionable both to Virtuoso users in the short-term, and to the state of science and to all RDF users and vendors in the mid-term.</p>
<p>The LOD2 claim of &quot;<a href="http://www.openlinksw.com/weblog/oerling/?id=1649" class="absuri" id="link-id0x1c161d88">linking the universe</a>&quot; (my words) will be tested soon enough, after we first put the universe in a bucket.
 This refers to a real-time quad store of <a href="http://sindice.com/" class="absuri" id="link-id0x1d3eb050">Sindice</a> crawls, plus a warehouse of the LOD data sets.</p>
<p>This effort raises a few questions that I will treat in a number of posts to follow, such as --</p>
<ul>
<li>How do you size a real-time copy of LOD/web data? </li>
<li>What does it cost to operate a properly provisioned warehouse of all RDF web crawls?</li>
</ul>
<p>What is done now is under-provisioned and not kept up to date. We are talking about all the RDF on the web in near real time with arbitrary queries. This is very far from the &quot;billion triples&quot; data sets or vertical portals, which are both easy by comparison.</p>
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-09-30#1700">
  <rss:title>LOD2 Plenary and Review: Semanticist, Think Database!</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-09-30T21:02:01Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Last week the LOD2 FP7 project had its first review, preceded by its third plenary meeting. Before this, we did, as promised, get the column store and vectored execution capabilities of Virtuoso 7 Single-Server Edition extended to Virtuoso 7 Cluster Edition. More interesting still, we decoupled storage from the database server process, so now database files can migrate between server processes. This means that clusters are now elastic, i.e., new servers can be added to a cluster and the load can be redistributed without reloading the data. These things were long planned, but now are done. Measurements will be published in some weeks, as part of CWI&#39;s continued running of RDF store benchmarks, per the LOD2 plan. Doing the column store and elastic cluster is work enough, so I do not in general participate in support or consultancy or the like. This has some pros and cons. On the plus side, there is a relative lack of noise and a very clear idea of focus. Of course, this work is most highly applied, thus always informed by use cases, thus forgetting what ought to be done out there is not the problem. Rather, the problem is forgetting how things in fact are done as opposed to how they could or should be done. To cut a long story short, it has become clear to me that the DBMS must tell the application developer what to do. Of course, the application developer could also look at performance metrics, but they do not, and explaining these metrics is too much work and yields no lasting benefit. Developers will produce all kinds of performance diagnostic traces if requested, but going through this song and dance can also be avoided by the right automation. So, I will introduce two new product features called Wazzup? and Saywhat? Wazzup? 
is answered by a mood line, like &quot;Heavily disk bound: 100G more memory will give 10x speedup&quot; or &quot;Network bound: Processing in larger batches will give 5x more throughput&quot; and Saywhat? is answered by some commentary on the user&#39;s last action, for example &quot;there is no ?order with o_totalprice &lt; 0&quot; or &quot;there is no property O_misspelledtotallprrice.&quot; Wazzup? is about overall system state, and Saywhat? is about the user session, specifically query plans. But an explanation of a query plan is not understandable, so this will just point out some salient facts, like the reason why the answer comes out empty. The other thing that came to my attention is the fact that a user has no instinctive feel for ETL. A database person takes it for a self-evident truth that data is loaded in bulk, but the application developer does not think of that. Likewise, the line between warehousing and federating is not instinctively felt; actually the question is not even posed in these terms. So one will find Web protocols and end-points and glue code on the app server when one ought to have ETL and adequate hardware for running the consolidated database. Further, under-provisioning of equipment is endemic with semanticists. The Semantic Web gets a needlessly bad rap just because we find too much data on too little equipment. For example, I was surprised to learn that the Linked Geodata demo ran on only 16 GB RAM and 6 processor cores with 2 billion triples and 350 million points in a geo index. Now, even with our greatest space efficiency advances, there is no way this will run from memory. It is not that the Web 2.0 stack is necessarily efficient (we hear the wildest stories of lack of database understanding from that side too), but at least there is a culture of running with enough equipment. Surely when the web-scale data gear (e.g. 
Google Bigtable, Yahoo PNUTS, Amazon Dynamo) was new, by the operators&#39; own admission there was no way for this to be particularly efficient, database-wise. Not if your eventual consistency is a client application to a shared MySQL back-end. For a lookup or single-record-update workload, who cares when there is enough hardware? For analytics, there is the de facto impossibility of doing big joins, but map reduce is for that, all offline. The big web houses have always known how to deal with data; it is the smaller Web 2.0 guys who patch systems together with duct tape and memcache. Even so, the online experience gets created. Semanticism has no part of this outlook, except maybe for Freebase, but then they are from California and now have been inside Google for a while. We quite understand that when one needs to get big data online, one makes a key-value store as a point solution, because this way one owns what one operates, and the time to market is a lot shorter than if one tried building all this inside a general-purpose DBMS. Besides, the people who can in fact do this almost do not exist, and even if one had a whole army of this rare breed, development is not very scalable in a tightly-integrated system like a high-performance DBMS. Still further, to even start, one needs to own the DBMS, meaning that the initial platform must be known through and through. This is an issue even though open source platforms exist. The graph data, semdata, schema-last, RDF, linked data enterprise -- whatever one calls it -- makes the bold proposition of bringing complex-query-at-scale to heterogeneous data. This is a database claim. In the meantime, test deployments are made in defiance of database best practices. This is a bit like test driving a race car in reverse gear and steering by looking in the rear-view mirror. There is also no short-term scalable way to educate people. 
At the LOD2 review, one comment was that an integrated project ought to clearly indicate how to set up the tool chain for good performance, specially as concerns interfaces between the tools. This is very true. Experience shows that developers of tools cannot accurately anticipate what usage patterns will emerge in the field. Therefore, we propose to do better than just documentation; we will make the server recognize the common sources of inefficiency and point the user to the right action. Provisioning and usage patterns: The DBMS ought to know best. Imagine the following conversation: DBMS: Your application does single-triple INSERTs over client-server protocol all day, from a single client. 57% of real time goes in client server latency, 40% in cluster interconnect latency, 2% in compiling the statements, and 1% in doing the work. Use array parameters or bulk load from a file. Operator: My developers use industry-standard Java class libraries with a service-oriented architecture and strictly enforced interfaces. This is called software engineering. Watch out ere you raise your voice against the canon. [Some weeks later, after the load job has gone on for 10 days and gotten a third of the way, developers have discovered that JDBC has array parameters and are trying these.] DBMS: 60% of real time goes into waiting for locks. 10% of transactions get aborted for deadlock. Transactions consist of an average of 10 client-server operations. Use stored procedures; acquire locks in predictable order; do SELECT FOR UPDATE. Throughput will be 4x higher if client-server operations are merged into a single operation. The transactions only INSERT; hence consider bulk load instead. Operator: We are using an enterprise-class three-tier architecture. It has &quot;enterprise&quot; in the name and all the big guys are using it, so it must be scalable. Besides, it is distributed transactions, and distributed computing is the wave of the future. 
You are a cluster yourself, so the pot&#39;s got no business calling the kettle black. [After a while, the data gets loaded with bulk load, but now on a single stream.] DBMS: CPU is at 400% for an INSERT workload; adding more parallel threads will get 4.5x better throughput. [Some time has elapsed and there are Ajax client apps out there trying to use the data.] DBMS: Will you really not give me another 140 GB RAM and 16 more cores? Operator: No, on general principles I will not, shut up. DBMS: Do you know that your page impression takes 3 seconds and anything over 0.25 seconds is visibly slow? 300 GB worth of distinct pages have been accessed in the last 24 hours for 160 GB of RAM. Latency will drop 10x by using SSD; 50x by increasing RAM. Operator: No dice, bucket. Shut up, besides, when I scroll through the data I always use for testing, I get it fast enough, you are just doing this out of greed and self-importance. You are a server among many, just like the mail server; you databases are just pretentious. Currently addressing any of the above sorts of issues takes a long time and involves mostly-avoidable support communication. Questions of this sort do occur. We can probably produce commentary like the above based on logging some 50 numbers, and making some 15 regularly-run reports over these. The patterns to watch out for are well known. No, we will not make a Zippy the Pinhead office assistant; a computer should not try to be cute. This one will talk only in terms of gains from adjusting the deployment or usage patterns. Now, suppose the operator said yes to the request for more cores and memory; then it would be up to the DBMS to deliver. This entails a capacity to redistribute itself automatically, and to give a quantitative report on the success of this measure. This means usage-based repartitioning of the data to equalize load over a cluster. The relevant metric in the above case is the drop in response time. 
On the other hand, the DBMS should also notice if there is clearly unused capacity. This all will be presented as a line in the status report, so there is no extra wizard or workload analyzer that one must remember to run. For programmatic use there are SQL views for the relevant reports. As for ETL, even if the DBMS can detect that it is not being done right, this does not mean that the user will know what to do. Therefore, for all the Web harvesting we support, as well as any import from local file system or Web services, with some RDF-ization, we will simply implement a proper ETL utility that will do things right. Wazzup? can just point the user to that if the workload looks like loading. This will have its own status report giving a load and transform rate and will point out what takes the longest, after everything is duly parallelized and made asynchronous. Beyond these lessons, there is more to say about the review and plenary, we will get to that a bit later. We did promise a new edition of the LOD cache in a couple of months, now on the clustered column-store platform. Look for advances in data discoverability.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Last week the <a href="http://lod2.eu/" id="link-id0x57d7368">LOD2 FP7 project</a> had its first review, preceded by its third plenary meeting.</p>

<p>Before this, we did, <a href="http://www.openlinksw.com/weblog/oerling/?id=1683" id="link-id0x579c950">as promised</a>, get the column store and vectored execution capabilities of Virtuoso 7 Single-Server Edition extended to Virtuoso 7 Cluster Edition. More interesting still, we decoupled storage from the database server process, so now database files can migrate between server processes. This means that clusters are now elastic, i.e., new servers can be added to a cluster and the load can be redistributed without reloading the data.</p>

<p>These things were long planned, but now are done. Measurements will be published in some weeks, as part of CWI&#39;s continued running of RDF store benchmarks, per the LOD2 plan.</p>

<p>Doing the column store and elastic cluster is work enough, so I do not in general participate in support or consultancy or the like. This has some pros and cons. On the plus side, there is a relative lack of noise and a very clear idea of focus. Of course, this work is highly applied and thus always informed by use cases, so forgetting what ought to be done out there is not the problem. Rather, the problem is forgetting how things in fact <i>are</i> done as opposed to how they <i>could or should be</i> done.</p>

<p>To cut a long story short, it has become clear to me that the DBMS must tell the application developer what to do. Of course, the application developer could also look at performance metrics, but they do not, and explaining these metrics is too much work and yields no lasting benefit. Developers will produce all kinds of performance diagnostic traces if requested, but going through this song and dance can also be avoided by the right automation.</p>

<p>So, I will introduce two new product features called <i><b>Wazzup?</b></i> and <i><b>Saywhat?</b></i>
</p>

<p>
<b>Wazzup?</b> is answered by a mood line, like &quot;Heavily disk bound: 100G more memory will give 10x speedup&quot; or &quot;Network bound: Processing in larger batches will give 5x more throughput&quot; and <b>Saywhat?</b> is answered by some commentary on the user&#39;s last action, for example &quot;there is no ?order with o_totalprice &lt; 0&quot; or &quot;there is no property O_misspelledtotallprrice.&quot;</p>

<p>
<b>Wazzup?</b> is about overall system state, and <b>Saywhat?</b> is about the user session, specifically query plans. But an explanation of a query plan is not understandable, so this will just point out some salient facts, like the reason why the answer comes out empty.</p>
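
<p>To make the <b>Saywhat?</b> idea concrete, here is a minimal sketch of one way such a comment could be derived for a query made of triple patterns. It is an illustration only and not how Virtuoso does it: the function names <code>matches</code> and <code>say_what</code>, the in-memory list of triples, and the left-to-right evaluation are all assumptions, and filters such as a price comparison are left out.</p>

```python
# Hypothetical sketch: explain why a conjunctive triple-pattern query comes out empty.
# Terms starting with "?" are variables; everything else is a constant.

def matches(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` matches `triple`; None on failure."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Bind the variable, or check it against its earlier binding.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new

def say_what(patterns, triples):
    """Evaluate patterns left to right; report the first one that leaves no solutions."""
    known_properties = {p for _, p, _ in triples}
    bindings = [{}]
    for pattern in patterns:
        prop = pattern[1]
        if not prop.startswith("?") and prop not in known_properties:
            return "there is no property " + prop
        next_bindings = []
        for b in bindings:
            for t in triples:
                m = matches(pattern, t, b)
                if m is not None:
                    next_bindings.append(m)
        bindings = next_bindings
        if not bindings:
            return "no solution satisfies " + " ".join(pattern)
    return "%d solution(s)" % len(bindings)

data = [
    ("order1", "o_totalprice", "100"),
    ("order1", "o_custkey", "cust7"),
    ("cust7", "c_name", "Alice"),
]
print(say_what([("?o", "O_misspelledtotallprrice", "?p")], data))
print(say_what([("?o", "o_custkey", "?c"), ("?c", "c_name", "Bob")], data))
```

<p>The first call reports the misspelled property; the second reports the pattern at which the set of solutions became empty. A real implementation would presumably work from the query plan and the statistics the optimizer already has, rather than from a scan.</p>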

<p>The other thing that came to my attention is the fact that a user has no instinctive feel for <a href="http://dbpedia.org/page/Extract,_transform,_load" id="link-id0x527eb88">ETL</a>. A database person takes it for a self-evident truth that data is loaded in bulk, but the application developer does not think of that. Likewise, the line between warehousing and federating is not instinctively felt; actually the question is not even posed in these terms. So one will find Web protocols and end-points and glue code on the app server when one ought to have ETL and adequate hardware for running the consolidated database.</p>

<p>Further, under-provisioning of equipment is endemic with semanticists. The Semantic Web gets a needlessly bad rap just because we find too much data on too little equipment. For example, I was surprised to learn that the Linked Geodata demo ran on only 16 GB RAM and 6 processor cores with 2 billion triples and 350 million points in a geo index. Now, even with our greatest space efficiency advances, there is no way this will run from memory.</p>

<p>It is not that the Web 2.0 stack is necessarily efficient (we hear the wildest stories of lack of database understanding from that side too), but at least there is a culture of running with enough equipment. Surely when the web-scale data gear (e.g. Google Bigtable, Yahoo PNUTS, Amazon Dynamo) was new, by the operators&#39; own admission there was no way for this to be particularly efficient, database-wise. Not if your eventual consistency is a client application to a shared MySQL back-end. For a lookup or single-record-update workload, who cares when there is enough hardware? For analytics, there is the <i>de facto</i> impossibility of doing big joins, but map reduce is for that, all offline. The big web houses have always known how to deal with data; it is the smaller Web 2.0 guys who patch systems together with duct tape and memcache. Even so, the online experience gets created.</p>

<p>Semanticism has no part of this outlook, except maybe for Freebase, but then they are from California and now have been inside Google for a while.</p>

<p>We quite understand that when one needs to get big data online, one makes a key-value store as a point solution, because this way one owns what one operates, and the time to market is a lot shorter than if one tried building all this inside a general-purpose DBMS. Besides, the people who can in fact do this almost do not exist, and even if one had a whole army of this rare breed, development is not very scalable in a tightly-integrated system like a high-performance DBMS. Still further, to even start, one needs to own the DBMS, meaning that the initial platform must be known through and through. This is an issue even though open source platforms exist.</p>

<p>The graph data, semdata, schema-last, RDF, linked data enterprise -- whatever one calls it -- makes the bold proposition of bringing complex-query-at-scale to heterogeneous data. This is a database claim.</p>

<p>In the meantime, test deployments are made in defiance of database best practices. This is a bit like test driving a race car in reverse gear and steering by looking in the rear-view mirror.</p>

<p>There is also no short-term scalable way to educate people. At the LOD2 review, one comment was that an integrated project ought to clearly indicate how to set up the tool chain for good performance, especially as concerns interfaces between the tools. This is very true. Experience shows that developers of tools cannot accurately anticipate what usage patterns will emerge in the field. Therefore, we propose to do better than just documentation; we will make the server recognize the common sources of inefficiency and point the user to the right action.</p>

<h3>Provisioning and usage patterns: The DBMS ought to know best.</h3>

<p>Imagine the following conversation:</p>

<p>
<b>DBMS:</b> Your application does single-triple INSERTs over the client-server protocol all day, from a single client. 57% of real time goes to client-server latency, 40% to cluster interconnect latency, 2% to compiling the statements, and 1% to doing the work. Use array parameters or bulk load from a file.</p>

<p>
<b>Operator:</b> My developers use industry-standard Java class libraries with a service-oriented architecture and strictly enforced interfaces. This is called software engineering. Watch out ere you raise your voice against the canon.</p>

<p>
<i>[Some weeks later, after the load job has gone on for 10 days and gotten a third of the way, developers have discovered that JDBC has array parameters and are trying these.]</i>
</p>

<p>
<b>DBMS:</b> 60% of real time goes into waiting for locks. 10% of transactions get aborted for deadlock. Transactions consist of an average of 10 client-server operations. Use stored procedures; acquire locks in predictable order; do SELECT FOR UPDATE. Throughput will be 4x higher if client-server operations are merged into a single operation. The transactions only INSERT; hence consider bulk load instead.</p>

<p>
<b>Operator</b>: We are using an enterprise-class three-tier architecture. It has &quot;enterprise&quot; in the name and all the big guys are using it, so it must be scalable. Besides, it is distributed transactions, and distributed computing is the wave of the future. You are a cluster yourself, so the pot&#39;s got no business calling the kettle black.</p>

<p>
<i>[After a while, the data gets loaded with bulk load, but now on a single stream.]</i>
</p>

<p>
<b>DBMS</b>: CPU is at 400% for an INSERT workload; adding more parallel threads will get 4.5x better throughput.</p>

<p>
<i>[Some time has elapsed and there are Ajax client apps out there trying to use the data.]</i>
</p>

<p>
<b>DBMS</b>: Will you really not give me another 140 GB RAM and 16 more cores?</p>

<p>
<b>Operator</b>: No, on general principles I will not, shut up.</p>

<p>
<b>DBMS</b>: Do you know that your page impression takes 3 seconds and anything over 0.25 seconds is visibly slow? 300 GB worth of distinct pages have been accessed in the last 24 hours for 160 GB of RAM. Latency will drop 10x by using SSD; 50x by increasing RAM.</p>

<p>
<b>Operator</b>: No dice, bucket. Shut up, besides, when I scroll through the data I always use for testing, I get it fast enough, you are just doing this out of greed and self-importance. You are a server among many, just like the mail server; you databases are just pretentious.</p>
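
<p>The first piece of advice in the conversation above, array parameters instead of single-row INSERTs, has roughly the following shape. This is a sketch using Python&#39;s built-in <code>sqlite3</code> module purely because it is self-contained; the table <code>triples</code>, the function names, and the batch size are assumptions. Since SQLite runs in-process, it shows only the difference in calling pattern, not the client-server latency the DBMS is complaining about.</p>

```python
import sqlite3

INSERT_SQL = "INSERT INTO triples (s, p, o) VALUES (?, ?, ?)"

def load_one_by_one(conn, triples):
    """One statement execution and one commit per triple: the pattern to avoid."""
    for t in triples:
        conn.execute(INSERT_SQL, t)
        conn.commit()

def load_with_array_parameters(conn, triples, batch_size=1000):
    """Hand the driver a whole batch of parameter rows; one commit per batch."""
    for start in range(0, len(triples), batch_size):
        batch = triples[start:start + batch_size]
        conn.executemany(INSERT_SQL, batch)
        conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
rows = [("s%d" % i, "p", "o%d" % i) for i in range(2500)]
load_with_array_parameters(conn, rows)
count = conn.execute("SELECT COUNT(*) FROM triples").fetchone()[0]
print(count)
```

<p>With a networked DBMS, the batched form replaces thousands of round trips with a handful, which is where the latency in the first complaint goes. JDBC offers the same idea through batched prepared statements.</p>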

<p>Currently addressing any of the above sorts of issues takes a long time and involves mostly-avoidable support communication. Questions of this sort do occur. We can probably produce commentary like the above based on logging some 50 numbers, and making some 15 regularly-run reports over these. The patterns to watch out for are well known. No, we will not make a Zippy the Pinhead office assistant; a computer should not try to be cute. This one will talk only in terms of gains from adjusting the deployment or usage patterns.</p>
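
<p>As a rough illustration of how little is needed, the sketch below turns a handful of logged counters into one line of commentary. Everything in it is hypothetical: the counter names, the 50% threshold for calling something the dominant bottleneck, and the wording of the advice are assumptions for the example, not the counters or reports the server will actually use.</p>

```python
def wazzup(stats):
    """Return one status line naming the dominant bottleneck, from a dict of counters."""
    wall = stats["wall_seconds"]
    # Share of real time spent waiting on each resource.
    shares = {
        "disk": stats["disk_wait_seconds"] / wall,
        "network": stats["network_wait_seconds"] / wall,
        "lock": stats["lock_wait_seconds"] / wall,
    }
    # How far the recently touched data exceeds available memory.
    missing_gb = max(0, stats["working_set_gb"] - stats["ram_gb"])
    advice = {
        "disk": "working set exceeds RAM by %d GB" % missing_gb,
        "network": "process in larger batches",
        "lock": "acquire locks in a predictable order",
    }
    worst = max(shares, key=shares.get)
    pct = round(100 * shares[worst])
    if pct >= 50:
        return "%s bound (%d%% of real time): %s" % (worst.capitalize(), pct, advice[worst])
    return "No dominant bottleneck"

line = wazzup({
    "wall_seconds": 100,
    "disk_wait_seconds": 90,
    "network_wait_seconds": 5,
    "lock_wait_seconds": 1,
    "working_set_gb": 300,
    "ram_gb": 160,
})
print(line)
```

<p>The input uses the figures from the conversation: 300 GB of distinct pages touched against 160 GB of RAM leaves 140 GB uncovered, which is the amount of memory the DBMS asked for.</p>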

<p>Now, suppose the operator said <i>yes</i> to the request for more cores and memory; then it would be up to the DBMS to deliver. This entails a capacity to redistribute itself automatically, and to give a quantitative report on the success of this measure. This means usage-based repartitioning of the data to equalize load over a cluster. The relevant metric in the above case is the drop in response time. On the other hand, the DBMS should also notice if there is clearly unused capacity.</p>
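
<p>Usage-based repartitioning can be pictured with a simple greedy placement: take the partitions in order of observed load and always give the next one to the least-loaded server. The sketch below is only that, a textbook heuristic under assumed inputs (a per-partition access count and a list of server names); it ignores the cost of moving data and says nothing about the policy the cluster will actually use.</p>

```python
import heapq

def rebalance(partition_load, servers):
    """Assign partitions to servers, heaviest first, always to the least-loaded server."""
    heap = [(0, name) for name in servers]
    heapq.heapify(heap)
    placement = {}
    # Sort by descending load; break ties by partition name for a stable result.
    for part, load in sorted(partition_load.items(), key=lambda kv: (-kv[1], kv[0])):
        total, name = heapq.heappop(heap)
        placement[part] = name
        heapq.heappush(heap, (total + load, name))
    totals = {name: total for total, name in heap}
    return placement, totals

loads = {"p1": 50, "p2": 30, "p3": 20, "p4": 20, "p5": 10, "p6": 10}
_, two_servers = rebalance(loads, ["a", "b"])
_, three_servers = rebalance(loads, ["a", "b", "c"])
print(two_servers)
print(three_servers)
```

<p>With two servers each ends up carrying 70 units of load; adding a third brings the busiest server down to 50. The before-and-after figure for the busiest server is the kind of quantitative report meant above, although the metric that matters to the user is the resulting drop in response time.</p>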

<p>All of this will be presented as a line in the status report, so there is no extra wizard or workload analyzer that one must remember to run. For programmatic use, there are SQL views for the relevant reports.</p>

<p>As for ETL, even if the DBMS can detect that it is not being done right, this does not mean that the user will know what to do. Therefore, for all the Web harvesting we support, as well as any import from the local file system or from Web services, with some RDF-ization, we will simply implement a proper ETL utility that will do things right. <b>Wazzup?</b> can just point the user to that if the workload looks like loading. This will have its own status report giving a load and transform rate and will point out what takes the longest, after everything is duly parallelized and made asynchronous.</p>

<p>Beyond these lessons, there is more to say about the review and plenary; we will get to that a bit later. We did promise a new edition of the LOD cache in a couple of months, now on the clustered column-store platform. Look for advances in data discoverability.</p>
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-07-26#1697">
  <rss:title>GDB for the Data Driven Age (STI Summit Position Paper)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-07-26T13:37:26Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Note: The following was written prior to the event, but was not posted until later due to human error. The Semantic Technology Institute (STI) is organizing a meeting around the questions of making semantic technology deliver on its promise. We were asked to present a position paper (reproduced below). This is another recap of our position on making graph databasing come of age. While the database technology matters are getting tackled, we are drawing closer to the question of deciding actually what kind of inference will be needed close to the data. My personal wish is to use this summit for clarifying exactly what is needed from the database in order to extract value from the data explosion. We have a good idea of what to do with queries but what is the exact requirement for transformation and alignment of schema and identifiers? What is the actual use case of inference, OWL or other, in this? It is time to get very concrete in terms of applications. We expect a mixed requirement but it is time to look closely at the details. GDB for the Data Driven Age Databases and knowledge representation both have decades of history, but to date the exchange of ideas and techniques between these disciplines has been limited. The intuition that there would be value in greater cooperation has not failed to occur to researchers on either side; after all, both sides deal with data. From this, we have seen deductive databases emerge, as well as more recently &quot;database friendly&quot; profiles of OWL. In this position paper we will examine what, in the most concrete terms, is needed in order to bring leading edge database technology together with expressive querying and reasoning. This draws on our experience in building Virtuoso, one of today&#39;s leading graph data stores. 
Following this, we argue for the creation of benchmarks and challenges that in fact do reflect reality and facilitate open and fair comparison of products and technologies. Data integration is often mentioned as the motivating use case for GDB, commonly popularized today as RDF. Database research has over the past few years produced great advances for business intelligence (i.e., complex queries and read-mostly workloads). These advances are typified by compressed columnar storage and architecture-conscious execution models, mostly based on the idea of always processing multiple sets of values in each operation (vectoring). With these techniques, raw performance with relatively simple schemas and regular data (e.g., TPC-H) is no longer a barrier to extracting value from data. A similar breakthrough has not been seen on the semantics side. Data integration still requires manual labor. Publishing GDB datasets is a good and necessary intermediate stage, but producing these datasets from diverse sources is not fundamentally different from doing the same work without GDB or RDF. Even so, GDB and RDF serve as a catalyst for a culture of publishing datasets. GDB, as a base model for integration, offers the following benefits over a purely relational result format: All entities have globally unique identifiers. Any statements may be associated ad hoc to any entities. These statements can be scoped into graphs according to their provenance, time, validity, etc. Obtaining this flexibility on a relational basis would simply require moving to a graph-like representation with essentially one-row-per-attribute. Indeed, we see key-value stores being used in online applications with high volatility of schema (e.g., social networks, search); and we also see relational applications making provisions for post-hoc addition of per-entity attributes (i.e., associating a bag of mixed non-first normal form data with entities). 
The benefits of a schema-last approach are recognized in many places. GDB seems a priori a fit for all these requirements, thus how will it claim its place as a solution? The first part of the answer lies in learning all the relevant database lessons. The second part lies in eliminating the impedance mismatch between querying and reasoning. The third and most important part consists of substantiating these claims in a manner that is understandable to the relevant publics, finally leading to the creation of a semantics-aware segment of the database industry. We will address each of these aspects in turn. GDB and RDB The problem is divided into storage format, execution, and query optimization. For the first two, Daniel Abadi&#39;s renowned Ph.D. thesis holds most of the keys. Space efficiency is especially important for Linked Data, since data is often voluminous, and many datasets have to be brought together for integration. Access patterns are also unpredictable, with indexed-random-access predominating, as opposed to RDB BI workloads where sequential scans and hash joins represent the bulk of the work. However, we find that a sorted column-wise compressed representation of Linked Data with a single quad table for all statements gives excellent space efficiency and good random access as well as random insert speed. The space efficiency is close to par with the equivalent column-wise relational format, since three of the four columns of the quad table compress to almost nothing. As many sort orders as are necessary may be maintained, but we find that two are enough, with some extra data structures for dealing with queries where the predicate is unspecified. The details are found in our VLDB 2010 Semdata workshop paper, Directions and Challenges for Semantically Linked Data. Since GDB/RDF is a model typed at run time, the engine must support an &quot;ANY&quot; data type for columns and query variables, where values on successive rows may be of different types. 
This is a straightforward enhancement. Vectored execution is traditionally associated with column stores because the per-row access cost is relatively high, thus needing to access many nearby rows at a time in order to amortize the overhead. Aside from this, vectored execution provides many opportunities for parallelism, from the instruction level all the way to threading and distributed execution on clusters, thus some form of execution on large numbers of concurrent query states is needed for RDF stores, just as it is needed for RDBMSs. Query optimization for GDBMS is similar to that for RDBMS, except that the statistics can no longer be collected by column and table, but must rather apply to individual entities and ranges of a single quad table. This can be provided through run-time sampling of the database based on constants in the query being optimized. This may take into account trivial inference such as expanding properties into the set of their sub-properties and the like. Beyond this, interleaving execution and optimization (as in ROX) seems to offer limitless possibilities, especially when inference is introduced, making optimizer statistics less predictive. In summary, starting with an RDBMS and going to GDB entails changes to all parts of the engine, but these changes are not fundamental. One does need to own the engine, however; otherwise the expertise for efficiently implementing these changes will not exist. Essentially any DBMS technique may be translated to a GDB use case, if its application can be decided at run-time. GDB may be schema-less, yet most datasets have fairly regular structure; the question is simply to reconstruct the needed statistics and schema information from the data on an as-you-go basis. Techniques with high up-front cost, like constructing specially ordered materializations for optimizing specific queries, are harder to deploy but still conceivable for GDB also. 
RDB and Inference Compared to the straightforwardly performance oriented world of database engines, the contours of the landscape become less defined when moving to inference. Databases, whether relational or schema-less all perform roughly the same functions but inference is more diverse. We include here also techniques like machine learning and meta-reasoning for guiding reasoning, although these might not strictly fit the definition. As we posit that data integration is the motivating use case for GDB as opposed to RDB (Relational Database Model), we must ask which modes of inference are actually required for data integration. Further, we need to ask whether these inferences ought to be applied as a preprocessing step (ETL or forward chaining), or as needed (backward chaining). Some low-hanging fruit can be collected by simply constructing class or property hierarchies; e.g., in the data at hand, the following properties have the meaning of company name, and the following classes have the meaning of company. We have found that such techniques can be efficiently supported at run-time, without materialization, if the support is simply built into the engine, which is in itself straightforward as long as one controls the engine. The same applies to trivial identity resolution, such as owl:sameAs or resolution of identity based on sharing an inverse-functional property value. These things take longer at run-time, but if one caches and reuses the result, one can get around materialization. We do not believe in weak statements of identity, as in X is similar to Y, since the meaning of similarity is entirely contextual. X and Y may or may not be interchangeable depending on the application; thus the statement on identity needs to be strong, but it must be easy to modify the grounds on which such a statement is made. 
This is a further argument for why one should not automatically materialize consequences of identity, particularly if dealing with web data where identity is especially problematic. Real-world problems are however harder than just bundling properties, classes, or instances into sets of interchangeable equivalents, which is all we have mentioned thus far. There are differences of modeling (&quot;address as many columns in customer table&quot; vs. &quot;address normalized away under a contact entity&quot;), normalization (&quot;first name&quot; and &quot;last name&quot; as one or more properties; national conventions on person names; tags as comma-separated in a string or as a one-to-many), incomplete data (one customer table has family income bracket, the other does not), diversity in units of measurement (Imperial vs. metric), variability in the definition of units (seven different things all called blood pressure), variability in unit conversions (currency exchange rates), to name a few. What a world! If data exists, the conversion questions are often answerable but their answer depends on context -- e.g., date of transaction for currency exchange rate; source of data for the definition of blood pressure. Alongside these, there remain issues of identity, e.g., depending on the perspective, a national subsidiary is or is not the same entity as the parent company, companies with the same name can be entirely unrelated in different jurisdictions. It appears that we may need a multi-level approach, combining different techniques for different phases of the integration process. We do not a priori believe that using SQL VIEWs for unit and modeling conversion, and then OWL for unifying terminology on top of this, were the whole solution. 
Even if this were the solution, the pipeline from the relational sources to SPARQL and OWL needs to be optimized for real-world BI information volumes, and the query language needs to be able to express the business questions and needs to interface with the reporting tools the analyst has come to expect. Our answer so far consists of a SPARQL extension with non-recursive rules, roughly equivalent to SQL VIEWs in expressive power, tightly integrated to the query engine. There is also limited support for recursion through transitive subqueries; thus one can compactly express things like &quot;all parts of all assemblies and subassemblies must satisfy applicable safety requirements, where the requirements depend on the type of the part in question.&quot; This is only an intermediate step. We believe that a database-scale generic inference engine with at least Datalog power, with second-order extensions like computed predicates, is needed, executing inside the DBMS, benefiting from the whole array of optimizations database-science expects of execution engines, as part of the answer. This will not relieve the analyst of having to consider that the currency rates in effect at the time of conversion must be taken into account when calculating profits, but this will at least make expressing this and similar pieces of context more compact. We note that time-to-answer has historically won over raw performance. This was also the case for RDBMS when these were the fresh challenger to the CODASYL incumbents, just as was the case with the adoption of high-level languages. The key is that the raw performance must be sufficient for the real world task. With the adoption of the database lessons outlined in the previous section, we believe this to be the case for GDB (and thus, RDF). Substantiating the Claims Benchmarks have a stellar record for improving any metric they measure. 
The question is, how can we make a metric that measures GDB&#39;s ability to deliver on its claim to fame -- time-to-answer for big data -- with all the integration and other complexities this entails? So far, GDB benchmarks have consisted of workloads where RDBMS are clearly better (e.g., LUBM, or the Berlin SPARQL Benchmark). This does not remove their usefulness for GDB, but does not constitute a GDB selling point, either. We suggest a dual approach. The first part is demonstrating that GDB is scalable for BI: We take the industry standard decision support benchmark TPC-H, which is very favorable to RDB and quite unfavorable to GDB, and show that we can tackle the workload at reasonable cost. If TPC-H is all one wants, an RDBMS will stay a better fit, but then this benchmark does not capture any of the heterogeneity, schema evolution, or other such requirements faced by real-world data warehouses. This is still a qualification test, not the selling point. The issue of benchmark is inextricably tied to the issue of messaging. There must be a compelling story, with which the IT community can identify. Further, the benchmark must capture real-world challenges in the area of interest. With all this, the benchmark should not be too expensive to run. Here too, a multistage approach suggests itself. Our tentative answer to this question is the Social Intelligence Benchmark (SIB), developed together with CWI and other partners in the LOD2 consortium. This simulates a social network and combines an online workload with complex analytics. This benchmark should cover all of the target areas of the LOD2 project, so that the project itself generates its own metric of success. The project has clear data integration targets, especially as applies to Web and Linked Data. Questions of integration with enterprise sources need to be further developed; for example, comparing CRM data with extractions from the online conversation space for market research. 
Data integration will invariably involve human effort, and the area cannot be satisfactorily covered with metrics of scale and throughput alone. Development time, accuracy of results, and cost of maintenance are all factors. Furthermore, the task being modeled must correspond to reality, still without being too domain-specific or prohibitively time-consuming to implement. Conclusions The data driven world will increase rewards for efficiency in data integration. We believe that such efficiency crucially depends on semantics. Real world requirements just might throw the database and AI communities together with enough heat and pressure for fusion to ignite, allegorically speaking. Without a clear and present need, the geek world analog of electrostatic repulsion will keep the communities separate, as has been the case thus far, and no new, qualitatively-different element will arise. Efforts such as this STI Summit and the LOD2 Project are needed for setting directions and communicating the requirement to the research world. In our fusion analogy, this is the field which directs the nuclei to collide. Once there is an actual reaction that produces more than it consumes by a sufficient margin, regular business dynamics will take over, and we will have an industry with several products of comparable capability, as well as a set of metrics, all to the benefit of the end user. References TPC-H results pages Daniel Abadi&#39;s Ph.D. Thesis, Query Execution in Column-Oriented Database Systems ( PDF ) Our VLDB 2010 Semdata workshop paper, Directions and Challenges for Semantically Linked Data ( HTML | PDF ) CWI&#39;s ROX: Run-time Optimization of XQueries ( PDF ) The LOD2 Project web site</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>
<i><b>Note:</b> The following was written prior to the event, but was not posted until later due to human error.</i>
</p>

<p>The <a href="http://sti2.org/" id="link-id0x261e3798">Semantic Technology Institute</a> (<a href="http://sti2.org/" id="link-id0x243dac30">STI</a>) is organizing <a href="http://summit2011.sti2.org/" id="link-id0x25fc4e68">a meeting</a> around the questions of making semantic technology deliver on its promise. We were asked to present a position paper (reproduced below). This is another recap of our position on making graph databasing come of age. While the database technology matters are getting tackled, we are drawing closer to the question of deciding actually what kind of inference will be needed close to the data. My personal wish is to use this summit for clarifying exactly what is needed from the database in order to extract value from the data explosion. We have a good idea of what to do with queries but what is the exact requirement for transformation and alignment of schema and identifiers? What is the actual use case of inference, OWL or other, in this? It is time to get very concrete in terms of applications. We expect a mixed requirement but it is time to look closely at the details.</p>


<h3>GDB for the Data Driven Age</h3>

<p>Databases and knowledge representation both have decades of history, but to date the exchange of ideas and techniques between these disciplines has been limited. The intuition that there would be value in greater cooperation has not failed to occur to researchers on either side; after all, both sides deal with data. From this, we have seen deductive databases emerge, as well as more recently &quot;database friendly&quot; profiles of OWL.</p>

<p>In this position paper we will examine what, in the most concrete terms, is needed in order to bring leading edge database technology together with expressive querying and reasoning. This draws on our experience in building <a href="http://virtuoso.openlinksw.com/" id="link-id0x240cdd28">Virtuoso</a>, one of today&#39;s leading <a href="http://dbpedia.org/page/Graph_database" id="link-id0x24ceaae0">graph data stores</a>. Following this, we argue for the creation of benchmarks and challenges that in fact do reflect reality and facilitate open and fair comparison of products and technologies.</p>

<p>Data integration is often mentioned as the motivating use case for GDB, commonly popularized today as RDF. Database research has over the past few years produced great advances for business intelligence (i.e., complex queries and read-mostly workloads). These advances are typified by compressed columnar storage and architecture-conscious execution models, mostly based on the idea of always processing multiple sets of values in each operation (vectoring). With these techniques, raw performance with relatively simple schemas and regular data (e.g., TPC-H) is no longer a barrier to extracting value from data.</p>

<p>A similar breakthrough has not been seen on the semantics side. Data integration still requires manual labor. Publishing GDB datasets is a good and necessary intermediate stage, but producing these datasets from diverse sources is not fundamentally different from doing the same work without GDB or RDF. Even so, GDB and RDF serve as a catalyst for a culture of publishing datasets.</p>

<p>GDB, as a base model for integration, offers the following benefits over a purely relational result format: </p>

<ul>
<li>All entities have globally unique identifiers.</li>
<li>Any statements may be associated ad hoc to any entities.</li>
<li>These statements can be scoped into graphs according to their provenance, time, validity, etc.</li> </ul>

<p>Obtaining this flexibility on a relational basis would simply require moving to a graph-like representation with essentially one-row-per-attribute. Indeed, we see key-value stores being used in online applications with high volatility of schema (e.g., social networks, search); and we also see relational applications making provisions for post-hoc addition of per-entity attributes (i.e., associating a bag of mixed non-first normal form data with entities). The benefits of a schema-last approach are recognized in many places.</p>
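
<p>As a small illustration of the one-row-per-attribute idea, the sketch below unpivots a relational row into statements of the form (subject, property, value, graph). The function name, the way the identifier is built from table name and key, and the graph labels are assumptions made for the example.</p>

```python
def row_to_statements(table, key_column, row, graph):
    """Turn one relational row into (subject, property, value, graph) statements."""
    # The identifier is derived from table and key; a real system would mint an IRI.
    subject = "%s/%s" % (table, row[key_column])
    return [
        (subject, column, value, graph)
        for column, value in row.items()
        if column != key_column and value is not None
    ]

row = {"c_custkey": 7, "c_name": "Alice", "c_phone": None}
statements = row_to_statements("customer", "c_custkey", row, "crm")

# An attribute the relational schema never anticipated, from another source.
statements.append(("customer/7", "twitter_handle", "@alice", "web_harvest"))
print(statements)
```

<p>The missing phone number simply produces no statement, the extra attribute needs no schema change, and the fourth element records where each statement came from.</p>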

<p>GDB seems <i>a priori</i> a fit for all these requirements. How, then, will it claim its place as a solution?</p>

<p>The first part of the answer lies in learning all the relevant database lessons. The second part lies in eliminating the impedance mismatch between querying and reasoning. The third and most important part consists of substantiating these claims in a manner that is understandable to the relevant publics, finally leading to the creation of a semantics-aware segment of the database industry. We will address each of these aspects in turn.</p>

<h4>GDB and RDB</h4>

<p>The problem is divided into storage format, execution, and query optimization. For the first two, Daniel Abadi&#39;s <a href="http://cs-www.cs.yale.edu/homes/dna/papers/abadiphd.pdf" id="link-id0x25ebd568">renowned Ph.D. thesis</a> holds most of the keys. Space efficiency is especially important for Linked Data, since data is often voluminous, and many datasets have to be brought together for integration. Access patterns are also unpredictable, with indexed-random-access predominating, as opposed to RDB BI workloads where sequential scans and hash joins represent the bulk of the work. However, we find that a sorted column-wise compressed representation of Linked Data with a single quad table for all statements gives excellent space efficiency and good random access as well as random insert speed. The space efficiency is close to par with the equivalent column-wise relational format, since three of the four columns of the quad table compress to almost nothing. As many sort orders as are necessary may be maintained, but we find that two are enough, with some extra data structures for dealing with queries where the predicate is unspecified. The details are found in our VLDB 2010 Semdata workshop paper, <i><a href="http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtDirectionsChallengesSemdata" id="link-id0x244a8010">Directions and Challenges for Semantically Linked Data</a></i>. Since GDB/RDF is a model typed at run time, the engine must support an &quot;<code>ANY</code>&quot; data type for columns and query variables, where values on successive rows may be of different types. This is a straightforward enhancement.</p>
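<p>To give an intuition for why sorted columns of a quad table compress so well, here is a minimal sketch using plain run-length encoding. It is not the encoding Virtuoso uses; the sample rows, the predicate-first sort order, and the <code>run_length_encode</code> helper are assumptions made for illustration only.</p>

```python
from itertools import groupby

def run_length_encode(column):
    """Collapse consecutive equal values into [value, run_length] pairs."""
    return [[value, len(list(run))] for value, run in groupby(column)]

# Hypothetical quads as (predicate, subject, object, graph) tuples,
# sorted so that equal predicates end up next to each other.
rows = sorted([
    ("name", "s2", "Bob", "g1"),
    ("age", "s1", "30", "g1"),
    ("name", "s1", "Alice", "g1"),
    ("age", "s2", "25", "g1"),
])

# Split the sorted rows into one sequence per column.
predicate_col, subject_col, object_col, graph_col = zip(*rows)

encoded_predicates = run_length_encode(predicate_col)
encoded_graphs = run_length_encode(graph_col)
```

<p>After sorting, the predicate column collapses to one entry per distinct predicate and the graph column to a single entry, whereas the object column keeps one entry per row. Real column stores combine several schemes (for example, delta encoding of sorted identifiers), which this sketch does not attempt.</p>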

<p>Vectored execution is traditionally associated with column stores because the per-row access cost is relatively high, so many nearby rows must be accessed at a time to amortize the overhead. Aside from this, vectored execution provides many opportunities for parallelism, from the instruction level all the way to threading and distributed execution on clusters. Some form of execution on large numbers of concurrent query states is therefore needed for RDF stores, just as it is needed for RDBMSs.</p>

<p>Query optimization for GDBMS is similar to that for RDBMS, except that the statistics can no longer be collected by column and table, but must rather apply to individual entities and ranges of a single quad table. This can be provided through run-time sampling of the database based on constants in the query being optimized. This may take into account trivial inference such as expanding properties into the set of their sub-properties and the like. Beyond this, interleaving execution and optimization (as in <a href="http://oai.cwi.nl/oai/asset/14193/14193B.pdf" id="link-id0x264cfd20">ROX</a>) seems to offer limitless possibilities, especially when inference is introduced, making optimizer statistics less predictive. </p>

<p>In summary, starting with an RDBMS and going to GDB entails changes to all parts of the engine, but these changes are not fundamental. One does need to own the engine, however; otherwise the expertise for efficiently implementing these changes will not exist. Essentially any DBMS technique may be translated to a GDB use case, if its application can be decided at run-time. GDB may be schema-less, yet most datasets have fairly regular structure; the question is simply to reconstruct the needed statistics and schema information from the data on an as-you-go basis. Techniques with high up-front cost, like constructing specially ordered materializations for optimizing specific queries, are harder to deploy but still conceivable for GDB.</p>

<h4>RDB and Inference</h4>

<p>Compared to the straightforwardly performance-oriented world of database engines, the contours of the landscape become less defined when moving to inference. Databases, whether relational or schema-less, all perform roughly the same functions, but inference is more diverse. We include here also techniques like machine learning and meta-reasoning for guiding reasoning, although these might not strictly fit the definition.</p>

<p>As we posit that data integration is the motivating use case for GDB as opposed to RDB (Relational Database Model), we must ask which modes of inference are actually required for data integration. Further, we need to ask whether these inferences ought to be applied as a preprocessing step (ETL or forward chaining), or as needed (backward chaining). Some low-hanging fruit can be collected by simply constructing class or property hierarchies; e.g., in the data at hand, the following properties have the meaning of company name, and the following classes have the meaning of company. We have found that such techniques can be efficiently supported at run-time, without materialization, if the support is simply built into the engine, which is in itself straightforward as long as one controls the engine. The same applies to trivial identity resolution, such as <code>owl:sameAs</code> or resolution of identity based on sharing an inverse-functional property value. These things take longer at run-time, but if one caches and reuses the result, one can get around materialization.</p>
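<p>The idea of answering through a property hierarchy at run-time, without materializing inferred statements, can be sketched as follows. This is an illustration under assumed data, not a description of how any engine implements it; the property names and the helpers <code>expand_property</code> and <code>values_of</code> are hypothetical.</p>

```python
# Hypothetical property hierarchy: each sub-property maps to its direct super-property.
SUB_PROPERTY_OF = {
    "legalName": "companyName",
    "tradingName": "companyName",
}

# Hypothetical stored triples; nothing inferred is materialized.
TRIPLES = [
    ("acme", "legalName", "ACME Corporation"),
    ("acme", "tradingName", "ACME"),
    ("acme", "founded", "1949"),
]

def expand_property(prop):
    """Return prop plus every property that is transitively a sub-property of it."""
    expanded = {prop}
    changed = True
    while changed:
        changed = False
        for sub, sup in SUB_PROPERTY_OF.items():
            if sup in expanded and sub not in expanded:
                expanded.add(sub)
                changed = True
    return expanded

def values_of(subject, prop):
    """Answer a lookup by widening the predicate at query time."""
    wanted = expand_property(prop)
    return [o for s, p, o in TRIPLES if s == subject and p in wanted]
```

<p>A query for <code>companyName</code> then returns both stored names even though no <code>companyName</code> statement exists. Caching the result of <code>expand_property</code> would be the natural next step, in line with the caching remark above.</p>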

<p>We do not believe in weak statements of identity, as in <i>X is similar to Y,</i> since the meaning of similarity is entirely contextual. X and Y may or may not be interchangeable depending on the application; thus the statement on identity needs to be strong, but it must be easy to modify the grounds on which such a statement is made. This is a further argument for why one should not automatically materialize consequences of identity, particularly if dealing with web data where identity is especially problematic.</p>

<p>Real-world problems are however harder than just bundling properties, classes, or instances into sets of interchangeable equivalents, which is all we have mentioned thus far. There are differences of modeling (&quot;address as many columns in customer table&quot; vs. &quot;address normalized away under a contact entity&quot;), normalization (&quot;first name&quot; and &quot;last name&quot; as one or more properties; national conventions on person names; tags as comma-separated in a string or as a one-to-many), incomplete data (one customer table has family income bracket, the other does not), diversity in units of measurement (Imperial vs. metric), variability in the definition of units (seven different things all called blood pressure), variability in unit conversions (currency exchange rates), to name a few. What a world!</p>

<p>If data exists, the conversion questions are often answerable but their answer depends on context -- e.g., date of transaction for currency exchange rate; source of data for the definition of blood pressure.</p>
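<p>The currency case can be made concrete with a small sketch in which the conversion rate is looked up by transaction date. The rates, dates, and function names below are invented for illustration and do not reflect actual exchange rates.</p>

```python
from bisect import bisect_right
from datetime import date

# Hypothetical EUR-to-USD rates, keyed by the date from which each rate applies.
RATE_DATES = [date(2011, 1, 1), date(2011, 4, 1), date(2011, 7, 1)]
RATES = [1.30, 1.42, 1.45]

def rate_on(day):
    """Return the rate in effect on the given day (latest rate dated on or before it)."""
    index = bisect_right(RATE_DATES, day) - 1
    if index == -1:
        raise ValueError("no rate known for " + day.isoformat())
    return RATES[index]

def eur_to_usd(amount, transaction_day):
    """Convert using the rate that applied on the transaction date, not today's rate."""
    return round(amount * rate_on(transaction_day), 2)
```

<p>The same amount converts differently depending on the transaction date passed in, which is exactly the piece of context that a context-free unit conversion would lose.</p>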

<p>Alongside these, there remain issues of identity. For example, depending on the perspective, a national subsidiary is or is not the same entity as the parent company, and companies with the same name can be entirely unrelated in different jurisdictions.</p>

<p>It appears that we may need a multi-level approach, combining different techniques for different phases of the integration process. We do not <i>a priori</i> believe that using SQL VIEWs for unit and modeling conversion, and then OWL for unifying terminology on top of this, is the whole solution. Even if this were the solution, the pipeline from the relational sources to SPARQL and OWL needs to be optimized for real-world BI information volumes, and the query language needs to be able to express the business questions and needs to interface with the reporting tools the analyst has come to expect.</p>

<p>Our answer so far consists of a SPARQL extension with non-recursive rules, roughly equivalent to SQL VIEWs in expressive power, tightly integrated to the query engine. There is also limited support for recursion through transitive subqueries; thus one can compactly express things like &quot;all parts of all assemblies and subassemblies must satisfy applicable safety requirements, where the requirements depend on the type of the part in question.&quot;</p>
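<p>The assemblies example amounts to a transitive closure followed by a per-type check. The sketch below expresses that logic in Python rather than in SPARQL, so it says nothing about the actual query syntax; the bill of materials, part types, certificates, and helper names are all hypothetical.</p>

```python
from collections import deque

# Hypothetical bill of materials: each assembly mapped to its direct components.
COMPONENTS = {
    "bike": ["frame", "wheel"],
    "wheel": ["rim", "tire"],
}
PART_TYPE = {
    "bike": "assembly", "wheel": "assembly",
    "frame": "metal", "rim": "metal", "tire": "rubber",
}
# Hypothetical safety requirement per part type, and certificates each part holds.
REQUIRED_CERT = {"metal": "fatigue-test", "rubber": "wear-test"}
CERTS = {"frame": {"fatigue-test"}, "rim": {"fatigue-test"}, "tire": set()}

def all_parts(root):
    """Transitive closure of the component relation, starting from root."""
    seen, queue = [], deque([root])
    while queue:
        part = queue.popleft()
        for child in COMPONENTS.get(part, []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen

def violations(root):
    """Parts under root that lack the certificate their type requires."""
    return [
        part for part in all_parts(root)
        if PART_TYPE[part] in REQUIRED_CERT
        and REQUIRED_CERT[PART_TYPE[part]] not in CERTS.get(part, set())
    ]
```

<p>Here <code>violations("bike")</code> walks through the wheel subassembly and reports the tire, whose type requires a certificate it does not hold.</p>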

<p>This is only an intermediate step. We believe that part of the answer is a database-scale generic inference engine with at least Datalog power and second-order extensions like computed predicates, executing inside the DBMS and benefiting from the whole array of optimizations that database science expects of execution engines.</p>

<p>This will not relieve the analyst of having to consider that the currency rates in effect at the time of conversion must be taken into account when calculating profits, but this will at least make expressing this and similar pieces of context more compact.</p>

<p>We note that time-to-answer has historically won over raw performance. This was also the case for RDBMS when these were the fresh challenger to the CODASYL incumbents, just as was the case with the adoption of high-level languages. The key is that the raw performance must be sufficient for the real world task. With the adoption of the database lessons outlined in the previous section, we believe this to be the case for GDB (and thus, RDF).</p>

<h4>Substantiating the Claims</h4>

<p>Benchmarks have a stellar record for improving any metric they measure. The question is, how can we make a metric that measures GDB&#39;s ability to deliver on its claim to fame -- time-to-answer for big data -- with all the integration and other complexities this entails?</p>

<p>So far, GDB benchmarks have consisted of workloads where RDBMS are clearly better (e.g., LUBM, or the Berlin SPARQL Benchmark). This does not remove their usefulness for GDB, but does not constitute a GDB selling point, either.</p>

<p>We suggest a dual approach. The first part is demonstrating that GDB is scalable for BI: We take the industry standard decision support benchmark TPC-H, which is very favorable to RDB and quite unfavorable to GDB, and show that we can tackle the workload at reasonable cost. If TPC-H is all one wants, an RDBMS will stay a better fit, but then this benchmark does not capture any of the heterogeneity, schema evolution, or other such requirements faced by real-world data warehouses. This is still a qualification test, not the selling point.</p>

<p>The issue of benchmarks is inextricably tied to the issue of messaging. There must be a compelling story with which the IT community can identify. Further, the benchmark must capture real-world challenges in the area of interest. With all this, the benchmark should not be too expensive to run. Here too, a multistage approach suggests itself.</p>

<p>Our tentative answer to this question is the Social Intelligence Benchmark (SIB), developed together with CWI and other partners in the LOD2 consortium. This simulates a social network and combines an online workload with complex analytics. This benchmark should cover all of the target areas of the LOD2 project, so that the project itself generates its own metric of success. The project has clear data integration targets, especially as applies to Web and Linked Data. Questions of integration with enterprise sources need to be further developed; for example, comparing CRM data with extractions from the online conversation space for market research.</p>

<p>Data integration will invariably involve human effort, and the area cannot be satisfactorily covered with metrics of scale and throughput alone. Development time, accuracy of results, and cost of maintenance are all factors. Furthermore, the task being modeled must correspond to reality, still without being too domain-specific or prohibitively time-consuming to implement.</p>

<h4>Conclusions</h4>

<p>The data-driven world will increase rewards for efficiency in data integration. We believe that such efficiency crucially depends on semantics. Real-world requirements just might throw the database and AI communities together with enough heat and pressure for fusion to ignite, allegorically speaking. Without a clear and present need, the geek-world analog of electrostatic repulsion will keep the communities separate, as has been the case thus far, and no new, qualitatively different element will arise.</p>

<p>Efforts such as this STI Summit and the LOD2 Project are needed for setting directions and communicating the requirement to the research world. In our fusion analogy, this is the field which directs the nuclei to collide.</p>

<p>Once there is an actual reaction that produces more than it consumes by a sufficient margin, regular business dynamics will take over, and we will have an industry with several products of comparable capability, as well as a set of metrics, all to the benefit of the end user.</p>

<h4>References</h4>

<ul>
 <li>
  <p>TPC-H <a href="http://www.tpc.org/tpch/" id="link-id0x25b910e8">results pages</a>
  </p>
 </li>

<li>
  <p>Daniel Abadi&#39;s Ph.D. Thesis, <i>Query Execution in 
      Column-Oriented Database Systems</i> ( <a href="http://cs-www.cs.yale.edu/homes/dna/papers/abadiphd.pdf" id="link-id0x25f8eeb0">PDF</a> )</p>
</li>

<li>
  <p>Our VLDB 2010 Semdata workshop paper, <i>Directions and Challenges for Semantically Linked Data</i> ( <a href="http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtDirectionsChallengesSemdata" id="link-id0x25f88520">HTML</a> | <a href="http://virtuoso.openlinksw.com/whitepapers/Directions_and_Challenges_for_Semantically_Linked_Data.pdf" id="link-id0x271416b8">PDF</a> )</p>
</li>

<li>
  <p>CWI&#39;s <i>ROX: Run-time Optimization of XQueries</i> ( <a href="http://oai.cwi.nl/oai/asset/14193/14193B.pdf" id="link-id0x2699ac78">PDF</a> )</p>
</li>

<li>
  <p>The <a href="http://lod2.eu/" id="link-id0x25856fc8">LOD2 Project web site</a>
  </p>
</li>
</ul>

]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-07-22#1695">
  <rss:title>The 2011 STI Semantic Summit</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-07-22T15:49:15Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">I was recently at the STI 2011 summit in Riga, Latvia. This is a meeting of senior participants in the semantic web and sem tech scene, organized by STI of Dieter Fensel fame, with board members like Michael Brodie, Mark Greaves, and Jim Hendler. This is substantially about the intersection of AI, knowledge representation, and databases. As we have said before, the database side has not been very prominent in these meetings in the past, but this time we had Peter Boncz of CWI, of MonetDB and VectorWise fame, attending the proceedings. Will DB and AI finally meet? Well, they have met, but how do they get along? Before I try to answer this, let us look at some background. At present, CWI and OpenLink are working together in the LOD2 EU FP7 project, around the general topic of bringing the best of Relational Database (RDB) science to the Graph Database (GDB) world. Virtuoso has for a few months had a column store capability (which is about to be made available for public preview). CWI has a long history of column store work, with MonetDB and Ingres VectorWise as results. OpenLink&#39;s column store implementation is separate in terms of code but is of course influenced by the work at CWI and other published column store results. The plan is to transplant the applicable CWI innovations into the graph context within Virtuoso. These improvements naturally also benefit Virtuoso RDB (SQL), but the LOD2 project is primarily concerned with GDB applications. The RDB yardstick for much of this work is TPC-H, of which we have made a GDB translation. CWI is uniquely qualified as concerns this in light of VectorWise holding some of the top places in the TPC-H charts. Even now, we do in fact run the 22 TPC-H queries in SPARQL against the Virtuoso column store. True, these run faster in SQL against relational tables but we have established a beach head. 
From this initial position, we can incrementally improve the GDB/SPARQL and RDB/SQL functions, and see how close to SQL we get with SPARQL. I will make a separate post commenting on the differences between SQL and SPARQL. So let&#39;s get back to Riga. Mark Greaves said in his opening comments that he would be sick if he once again heard complaining about how bad and un-scalable the tools were. From all the talks, I did get the overall impression that just better databasing for Graph Data is still needed. OK, we have 1-1/2 years of unreleased work just for that about to hit the street; advances are substantial. Along these lines, the people from Bio2RDF pointed out that there still is a cost to publishing query services, specially for complex queries. Well, this cost will be substantially reduced. The takeaway from the meeting is that the most useful thing, for both our public and ourselves, is simply to keep advancing database tech for graph data. In the first instance, this is about launching what we already have; in the second, about going through the CWI record of innovation and adapting this to GDB. The thinking is that once query-answering on some tens-of-billions of triples is easily interactive no matter what question one asks, a tipping point will be reached, and GDB can efficiently play the role of data-melting-pot that has been envisioned for it. This is just a beginning, though. Michael Brodie has on a number of occasions pointed out that that (relational) database guys are only about performance with little or no regard to meaning or even questions of the applicability of the relational model. Peter Boncz then comments back that it can well be that the bulk of IT expenditure worldwide in fact goes into data integration. However, data integration is an &quot;AI-complete&quot; problem with infinite variety and consequent difficulty of measurement. 
So, making better database engines stands a much greater chance of success and has the nicety of relatively unambiguous metrics. Quite so. We are somewhere in the middle. I&#39;d say that GDB is still at the stage where making better databases is a matter of make-or-break and not a matter of cutting already vanishingly-short response times just for the sake of it. We will have progress if we just keep at it; for now, performance is still a basic need and not a luxury. Now that there is all this potentially integrable data published as graphs (most commonly as RDF serializations), what do we do? Someone at the Riga meeting suggested we take a look across the tracks to the RDB world to see what is being done there for data integration. The question is raised, what does GDB have for data integration? The automatic answer that GDB and RDF have OWL is not adequate, as was rightly pointed out by many. Having schema-last, global identifiers, and some culture of vocabulary reuse is nice, but this is only a start. To cite an example, owl:sameAs will not work when entities simply do not align: One database models a product as a parts hierarchy; another does the same but now based on the materials used in the parts. One tree just has a node that is not in the other. Besides, things like string matching (as in extracting area codes from phone numbers) are common, and OWL specifically excludes any such functions. It is now time to look at what will come after all the database advances. In my talk I outlined some things that have or are about to get solutions: Database technology: Applying advances from RDB (specifically columns, vectoring, and some adaptive query execution) will make GDB a possibility for data warehousing at some scale. Benchmarks: These advances will be demonstrable through benchmarking. There is a better suite of benchmarks with many variations of BSBM, an GDB-modified TPC-H, and the upcoming Social Intelligence Benchmark (SIBB) with actual graph data. 
There are the beginnings of an auditing process for result publishing, and a fair chance the semdata world will get its analog of the TPC. After these basics are more or less in hand, we have a vista of more diverse questions: What to do about inference? We do not want OWL or RIF for their own sake; instead we want whatever will declaratively facilitate making sense of data. This is an entirely use-case-driven question. If this can have a reasonably generic answer, we will build it into the engine. Data integration is highly diverse, and tool sets like IBM Infosphere have thousands of modules and functions for different aspects of the problem. To what degree does it make sense to put DI-oriented capabilities into a DBMS? Is it the case that SQL or SPARQL, plus or minus a few details, is as powerful as a language can be while staying application domain-agnostic? In other words, if more powerful reasoning is built into the query language, will the requirements vary so much between application domains that the work is not generally applicable? Datalog is general enough, but can we demonstrate substantially reduced time to answer with big data if this is built into the engine? Berkeley Orders Of Magnitude claims this, even though their claim is not exactly in a database context. We need use cases to refine the actual requirement for inference. In all these questions, we of necessity turn to the user community. In fact we do not follow the usage of these technologies as much as we ought to. One outcome of the Riga summit is a set of public challenges that will hopefully ameliorate this state of matters, to be released soon. The general feeling was that there is more going on on the data side than the AI side. The LOD movement proceeds and lightweight everything predominates, also for knowledge representation. There was some discussion about &quot;pay as you go&quot; integration. 
On the one hand, there is no up-front integration of information systems just for its own sake, so pay as you go is the only kind that exists, system by system, as the need becomes sufficient. On the other hand, each such integration is a process which has its distinct steps and maintenance and within itself it is planned, and thus pre-paid, so to speak. We need more work with the data itself to better understand the matter. The open government data should offer a playground for this and there will be a special challenge around this. Schema.org and Microdata got their share of discussion. As we see it, it is good that search engines make their pre-competitive data open. This is better than, for example, Google wanting retailers to put their catalogs in Google Base. We do not care about the specific syntax in which data is embedded; we support them all. Microdata converts easily to triples, and if one wants to make a tabular extraction for use with relational tools, this too is simple enough. Applications will have to do their own entity resolution, but this is independent of data publication format. All in all, the mood was positive. Mark Greaves noted in his closing remarks that there has been a 1000x increase in published GDB data over a few years. There is in fact a large quantity of technology for tackling almost any aspect of the LOD value chain, but people do not necessarily know about this nor is it easy to integrate. Still there would be great value in integration. Getting software to interoperate in a meaningful way is manual labor, so it might make sense to organize hackathons around this. While the STI Summit is for the senior people, there could be a parallel track of events for bringing the coders together to actually practice tool integration and interoperation.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>I was recently at the <a href="http://www.openlinksw.com:80/www.sti2.org/events/2011-sti-semantic-summit" id="link-id0x2308d838">STI 2011 summit in Riga, Latvia</a>.
This is a meeting of senior participants in the semantic web and sem tech scene, organized by <a href="http://www.openlinksw.com:80/www.sti2.org/" id="link-id0x25076168">STI</a> of <a href="http://dbpedia.org/page/Dieter_Fensel" id="link-id0x24d2e998">Dieter Fensel</a> fame, with board members like <a href="http://www.michaelbrodie.com/" id="link-id0x224b4b58">Michael Brodie</a>, <a href="http://www.iks-project.eu/community/people/mark-greaves" id="link-id0x2308d4a8">Mark Greaves</a>, and <a href="http://dbpedia.org/page/James_Hendler" id="link-id0x24c192d0">Jim Hendler</a>.</p>

<p>This is substantially about the intersection of AI, knowledge representation, and databases. As we have said before, the database side has not been very prominent in these meetings in the past, but this time we had <a href="http://homepages.cwi.nl/~boncz/" id="link-id0x26654260">Peter Boncz</a> of CWI, of MonetDB and VectorWise fame, attending the proceedings.</p>

<p>Will DB and AI finally meet? Well, they have met, but how do they get along? Before I try to answer this, let us look at some background.</p>

<p>At present, CWI and <a href="http://www.openlinksw.com/" id="link-id0x24724fe0">OpenLink</a> are working together in the <a href="http://lod2.eu/" id="link-id0x24e20d90">LOD2 EU FP7 project</a>, around the general topic of bringing the best of <a href="http://dbpedia.org/page/Relational_database" id="link-id0x2475f128">Relational Database</a> (RDB) science to the <a href="http://dbpedia.org/page/Graph_database" id="link-id0x2474e988">Graph Database</a> (GDB) world. Virtuoso has for a few months had a column store capability (which is about to be made available for public preview). CWI has a long history of column store work, with MonetDB and Ingres VectorWise as results. OpenLink&#39;s column store implementation is separate in terms of code but is of course influenced by the work at CWI and other published column store results. The plan is to transplant the applicable CWI innovations into the graph context within Virtuoso. These improvements naturally also benefit Virtuoso RDB (SQL), but the LOD2 project is primarily concerned with GDB applications. The RDB yardstick for much of this work is <a href="http://dbpedia.org/resource/TPC-H" id="link-id0x22a96588">TPC-H</a>, of which we have made a GDB translation. CWI is uniquely qualified as concerns this in light of VectorWise holding some of the top places in the TPC-H charts.</p>

<p>Even now, we do in fact run the 22 TPC-H queries in SPARQL against the Virtuoso column store. True, these run faster in SQL against relational tables, but we have established a beachhead. From this initial position, we can incrementally improve the GDB/SPARQL and RDB/SQL functions, and see how close to SQL we get with SPARQL. I will make a separate post commenting on the differences between SQL and SPARQL.</p>

<p>So let&#39;s get back to Riga. Mark Greaves said in his opening comments that he would be sick if he once again heard complaining about how bad and un-scalable the tools were. From all the talks, I did get the overall impression that just better databasing for Graph Data is still needed. OK, we have 1-1/2 years of unreleased work just for that about to hit the street; advances are substantial. Along these lines, the people from <a href="http://www.bio2rdf.org/" id="link-id0x2315c088">Bio2RDF</a> pointed out that there still is a cost to publishing query services, especially for complex queries. Well, this cost will be substantially reduced.</p>

<p>The takeaway from the meeting is that the most useful thing, for both our public and ourselves, is simply to keep advancing database tech for graph data. In the first instance, this is about launching what we already have; in the second, about going through the CWI record of innovation and adapting this to GDB.</p>

<p>The thinking is that once query-answering on some tens-of-billions of triples is easily interactive no matter what question one asks, a tipping point will be reached, and GDB can efficiently play the role of data-melting-pot that has been envisioned for it.</p>

<p>This is just a beginning, though. Michael Brodie has on a number of occasions pointed out that (relational) database guys are only about performance, with little or no regard to meaning or even questions of the applicability of the relational model. Peter Boncz then responded that it may well be that the bulk of IT expenditure worldwide in fact goes into data integration. However, data integration is an &quot;<a href="http://dbpedia.org/page/AI-complete" id="link-id0x24754170">AI-complete</a>&quot; problem with infinite variety and consequent difficulty of measurement. So, making better database engines stands a much greater chance of success and has the nicety of relatively unambiguous metrics. </p>

<p>Quite so. We are somewhere in the middle. I&#39;d say that GDB is still at the stage where making better databases is a matter of make-or-break and not a matter of cutting already vanishingly-short response times just for the sake of it. We will have progress if we just keep at it; for now, performance is still a basic need and not a luxury.</p>

<p>Now that there is all this potentially integrable data published as graphs (most commonly as RDF serializations), what do we do? Someone at the Riga meeting suggested we take a look across the tracks to the RDB world to see what is being done there for data integration. The question is raised, what does GDB have for data integration? The automatic answer that GDB and RDF have OWL is not adequate, as was rightly pointed out by many. Having schema-last, global identifiers, and some culture of vocabulary reuse is nice, but this is only a start. To cite an example, <code>owl:sameAs</code> will not work when entities simply do not align: One database models a product as a parts hierarchy; another does the same but now based on the materials used in the parts. One tree just has a node that is not in the other. Besides, things like string matching (as in extracting area codes from phone numbers) are common, and OWL specifically excludes any such functions.</p>

<p>It is now time to look at what will come after all the database advances. In my talk I outlined some things that have or are about to get solutions:</p>

<ul>
 <li>
  <p>
    <b>Database technology:</b> Applying advances from RDB (specifically columns, vectoring, and some adaptive query execution) will make GDB a possibility for data warehousing at some scale.</p>
 </li>

<li>
  <p>
    <b>Benchmarks:</b> These advances will be demonstrable through benchmarking. There is a better suite of benchmarks with many variations of BSBM, a GDB-modified TPC-H, and the upcoming Social Intelligence Benchmark (SIB) with actual graph data. There are the beginnings of an auditing process for result publishing, and a fair chance the semdata world will get its analog of the TPC.</p>
</li>
</ul>

<p>After these basics are more or less in hand, we have a vista of more diverse questions:</p>

<ul>
 <li>
  <p>What to do about inference? We do not want OWL or RIF for their own sake; instead we want whatever will declaratively facilitate making sense of data. This is an entirely use-case-driven question. If this can have a reasonably generic answer, we will build it into the engine. </p>
 </li>

<li>
  <p>Data integration is highly diverse, and tool sets like IBM Infosphere have thousands of modules and functions for different aspects of the problem. To what degree does it make sense to put DI-oriented capabilities into a DBMS? </p>
</li>

<li>
  <p>Is it the case that SQL or SPARQL, plus or minus a few details, is as powerful as a language can be while staying application domain-agnostic? In other words, if more powerful reasoning is built into the query language, will the requirements vary so much between application domains that the work is not generally applicable? <a href="http://dbpedia.org/page/Datalog" id="link-id0x2403b2f0">Datalog</a> is general enough, but can we demonstrate substantially reduced time to answer with big data if this is built into the engine? <a href="http://boom.cs.berkeley.edu/" id="link-id0x23ed5730">Berkeley Orders Of Magnitude</a> claims this, even though their claim is not exactly in a database context. We need use cases to refine the actual requirement for inference.</p>
</li>
</ul>

<p>In all these questions, we of necessity turn to the user community. In fact, we do not follow the usage of these technologies as much as we ought to. One outcome of the Riga summit is a set of public challenges, to be released soon, that will hopefully improve this state of affairs.</p>

<p>The general feeling was that there is more going on on the data side than the AI side. The LOD movement proceeds and lightweight everything predominates, also for knowledge representation. There was some discussion about &quot;pay as you go&quot; integration. On the one hand, there is no up-front integration of information systems just for its own sake, so pay as you go is the only kind that exists, system by system, as the need becomes sufficient. On the other hand, each such integration is a process which has its distinct steps and maintenance and within itself it is planned, and thus pre-paid, so to speak. We need more work with the data itself to better understand the matter. The open government data should offer a playground for this and there will be a special challenge around this.</p>

<p>
<a href="http://schema.org/" id="link-id0x2475a708">Schema.org</a> and <a href="http://www.w3.org/TR/microdata/" id="link-id0x2a6f8b40">Microdata</a> got their share of discussion. As we see it, it is good that search engines make their pre-competitive data open. This is better than, for example, Google wanting retailers to put their catalogs in Google Base. We do not care about the specific syntax in which data is embedded; we support them all. Microdata converts easily to triples, and if one wants to make a tabular extraction for use with relational tools, this too is simple enough. Applications will have to do their own entity resolution, but this is independent of data publication format. </p>

<p>All in all, the mood was positive. Mark Greaves noted in his closing remarks that there has been a 1000x increase in published GDB data over a few years. There is in fact a large quantity of technology for tackling almost any aspect of the LOD value chain, but people do not necessarily know about this nor is it easy to integrate. Still there would be great value in integration. Getting software to interoperate in a meaningful way is manual labor, so it might make sense to organize hackathons around this. While the STI Summit is for the senior people, there could be a parallel track of events for bringing the coders together to actually practice tool integration and interoperation.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-03-22#1692">
  <rss:title>Transaction Semantics in RDF and Relational Models</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-03-22T23:55:43Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">As a part of defining benchmark audit for testing ACID properties on RDF stores, we will here examine different RDF scenarios where lack of concurrency control causes inconsistent results. In so doing, we consider common implementation techniques and implications as concern locking (pessimistic) and multi-version (optimistic) concurrency control schemes. In the following, we will talk in terms of triples, but the discussion can be trivially generalized to quads. We will use numbers for IRIs and literals. In most implementations, the internal representation for these is indeed a number (or at least some data type that has a well defined collation order). For ease of presentation, we consider a single index with key parts SPO. Any other index-like setting with any possible key order will have similar issues. Insert (Create) and Delete INSERT and DELETE as defined in SPARQL are queries which generate a result set which is then used for instantiating triple patterns. We note that a DELETE may delete a triple which the DELETE has not read; thus the delete set is not a subset of the read set. The SQL equivalent is the DELETE FROM table WHERE key IN ( SELECT key1 FROM other_table ) expression, supposing it were implemented as a scan of other_table and an index lookup followed by DELETE on table. The meaning of INSERT is that the triples in question exist after the operation, and the meaning of DELETE is that said triples do not exist. In a transactional context, this means that the after-image of the transaction is guaranteed either to have or not-have said triples. Suppose that the triples { 1 0 0 }, { 1 5 6 }, and { 1 5 7 } exist in the beginning. If we DELETE { 1 ?x ?y } and concurrently INSERT { 1 2 4 . 1 2 3 . 1 3 5 }, then whichever was considered to be first by the concurrency control of the DBMS would complete first, and the other after that. 
Thus the end state would either have no triples with subject 1 or would have the three just inserted. Suppose the INSERT inserts the first triple, { 1 2 4 }. The DELETE at the same time reads all triples with subject 1. The exclusive read waits for the uncommitted INSERT. The INSERT then inserts the second triple, { 1 2 3 }. Depending on the isolation of the read, this either succeeds, since no { 1 2 3 } was read, or causes a deadlock. The first corresponds to REPEATABLE READ isolation; the second to SERIALIZABLE. We would not get the desired end-state of either all the inserted triples or no triples with subject 1 if the read or the DELETE were not serializable. Furthermore if a DELETE template produced a triple that did not exist in the pre-image, the DELETE semantics still imply that this also does not exist in the after-image, which implies serializability. Read and Update Let us consider the prototypical transaction example of transferring funds from one account to another. Two balances are updated, and a history record is inserted. The initial state is a balance 10 b balance 10 We transfer 1 from a to b, and at the same time transfer 2 from b to a. The end state must have a at 11 and b at 9. A relational database needs REPEATABLE READ isolation for this. With RDF, txn1 reads that a has a balance of 10. At the same time, txn1 reads the balance of a. txn2 waits because the read of txn1 is exclusive. txn1 proceeds and read the balance of b. It then updates the balance of a and b. All goes without the deadlock which is always cited in this scenario, because the locks are acquired in the same order. The act of updating the balance of a, since RDF does not really have an update-in-place, consists of deleting { a balance 10 } and inserting { a balance 9 }. This gets done and txn1 commits. At this point, txn2 proceeds after its wait on the row that stated { a balance 10 }. 
This row is now gone, and txn2 sees that a has no balance, which is quite possible in RDF&#39;s schema-less model. We see that REPEATABLE READ is not adequate with RDF, even though it is with relational. The reason why there is no UPDATE-in-place is that the PRIMARY KEY of the triple includes all the parts, including the object. Even in a RDBMS, an UPDATE of a primary key part amounts to a DELETE-plus-INSERT. One could here argue that an implementation might still UPDATE-in-place if the key order were not changed. This would resolve the special case of the accounts but not a more general case. Thus we see that the read of the balance must be SERIALIZABLE. This means that the read locks the space before the first balance, so that no insertion may take place. In this way the read of txn2 waits on the lock that is conceptually before the first possible match of { a balance ?x }. locking order and OLTP To implement TPC-C, I would update the table with the highest cardinality first, and then all tables in descending order of cardinality. In this way, the locks with the highest likelihood for contention are held for the least time. If locking multiple rows of a table, these should be locked in a deterministic order, e.g., lowest key-value first. In this way, the workload would not deadlock. In actual fact, with clusters and parallel execution, the lock acquisition will not be guaranteed to be serial, so deadlocks do not entirely go away, but still may get fewer. Besides, any outside transaction might still lock in the wrong order and cause deadlocks, which is why the OLTP application must in any case be built to deal with the possibility of deadlock. This is the conventional relational view of the matter. In more recent times, in-memory schemes with deterministic lock acquisition (Abadi VLDB 2010) or single-threaded atomic execution of transactions (Uni Munich BIRTE workshop at VLDB2010, VoltDB) have been proposed. 
There the transaction is described as a stored procedure, possibly with extra annotations. These techniques might apply to RDF also. RDF is however an unlikely model for transaction-intensive applications, so we will not for now examine these further. RDBMS usually implement row-level locking. This means that once a column of a row has an uncommitted state, any other transaction is prevented from changing the row. This has no ready RDF equivalent. RDF is usually implemented as a row-per-triple system and applying row-level locking to this does not give the semantic one expects of a relational row. I would argue that it is not essential to enforce transactional guarantees in units of rows. The guarantees must apply between data that is read and written by a transaction. It does not need to apply to columns that the transaction does not reference. To take the TPC-C example, the new order transaction updates the stock level and the delivery transaction updates the delivery count on the stock table. In practice, a delivery and a new order falling on the same row of stock will lock each other out, but nothing in the semantics of the workload mandates this. It does not seem a priori necessary to recreate the row as a unit of concurrency control in RDF. One could say that a multi-attribute whole (such as an address) ought to be atomic for concurrency control, but then applications updating addresses will most likely read and update all the fields together even if only the street name changes. Pessimistic Vs. Optimistic Concurrency Control We have so far spoken only in terms of row-level locking, which is to my knowledge the most widely used model in RDBMS, and one we implement ourselves. Some databases (e.g., MonetDB and VectorWise) implement optimistic concurrency control. 
The general idea is that each transaction has a read and write set and when a transaction commits, any other transactions whose read or write set intersects with the write set of the committing transaction are marked un-committable. Once a transaction thus becomes un-committable, it may presumably continue reading indefinitely but may no longer commit its updates. Optimistic concurrency is generally coupled with multi-version semantics where the pre-image of a transaction is a clean committed state of the database as of a specific point in time, i.e., snapshot isolation. To implement SERIALIZABLE isolation, i.e., the guarantee that if a transaction twice performs a COUNT the result will be the same, one locks also the row that precedes the set of selected rows and marks each lock so as to prevent an insert to the right of the lock in key order. The same thing may be done in an optimistic setting. Positional Handling of Updates in Column Stores [Heman, Zukowski, CWI science library] discusses management of multiple consecutive snapshots in some detail. The paper does not go into the details of different levels of isolation but nothing there suggests that serializability could not be supported. There is some complexity in marking the space between ordered rows as non-insertable across multiple versions but this should be feasible enough. The issue of optimistic Vs. pessimistic concurrency does not seem to be affected by the differences between RDF and relational models. We note that an OLTP workload can be made to run with very few transaction aborts (deadlocks) by properly ordering operations when using a locking scheme. The same does not work with optimistic concurrency since updates happen immediately and transaction aborts occur whenever the writes of one intersect the reads or writes of another, regardless of the order in which these were made. 
Developers seldom understand transactions; therefore DBMS should, within the limits of the possible, optimize locking order for locking schemes. A simple example is locking in key order when doing an operation on a set of values. A more complex variant would consist of analyzing data dependencies in stored procedures and reordering updates so as to get the highest cardinality tables first. We note that this latter trick also benefits optimistic schemes. In RDF, the same principles apply but distinguishing cardinality of an updated set will have to rely on statistics of predicate cardinality. Such are anyhow needed for query optimization. Eventual Consistency Web scale systems that need to maintain consistent state across multiple data centers sometimes use &quot;eventual consistency&quot; schemes. Two-phase-commit becomes very inefficient as latency increases, thus strict transactional semantics have prohibitive cost if the system is more distributed than a cluster with a fast interconnect. Eventual consistency schemes (Amazon Dynamo, Yahoo! PNUTS) maintain history information on the record which is the unit of concurrency control. The record is typically a non-first normal form chunk of related data that it makes sense to store together from the application&#39;s viewpoint. Application logic can then be applied to reconciling differing copies of the same logical record. Such a scheme seems a priori ill-suited for RDF, where the natural unit of concurrency control would seem to be the quad. We first note that only recently changed (i.e., DELETEd + INSERTed quads, as there is no UPDATE-in-place) need history information. This history information can be stored away from the quad itself, thus not disrupting compression. 
When detecting that one site has INSERTed a quad that another has DELETEd in the same general time period, application logic can still be applied for reading related quads in order to arrive at a decision on how to reconcile two databases that have diverged. The same can apply to conflicting values of properties that for the application should be single-valued. Comparing time-stamped transaction logs on quads is not fundamentally different from comparing record histories in Dynamo or PNUTS. As we overcome the data size penalties that have until recently been associated with RDF, RDF becomes even more interesting as a data model for large online systems such as social network platforms where frequent application changes lead to volatility of schema. Key value stores are currently found in such applications, but they generally do not provide the query flexibility at which RDF excels. Conclusions We have gone over basic aspects of the endlessly complex and variable topic of transactions, and drawn parallels as well as outlined two basic differences between relational and RDF systems: What used to be REPEATABLE READ becomes SERIALIZABLE; and row-level locking becomes locking at the level of a single attribute value. For the rest, we see that the optimistic and pessimistic modes of concurrency control, as well as guidelines for writing transaction procedures, remain much the same. Based on this overview, it should be possible to design an ACID test for describing the ACID behavior of benchmarked systems. We do not intend to make transaction support a qualification requirement for an RDF benchmark, but information on transaction support will still be valuable in comparing different systems.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>As part of defining a benchmark audit for testing <a class="auto-href" href="http://dbpedia.org/resource/ACID" id="link-id0x1cfc6e38">ACID</a> properties on <a class="auto-href" href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1f1302b8">RDF</a> stores, we here examine different RDF scenarios where a lack of concurrency control causes inconsistent results.  In so doing, we consider common implementation techniques and their implications for locking (pessimistic) and multi-version (optimistic) concurrency control schemes.</p>

<p>In the following, we will talk in terms of triples, but the discussion can be trivially generalized to quads.  We will use numbers for IRIs and literals.  In most implementations, the internal representation for these is indeed a number (or at least some <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x1728a9a8">data</a> type that has a well defined collation order).  For ease of presentation, we consider a single index with key parts <code>SPO</code>.  Any other index-like setting with any possible key order will have similar issues. </p>

<h2>Insert (Create) and Delete </h2>

<p>
<code>INSERT</code> and <code>DELETE</code> as defined in <a class="auto-href" href="http://dbpedia.org/resource/SPARQL" id="link-id0x16dee7f8">SPARQL</a> are queries which generate a result set which is then used for instantiating triple patterns.  We note that a <code>DELETE</code> may delete a triple which the <code>DELETE</code> has not read; thus the delete set is not a subset of the read set.  The <a class="auto-href" href="http://dbpedia.org/resource/SQL" id="link-id0x1e3afb78">SQL</a> equivalent is the </p>

<blockquote>
 <code><pre>DELETE FROM table WHERE key IN 
   ( SELECT key1 FROM other_table )</pre>
 </code>
</blockquote>

<p>expression, supposing it were implemented as a scan of <code>other_table</code> and an index lookup followed by <code>DELETE</code> on table. </p>

<p>The meaning of <code>INSERT</code> is that the triples in question exist after the operation, and the meaning of <code>DELETE</code> is that said triples do not exist. In a transactional context, this means that the after-image of the transaction is guaranteed either to have or not-have said triples. </p>

<p>Suppose that the triples <code>{ 1 0 0 }</code>, <code>{ 1 5 6 }</code>, and <code>{ 1 5 7 }</code> exist in the beginning. If we <code>DELETE { 1 ?x ?y }</code> and concurrently <code>INSERT { 1 2 4 . 1 2 3 . 1 3 5 }</code>, then whichever was considered to be first by the concurrency control of the DBMS would complete first, and the other after that.  Thus the end state would either have no triples with subject <code>1</code> or would have the three just inserted. </p>

<p>Suppose the <code>INSERT</code> inserts the first triple, <code>{ 1 2 4 }</code>.  The <code>DELETE</code> at the same time reads all triples with subject <code>1</code>.  The exclusive read waits for the uncommitted <code>INSERT</code>.  The <code>INSERT</code> then inserts the second triple, <code>{ 1 2 3 }</code>. Depending on the isolation of the read, this either succeeds, since no <code>{ 1 2 3 }</code> was read, or causes a deadlock.  The first corresponds to <code>REPEATABLE READ</code> isolation; the second to <code>SERIALIZABLE</code>.</p>

<p>We would not get the desired end-state of either <i>all the inserted triples</i> or <i>no triples with subject <code>1</code></i> if the read or the <code>DELETE</code> were not serializable.</p>

<p>Furthermore if a <code>DELETE</code> template produced a triple that did not exist in the pre-image, the <code>DELETE</code> semantics still imply that this also does not exist in the after-image, which implies serializability.</p>
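<p>The example above can be replayed in a few lines of Python. This is only a sketch: triples are plain tuples in a set, and the function names (<code>delete_subject</code>, <code>insert_all</code> and so on) are hypothetical helpers, not the API of any RDF store. It computes the two end states a serializable execution may produce, and one interleaving without isolation that produces neither.</p>

```python
# Triples are (S, P, O) tuples of numbers, as in the text.
INITIAL = {(1, 0, 0), (1, 5, 6), (1, 5, 7)}
TO_INSERT = [(1, 2, 4), (1, 2, 3), (1, 3, 5)]


def delete_subject(store, subject):
    """DELETE { subject ?x ?y }: no triple with this subject remains."""
    return {t for t in store if t[0] != subject}


def insert_all(store, triples):
    """INSERT: the given triples exist afterwards."""
    return store | set(triples)


def serial_outcomes():
    """The only two end states a serializable execution may produce."""
    delete_first = insert_all(delete_subject(INITIAL, 1), TO_INSERT)
    insert_first = delete_subject(insert_all(INITIAL, TO_INSERT), 1)
    return delete_first, insert_first


def unisolated_interleaving():
    """The DELETE runs between the first and the second inserted triple."""
    store = insert_all(INITIAL, TO_INSERT[:1])
    store = delete_subject(store, 1)
    return insert_all(store, TO_INSERT[1:])
```

<p>The interleaved run ends with two of the three inserted triples, which matches neither serial outcome.</p>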


<h2>Read and Update</h2>

<p>Let us consider the prototypical transaction example of transferring funds from one account to another. Two balances are updated, and a history record is inserted.</p>

<p>The initial state is </p>

<blockquote>
<code><pre>a  balance  10
b  balance  10</pre></code>
</blockquote>

<p>We transfer <code>1</code> from <code>a</code> to <code>b</code>, and at the same time transfer <code>2</code> from <code>b</code> to <code>a</code>.  The end state must have <code>a</code> at <code>11</code> and <code>b</code> at <code>9</code>.</p>

<p>A relational database needs <code>REPEATABLE READ</code> isolation for this.</p>

<p>With RDF, <code>txn1</code> reads that <code>a</code> has a <code>balance</code> of <code>10</code>.  At the same time, <code>txn2</code> reads the <code>balance</code> of <code>a</code>.  <code>txn2</code> waits because the read of <code>txn1</code> is exclusive.  <code>txn1</code> proceeds and reads the <code>balance</code> of <code>b</code>.  It then updates the balances of <code>a</code> and <code>b</code>. </p>

<p>All goes without the deadlock which is always cited in this scenario, because the locks are acquired in the same order. The act of updating the balance of <code>a</code>, since RDF does not really have an update-in-place, consists of deleting <code>{ a balance 10 }</code> and inserting <code>{ a balance 9 }</code>.  This gets done and <code>txn1</code> commits. At this point, <code>txn2</code> proceeds after its wait on the row that stated <code>{ a balance 10 }</code>.  This row is now gone, and <code>txn2</code> sees that <code>a</code> has no balance, which is quite possible in RDF&#39;s <a class="auto-href" href="http://dbpedia.org/resource/Database_schema" id="link-id0x1ebb94c8">schema</a>-less model.</p>

<p>We see that <code>REPEATABLE READ</code> is not adequate with RDF, even though it is with relational. The reason why there is no <code>UPDATE</code>-in-place is that the <code>PRIMARY KEY</code> of the triple includes all the parts, including the object. Even in a <a class="auto-href" href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x1ca86578">RDBMS</a>, an <code>UPDATE</code> of a primary key part amounts to a <code>DELETE</code>-plus-<code>INSERT</code>.  One could here argue that an implementation might still <code>UPDATE</code>-in-place if the key order were not changed.  This would resolve the special case of the accounts but not a more general case.</p>

<p>Thus we see that the read of the balance must be <code>SERIALIZABLE</code>.  This means that the read locks the space before the first balance, so that no insertion may take place.  In this way the read of <code>txn2</code> waits on the lock that is conceptually before the first possible match of <code>{ a balance ?x }</code>.</p>
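<p>A single-threaded Python sketch of this interleaving follows. It is not a lock manager: it only replays what <code>txn2</code> sees once its wait ends, under the assumption that the index is a sorted list of triples and that a cursor which waited on a row resumes from that row's key position. All function names are hypothetical.</p>

```python
import bisect


def commit_transfer(index, src, dst, amount):
    """Apply a transfer as DELETE-plus-INSERT of balance triples."""
    balances = {s: o for s, p, o in index if p == "balance"}
    kept = [t for t in index
            if not (t[1] == "balance" and t[0] in (src, dst))]
    kept.append((src, "balance", balances[src] - amount))
    kept.append((dst, "balance", balances[dst] + amount))
    return sorted(kept)


def matches_from(index, start_pos, subject, predicate):
    """Collect objects of { subject predicate ?x } scanning forward."""
    found = []
    for s, p, o in index[start_pos:]:
        if (s, p) != (subject, predicate):
            break
        found.append(o)
    return found


def read_after_row_wait(index, waited_row):
    """REPEATABLE READ style: the wait was on one existing row, and the
    scan resumes from that row's position in key order."""
    pos = bisect.bisect_left(index, waited_row)
    return matches_from(index, pos, waited_row[0], waited_row[1])


def read_after_gap_wait(index, subject, predicate):
    """SERIALIZABLE style: the wait was on the space before the first
    possible match, so the scan covers the whole range."""
    pos = bisect.bisect_left(index, (subject, predicate))
    return matches_from(index, pos, subject, predicate)
```

<p>After <code>txn1</code> commits its transfer of <code>1</code> from <code>a</code> to <code>b</code>, the row-level read resumes past <code>{ a balance 9 }</code> and finds no balance for <code>a</code>, while the range read finds <code>9</code>.</p>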


<h2>Locking Order and OLTP </h2>

<p>To implement <a class="auto-href" href="http://www.tpc.org/" id="link-id0x1e811d68">TPC</a>-C, I would update the table with the highest cardinality first, and then all tables in descending order of cardinality.  In this way, the locks with the highest likelihood of contention, which fall on the smallest tables, are acquired last and held for the least time.  If locking multiple rows of a table, these should be locked in a deterministic order, e.g., lowest key-value first.  In this way, the workload would not deadlock.  In actual fact, with clusters and parallel execution, the lock acquisition will not be guaranteed to be serial, so deadlocks do not go away entirely, though they may become fewer.  Besides, any outside transaction might still lock in the wrong order and cause deadlocks, which is why the OLTP application must in any case be built to deal with the possibility of deadlock.</p>
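<p>The deterministic-order rule can be sketched in Python with one lock per account. This covers only the "lowest key-value first" part, not the ordering of tables by cardinality, and <code>run_transfers</code> is a hypothetical helper written for this illustration.</p>

```python
import threading


def run_transfers(initial, transfers):
    """Run (src, dst, amount) transfers concurrently, one thread each.

    Assumes src and dst differ in every transfer.
    """
    balances = dict(initial)
    locks = {key: threading.Lock() for key in balances}

    def transfer(src, dst, amount):
        # Always lock the lowest key first, whatever the direction of
        # the transfer, so two transfers can never wait on each other.
        first, second = sorted((src, dst))
        with locks[first], locks[second]:
            balances[src] -= amount
            balances[dst] += amount

    threads = [threading.Thread(target=transfer, args=t) for t in transfers]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return balances
```

<p>Calling <code>run_transfers({"a": 10, "b": 10}, [("a", "b", 1), ("b", "a", 2)])</code> runs the two opposite transfers of the earlier example without the classic deadlock.</p>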

<p>This is the conventional relational view of the matter.  In more recent times, in-memory schemes with deterministic lock acquisition (<a href="http://cs-www.cs.yale.edu/homes/dna/papers/determinism-vldb10.pdf" id="link-id0x1c5d9340">Abadi VLDB 2010</a>) or single-threaded atomic execution of transactions (<a href="http://bird.cs.tu-berlin.de:8008/birte2010/" id="link-id0x1ec0ed18">Uni Munich BIRTE workshop at VLDB2010</a>, <a href="http://www.voltdb.com/" id="link-id0x1ab6e380">VoltDB</a>) have been proposed. There the transaction is described as a stored procedure, possibly with extra annotations.  These techniques might apply to RDF also. RDF is however an unlikely model for transaction-intensive applications, so we will not for now examine these further.</p>

<p>RDBMS usually implement row-level locking.  This means that once a column of a row has an uncommitted state, any other transaction is prevented from changing the row.  This has no ready RDF equivalent. RDF is usually implemented as a row-per-triple system and applying row-level locking to this does not give the semantics one expects of a relational row.  </p>

<p>I would argue that it is not essential to enforce transactional guarantees in units of rows.  The guarantees must apply between data that is <i>read</i> and <i>written</i> by a transaction.  It does not need to apply to columns that the transaction does not reference.  To take the TPC-C example, the <i>new order</i> transaction updates the stock level and the <i>delivery</i> transaction updates the delivery count on the stock table. In practice, a <i>delivery</i> and a <i>new order</i> falling on the same row of stock will lock each other out, but nothing in the semantics of the workload mandates this.</p>

<p>It does not seem <i>a priori</i> necessary to recreate the row as a unit of concurrency control in RDF.  One could say that a multi-attribute whole (such as an address) ought to be atomic for concurrency control, but then applications updating addresses will most likely read and update all the fields together even if only the street name changes.</p>


<h2>Pessimistic vs. Optimistic Concurrency Control </h2>

<p>We have so far spoken only in terms of row-level locking, which is to my <a class="auto-href" href="http://dbpedia.org/resource/Knowledge" id="link-id0x1ebbf3f8">knowledge</a> the most widely used model in RDBMS, and one we implement ourselves.  Some databases (e.g., <a class="auto-href" href="http://dbpedia.org/resource/MonetDB" id="link-id0x1e771f48">MonetDB</a> and <a class="auto-href" href="http://www.ingres.com/vectorwise/" id="link-id0x1f3b4830">VectorWise</a>) implement optimistic concurrency control. The general idea is that each transaction has a read and write set and when a transaction commits, any other transactions whose read or write set intersects with the write set of the committing transaction are marked un-committable.  Once a transaction thus becomes un-committable, it may presumably continue reading indefinitely but may no longer commit its updates. Optimistic concurrency is generally coupled with multi-version semantics where the pre-image of a transaction is a clean committed state of the database as of a specific point in time, i.e., snapshot isolation.  </p>
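<p>The commit rule just described can be sketched as follows. This is a minimal illustration of the general idea, not how MonetDB or VectorWise implement it; the <code>Txn</code> class and the <code>commit</code> function are hypothetical.</p>

```python
class Txn:
    """A transaction with the keys it has read and written so far."""

    def __init__(self, name):
        self.name = name
        self.read_set = set()
        self.write_set = set()
        self.committable = True


def commit(txn, active):
    """Try to commit txn; return True if the commit went through.

    On success, every other active transaction whose read or write set
    intersects the write set of txn is marked un-committable.
    """
    if not txn.committable:
        return False
    for other in active:
        if other is txn:
            continue
        touched = other.read_set | other.write_set
        if not txn.write_set.isdisjoint(touched):
            other.committable = False
    return True
```

<p>A transaction marked un-committable in this sketch can still be read from, but a later call to <code>commit</code> on it returns <code>False</code>.</p>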

<p>To implement <code>SERIALIZABLE</code> isolation, i.e., the guarantee that if a transaction twice performs a <code>COUNT</code> the result will be the same, one also locks the row that precedes the set of selected rows and marks each lock so as to prevent an insert to the right of the lock in key order.  The same thing may be done in an optimistic setting.</p>
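<p>A sketch of this range-locking rule over a sorted list of keys is given below. The names, and the <code>BEFORE_FIRST</code> sentinel for a range that starts at the very beginning of the index, are assumptions made for the illustration.</p>

```python
import bisect

# Hypothetical sentinel: the position before the first row of the index.
BEFORE_FIRST = None


def serializable_range_locks(index, low, high):
    """Lock the row preceding the range and every row in [low, high].

    Each lock is understood to forbid inserts immediately to its right.
    """
    start = bisect.bisect_left(index, low)
    end = bisect.bisect_right(index, high)
    predecessor = index[start - 1] if start > 0 else BEFORE_FIRST
    return {predecessor, *index[start:end]}


def insert_blocked(index, locked, new_key):
    """An insert must wait if the row to its left carries a range lock."""
    pos = bisect.bisect_left(index, new_key)
    left = index[pos - 1] if pos > 0 else BEFORE_FIRST
    return left in locked
```

<p>The scheme is conservative: an insert that falls just outside the selected range, but directly to the right of a locked row, is blocked as well. A range that selects no rows still locks its predecessor, which is the case of the missing balance above.</p>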

<p>
  <a href="http://event.cwi.nl/SIGMOD-RWE/2010/22-7f15a1/paper.pdf" id="link-id0x1d5de810">Positional Handling of Updates in Column Stores</a> [Heman, Zukowski, <a class="auto-href" href="http://dbpedia.org/resource/National_Research_Institute_for_Mathematics_and_Computer_Science" id="link-id0x1e7644d8">CWI</a> science library] discusses management of multiple consecutive snapshots in some detail. The paper does not go into the details of different levels of isolation but nothing there suggests that serializability could not be supported.  There is some complexity in marking the space between ordered rows as non-insertable across multiple versions but this should be feasible enough. </p>

<p>The issue of optimistic vs. pessimistic concurrency does not seem to be affected by the differences between RDF and relational models.  We note that an OLTP workload can be made to run with very few transaction aborts (deadlocks) by properly ordering operations when using a locking scheme.  The same does not work with optimistic concurrency since updates happen immediately and transaction aborts occur whenever the writes of one intersect the reads or writes of another, regardless of the order in which these were made.</p>

<p>Developers seldom understand transactions; therefore a DBMS should, within the limits of the possible, optimize locking order for locking schemes.  A simple example is locking in key order when doing an operation on a set of values.  A more complex variant would consist of analyzing data dependencies in stored procedures and reordering updates so as to get the highest cardinality tables first.  We note that this latter trick also benefits optimistic schemes.</p>

<p>In RDF, the same principles apply but distinguishing cardinality of an updated set will have to rely on statistics of predicate cardinality. Such statistics are needed anyhow for query <a class="auto-href" href="http://dbpedia.org/resource/Program_optimization" id="link-id0x1f05c1a8">optimization</a>.</p>

<h2>Eventual Consistency </h2>

<p>Web scale systems that need to maintain consistent state across multiple data centers sometimes use &quot;eventual consistency&quot; schemes.  <a class="auto-href" href="http://dbpedia.org/resource/Two-phase_commit_protocol" id="link-id0x1cebd340">Two-phase-commit</a> becomes very inefficient as latency increases, thus strict transactional semantics have prohibitive cost if the system is more distributed than a cluster with a fast interconnect.</p>

<p>Eventual consistency schemes (<a href="http://dbpedia.org/page/Dynamo_(storage_system)" id="link-id0x1f9db8f8">Amazon Dynamo</a>, <a href="http://research.yahoo.com/project/212" id="link-id0x1da3db80">Yahoo! PNUTS</a>) maintain history <a class="auto-href" href="http://dbpedia.org/resource/Information" id="link-id0x1ec4dbc8">information</a> on the record which is the unit of concurrency control.  The record is typically a non-first normal form chunk of related data that it makes sense to store together from the application&#39;s viewpoint.  Application logic can then be applied to reconciling differing copies of the same logical record. </p>

<p>Such a scheme seems <i>a priori</i> ill-suited for RDF, where the natural unit of concurrency control would seem to be the quad.  We first note that only recently changed quads (i.e., <code>DELETEd + INSERTed</code>, as there is no <code>UPDATE</code>-in-place) need history information.  This history information can be stored away from the quad itself, thus not disrupting compression.  When detecting that one site has <code>INSERTed</code> a quad that another has <code>DELETEd</code> in the same general time period, application logic can still be applied for reading related quads in order to arrive at a decision on how to reconcile two databases that have diverged.  The same can apply to conflicting values of properties that for the application should be single-valued.  Comparing time-stamped transaction logs on quads is not fundamentally different from comparing record histories in Dynamo or PNUTS.</p>
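<p>A sketch of such a log comparison follows. The log format (a list of <code>(timestamp, operation, quad)</code> entries per site), the <code>window</code> parameter, and the last-writer-wins policy are all assumptions made for the illustration; a real application would substitute its own reconciliation logic and would likely read related quads as well.</p>

```python
def find_conflicts(log_a, log_b, window):
    """Find quads INSERTed at one site and DELETEd at the other
    within `window` time units of each other."""
    conflicts = []
    for ts_a, op_a, quad_a in log_a:
        for ts_b, op_b, quad_b in log_b:
            if quad_a != quad_b:
                continue
            if {op_a, op_b} != {"INSERT", "DELETE"}:
                continue
            if abs(ts_a - ts_b) > window:
                continue
            conflicts.append((quad_a, (ts_a, op_a), (ts_b, op_b)))
    return conflicts


def last_writer_wins(conflict):
    """One possible application policy: the later operation stands.

    On equal timestamps the entry of the first site is kept.
    """
    _quad, entry_a, entry_b = conflict
    return max(entry_a, entry_b, key=lambda entry: entry[0])[1]
```

<p>The nested loop is quadratic in the log sizes, which is acceptable here only because, as noted above, just the recently changed quads carry history.</p>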

<p>As we overcome the data size penalties that have until recently been associated with RDF, RDF becomes even more interesting as a data model for large online systems such as social network platforms where frequent application changes lead to volatility of schema.  Key value stores are currently found in such applications, but they generally do not provide the query flexibility at which RDF excels. </p>


<h2>Conclusions </h2>

<p>We have gone over basic aspects of the endlessly complex and variable topic of transactions, and drawn parallels as well as outlined two basic differences between relational and RDF systems: What used to be <code>REPEATABLE READ</code> becomes <code>SERIALIZABLE</code>; and row-level locking becomes locking at the level of a single attribute value.  For the rest, we see that the optimistic and pessimistic modes of concurrency control, as well as guidelines for writing transaction procedures, remain much the same.</p>

<p>Based on this overview, it should be possible to design an ACID test for describing the ACID behavior of benchmarked systems.  We do not intend to make transaction support a qualification requirement for an RDF benchmark, but information on transaction support will still be valuable in comparing different systems.</p>

]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-03-22#1690">
  <rss:title>RDF and Transactions</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-03-22T22:52:56Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">I will here talk about RDF and transactions for developers in general. The next one talks about specifics and is for specialists. Transactions are certainly not the first thing that comes to mind when one hears &quot;RDF&quot;. We have at times used a recruitment questionnaire where we ask applicants to define a transaction. Many vaguely remember that it is a unit of work, but usually not more than that. We sometimes get questions from users about why they get an error message that says &quot;deadlock&quot;. &quot;Deadlock&quot; is what happens when multiple users concurrently update balances on multiple bank accounts in the wrong order. What does this have to do with RDF? There are in fact users who even use XA with a Virtuoso-based RDF application. Franz also has publicized their development of full ACID capabilities for AllegroGraph. RDF is a database schema model, and transactions will inevitably become an issue in databases. At the same time, the developer population trained with MySQL and PHP is not particularly transaction-aware. Transactions have gone out of style, declares the No-SQL crowd. Well, it is not so much SQL they object to but ACID, i.e., transactional guarantees. We will talk more about this in the next post. The SPARQL language and protocol do not go into transactions, except for expressing the wish that an UPDATE request to an end-point be atomic. But beware -- atomicity is a gateway drug, and soon one finds oneself on full ACID. If one says that a thing will either happen in its entirety or not at all, which is what (A) atomicity means, then the question arises of (I) isolation; that is, what happens if somebody else does something to the same data at the same time? Then comes the question of whether a thing, once having happened, will stay that way; i.e., (D) durability. 
Finally, there is (C) consistency, which means that the transaction&#39;s result must not contradict restrictions the database is supposed to enforce. RDF usually has no restrictions; thus consistency mostly means that the internal state of the DBMS must be consistent, e.g., different indices on triples/quads should contain the same data. There are, of course, database-like consistency criteria that one can express in RDF Schema and OWL, concerning data types, mandatory presence of properties, or restrictions on cardinality (e.g., one may only have one spouse at a time, and the like). If one indeed did enforce them all, then RDF would be very like the relational model -- with all the restrictions, but without the 40 years of work on RDBMS performance. For this reason, RDF use tends to involve data that is not structured enough to be a good fit for RDBMS. There is of course the OWL side, where consistency is important but is defined in such complex ways that it again is not a good fit for RDBMS. RDF could be seen to be split between the schema-last world and the knowledge representation world. I will here focus on the schema-last side. Transactions are relevant in RDF in two cases: 1. If data is trickle loaded in small chunks, one likes to know that the chunks do not get lost or corrupted; 2. If the application has any semantics that reserve resources, then these operations need transactions. The latter is not so common with RDF but examples include read-write situations, like checking if a seat is available and then reserving it. Transactionality guarantees that the same seat does not get reserved twice. Web people argue with some justification that since the four cardinal virtues of database never existed on the web to begin with, applying strict ACID to web data is beside the point, like locking the stable after the horse has long since run away. 
This may be so; yet the systems used for processing data, whether that data is dirty or not, benefit from predictable operation under concurrency and from not losing data. Analytics workloads are not primarily about transactions, but still need to specify what happens with updates. Analyzing data from measurements may not have concurrent updates, but there the transaction issue is replaced by the question of making explicit how the data was acquired and what processing has been applied to it before storage. As mentioned before, the LOD2 project is at the crossroads of RDF and database. I construe its mission to be the making of RDF into a respectable database discipline. Database respectability in turn is as good as inconceivable without addressing the very bedrock on which this science was founded: transactions. As previously argued, we need well-defined and auditable benchmarks. This again brings up the topic of transactions. Once we embark on the database benchmark route, there is no way around this. TPC-H mandates that the system under test support transactions, and the audit involves a test for this. We can do no less. This has led me to more closely examine the issue of RDF and transactions, and whether there exist differences between transactions applied to RDF and to relational data. As concerns Virtuoso, our position has been that one can get full ACID in Virtuoso, whether in SQL or SPARQL, by using a connected client (e.g., ODBC, JDBC, or the Jena or Sesame frameworks), and setting the isolation options on the connection. Having taken this step, one then must take the next step, which consists of dealing with deadlocks; i.e., with concurrent utilization, it may happen that the database at any time notifies the client that the transaction got aborted and the client must retry. Web developers especially do not like this, because this is not what MySQL has taught them to expect. 
MySQL does have transactional back-ends like InnoDB, but often gets used without transactions. With the March 2011 Virtuoso releases, we have taken a closer look at transactions with RDF. It is more practical to reduce the possibility of errors than to require developers to pay attention. For this reason we have automated isolation settings for RDF, greatly reduced the incidence of deadlocks, and even incorporated automatic deadlock retries where applicable. If all users lock the resources they need in the same order, there will be no deadlocks. This is what we do with RDF load in Virtuoso 7; thus any mix of concurrent INSERTs and DELETEs, provided each is under a certain size (normally 10000 quads), is guaranteed never to fail due to locking. These could still fail due to running out of space, though. With previous versions, there was always a possibility of an INSERT or DELETE failing because of deadlock with multiple users. Vectored INSERT and DELETE are sufficient for making web crawling or archive maintenance practically deadlock free, since there the primary transaction is the INSERT or DELETE of a small graph. Furthermore, since the SPARQL protocol has no way of specifying transactions consisting of multiple client-server exchanges, the SPARQL end-point may deal with deadlocks by itself. If all else fails, it can simply execute requests one after the other, thus eliminating any possibility of locking. We note that many statements will be intrinsically free of deadlocks by virtue of always locking in key order, but this cannot be universally guaranteed with arbitrary size operations; thus concurrent operations might still sometimes deadlock. In any case, vectored execution as introduced in Virtuoso 7, besides easily doubling the speed of random access, also greatly reduces deadlocks by virtue of ordering operations. In the next post we will talk about what transactions mean with RDF and whether there is any difference with the relational model.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>I will here talk about <a class="auto-href" href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x249bc940">RDF</a> and transactions for developers in general. The next one talks about specifics and is for specialists.</p>

<p>Transactions are certainly not the first thing that comes to mind when one hears &quot;RDF&quot;.  We have at times used a recruitment questionnaire where we ask applicants to define a transaction.  Many vaguely remember that it is a unit of work, but usually not more than that.  We sometimes get questions from users about why they get an error message that says &quot;deadlock&quot;.  &quot;Deadlock&quot; is what happens when multiple users concurrently update balances on multiple bank accounts in the wrong order.  What does this have to do with RDF?</p>

<p>There are in fact users who even use XA with a <a class="auto-href" href="http://virtuoso.openlinksw.com" id="link-id0x22c8dbc8">Virtuoso</a>-based RDF application.  <a class="auto-href" href="http://semanticweb.org/id/Franz_Inc" id="link-id0x27bd0c08">Franz</a> also has publicized their development of full <a class="auto-href" href="http://dbpedia.org/resource/ACID" id="link-id0x283985c8">ACID</a> capabilities for <a class="auto-href" href="http://semanticweb.org/id/AllegroGraph" id="link-id0x238ba438">AllegroGraph</a>.  RDF is a database <a class="auto-href" href="http://dbpedia.org/resource/Database_schema" id="link-id0x2864fef8">schema</a> model, and transactions will inevitably become an issue in databases.</p>

<p>At the same time, the developer population trained with <a class="auto-href" href="http://dbpedia.org/resource/MySQL" id="link-id0x284d2d80">MySQL</a> and <a class="auto-href" href="http://dbpedia.org/resource/PHP" id="link-id0x237230e8">PHP</a> is not particularly transaction-aware.  Transactions have gone out of style, declares the No-<a class="auto-href" href="http://dbpedia.org/resource/SQL" id="link-id0x2920cc88">SQL</a> crowd.  Well, it is not so much SQL they object to but ACID, i.e., transactional guarantees. We will talk more about this in the next post.  The <a class="auto-href" href="http://dbpedia.org/resource/SPARQL" id="link-id0x283f0588">SPARQL</a> language and protocol do not go into transactions, except for expressing the wish that an <code>UPDATE</code> request to an end-point be atomic. But beware -- atomicity is a gateway drug, and soon one finds oneself on full ACID.  </p>

<p>If one says that a thing will either happen <i>in its entirety</i> or <i>not at all,</i> which is what (A) atomicity means, then the question arises of (I) isolation; that is, what happens if somebody else does something to the same <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x238280f8">data</a> at the same time?  Then comes the question of whether a thing, once having happened, will stay that way; i.e., (D) durability. Finally, there is (C) consistency, which means that the transaction&#39;s result must not contradict restrictions the database is supposed to enforce.  RDF usually has no restrictions; thus consistency mostly means that the internal state of the DBMS must be consistent, e.g., different indices on triples/quads should contain the same data.</p>

<p>There are, of course, database-like consistency criteria that one can express in RDF Schema and <a class="auto-href" href="http://dbpedia.org/resource/Web_Ontology_Language" id="link-id0x28625a90">OWL</a>, concerning data types, mandatory presence of properties, or restrictions on cardinality (e.g., one may only have one spouse at a time, and the like).  </p>

<p>If one indeed did enforce them all, then RDF would be very like the relational model -- with all the restrictions, but without the 40 years of work on <a class="auto-href" href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x249bf4f8">RDBMS</a> performance.  For this reason, RDF use tends to involve data that is not structured enough to be a good fit for RDBMS.</p>

<p>There is of course the OWL side, where consistency is important but is defined in such complex ways that it again is not a good fit for RDBMS.  RDF could be seen to be split between the schema-last world and the <a class="auto-href" href="http://dbpedia.org/resource/Knowledge" id="link-id0x249504f8">knowledge</a> representation world.  I will here focus on the schema-last side.</p>

<p>Transactions are relevant in RDF in two cases: 1. If data is trickle loaded in small chunks, one likes to know that the chunks do not get lost or corrupted; 2. If the application has any semantics that reserve resources, then these operations need transactions.  The latter is not so common with RDF but examples include read-write situations, like checking if a seat is available and then reserving it. Transactionality guarantees that the same seat does not get reserved twice.</p>
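
<p>The seat example can be sketched as follows.  This is only an illustration of the check-then-reserve pattern: it uses SQLite from the Python standard library as a stand-in for any transactional store, and the <code>seats</code> table and the <code>open_store</code> and <code>reserve_seat</code> functions are hypothetical names, not anything provided by Virtuoso or SPARQL.</p>

```python
import sqlite3


def open_store(seats):
    """Create an in-memory table with one unreserved row per seat."""
    # isolation_level=None puts the connection in autocommit mode, so the
    # BEGIN and COMMIT statements below are issued exactly as written.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE seats (seat TEXT PRIMARY KEY, holder TEXT)")
    conn.executemany("INSERT INTO seats (seat, holder) VALUES (?, NULL)",
                     [(s,) for s in seats])
    return conn


def reserve_seat(conn, seat, customer):
    """Check and reserve a seat inside one transaction; return True on success."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before the check
    try:
        row = conn.execute("SELECT holder FROM seats WHERE seat = ?",
                           (seat,)).fetchone()
        free = row is not None and row[0] is None
        if free:
            conn.execute("UPDATE seats SET holder = ? WHERE seat = ?",
                         (customer, seat))
        conn.execute("COMMIT")
        return free
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

<p>Because the check and the update happen inside the same transaction, a second caller asking for the same seat finds it already held and gets <code>False</code> rather than a double booking.</p>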

<p>Web people argue with some justification that since the four cardinal virtues of database never existed on the web to begin with, applying strict ACID to web data is beside the point, like locking the stable after the horse has long since run away.  This may be so; yet the systems used for processing data, whether that data is dirty or not, benefit from predictable operation under concurrency and from not losing data.</p>

<p>Analytics workloads are not primarily about transactions, but still need to specify what happens with updates.  Analyzing data from measurements may not have concurrent updates, but there the transaction issue is replaced by the question of making explicit how the data was acquired and what processing has been applied to it before storage.</p>


<p>As mentioned before, the <a class="auto-href" href="http://lod2.eu/" id="link-id0x27d952d0">LOD2</a> project is at the crossroads of RDF and database.  I construe its mission to be the making of RDF into a respectable database discipline.  Database respectability in turn is as good as inconceivable without addressing the very bedrock on which this science was founded: transactions.</p>

<p>As previously argued, we need well-defined and auditable benchmarks.  This again brings up the topic of transactions.  Once we embark on the database benchmark route, there is no way around this. <a class="auto-href" href="http://www.tpc.org/" id="link-id0x2359d2d0">TPC</a>-<a class="auto-href" href="http://dbpedia.org/resource/TPC-H" id="link-id0x28edb770">H</a> mandates that the system under test support transactions, and the audit involves a test for this.  We can do no less.</p>

<p>This has led me to more closely examine the issue of RDF and transactions, and whether there exist differences between transactions applied to RDF and to relational data.  </p>

<p>As concerns Virtuoso, our position has been that one can get full ACID in Virtuoso, whether in SQL or SPARQL, by using a connected client (e.g., <a class="auto-href" href="http://dbpedia.org/resource/Open_Database_Connectivity" id="link-id0x23a55698">ODBC</a>, <a class="auto-href" href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id0x235cecf0">JDBC</a>, or the <a class="auto-href" href="http://jena.sourceforge.net/" id="link-id0x23213900">Jena</a> or <a class="auto-href" href="http://sourceforge.net/projects/sesame/" id="link-id0x277874d0">Sesame</a> frameworks), and setting the isolation options on the connection.  Having taken this step, one then must take the next step, which consists of dealing with deadlocks; i.e., with concurrent utilization, it may happen that the database at any time notifies the client that the transaction got aborted and the client must retry.</p>
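
<p>A client-side retry loop of the kind this requires might look like the sketch below.  <code>DeadlockError</code> is a hypothetical stand-in for whatever error the client library raises when the server aborts a transaction; the attempt limit and delays are likewise arbitrary illustrative values.</p>

```python
import random
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for a driver's 'transaction aborted by deadlock' error."""


def run_with_retry(transaction, max_attempts=5, base_delay=0.01, sleep=time.sleep):
    """Call `transaction()` until it succeeds or `max_attempts` is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except DeadlockError:
            if attempt == max_attempts:
                raise  # give up and let the caller see the failure
            # Wait a random, growing interval so that the competing clients
            # do not collide again in lockstep.
            sleep(base_delay * attempt * random.random())
```

<p>The callable passed in must redo the whole transaction from the start, since an aborted transaction leaves none of its work behind.</p>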

<p>Web developers especially do not like this, because this is not what MySQL has taught them to expect. MySQL does have transactional back-ends like InnoDB, but often gets used without transactions.</p>

<p>With the March 2011 Virtuoso releases, we have taken a closer look at transactions with RDF.  It is more practical to reduce the possibility of errors than to require developers to pay attention. For this reason we have automated isolation settings for RDF, greatly reduced the incidence of deadlocks, and even incorporated automatic deadlock retries where applicable.</p>

<p>If all users lock the resources they need in the same order, there will be no deadlocks.  This is what we do with RDF load in Virtuoso 7; thus any mix of concurrent <code>INSERTs</code> and <code>DELETEs</code>, provided each is under a certain size (normally 10000 quads), is guaranteed never to fail due to locking.  These could still fail due to running out of space, though. With previous versions, there was always a possibility of an <code>INSERT</code> or <code>DELETE</code> failing because of deadlock with multiple users.  Vectored <code>INSERT</code> and <code>DELETE</code> are sufficient for making web crawling or archive maintenance practically deadlock free, since there the primary transaction is the <code>INSERT</code> or <code>DELETE</code> of a small graph. </p>
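
<p>The principle that a common lock order rules out deadlock can be illustrated with a small Python sketch.  It is a generic demonstration with in-process locks, not a description of how Virtuoso orders its locks internally; <code>run_locked</code> and <code>demo</code> are hypothetical names.</p>

```python
import threading


def run_locked(lock_table, keys, action):
    """Acquire the locks for `keys` in sorted key order, run `action`, then release."""
    held = []
    try:
        for key in sorted(set(keys)):
            lock_table[key].acquire()
            held.append(key)
        return action()
    finally:
        for key in reversed(held):
            lock_table[key].release()


def demo(rounds=1000):
    """Two workers name the same two resources in opposite orders."""
    lock_table = {"s1": threading.Lock(), "s2": threading.Lock()}
    counter = {"n": 0}

    def bump():
        counter["n"] += 1  # safe: both locks are held while this runs

    def worker(keys):
        for _ in range(rounds):
            run_locked(lock_table, keys, bump)

    threads = [threading.Thread(target=worker, args=(["s1", "s2"],)),
               threading.Thread(target=worker, args=(["s2", "s1"],))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["n"]
```

<p>Without the sort, the two workers could each take one lock and then wait forever for the other; with it, both always take <code>s1</code> first, so one simply waits for the other to finish.</p>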

<p>Furthermore, since the <a class="auto-href" href="http://www.w3.org/TR/rdf-sparql-protocol/" id="link-id0x23eadf50">SPARQL protocol</a> has no way of specifying transactions consisting of multiple client-server exchanges, the SPARQL end-point may deal with deadlocks by itself.  If all else fails, it can simply execute requests one after the other, thus eliminating any possibility of locking.  We note that many statements will be intrinsically free of deadlocks by virtue of always locking in key order, but this cannot be universally guaranteed with arbitrary size operations; thus concurrent operations might still sometimes deadlock.  In any case, vectored execution as introduced in Virtuoso 7, besides easily doubling the speed of random access, also greatly reduces deadlocks by virtue of ordering operations.</p>

<p>In the next post we will talk about what transactions mean with RDF and whether there is any difference with the relational model.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-03-22#1688">
  <rss:title>Benchmarks, Redux (part 15): BSBM Test Driver Enhancements</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-03-22T22:32:28Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">This article covers the changes we have made to the BSBM test driver during our series of experiments. Drill-down mode - For queries that have a product type as parameter, the test driver will invoke the query multiple times, each time with a random subtype of the product type of the previous invocation. The starting point of the drill-down is a random type from a settable level in the hierarchy. The rationale for the drill-down mode is that depending on the parameter choice, there can be 1000x differences in query run time. Thus run times of consecutive query mixes will be incomparable unless we guarantee that each mix has a predictable number of queries with a product type from each level in the hierarchy. Permutation of query mix - In the BI workload, the queries are run in a random order on each thread in multiuser mode. Doing exactly the same thing on many threads is not realistic for large queries. The data access patterns must be spread out in order to evaluate how bulk IO is organized with differing concurrent demands. The permutations are deterministic on consecutive runs and do not depend on the non-deterministic timing of concurrent activities. For queries with a drill-down, the individual executions that make up the drill-down are still consecutive. New metrics - The BI Power is the geometric mean of query run times scaled to queries per hour and multiplied by the scale factor, where 100 Mt is considered the unit scale. The BI Throughput is the arithmetic mean of the run times scaled to QPH and adjusted to scale as with the Power metric. These are analogous to the TPC-H Power and Throughput metrics. The Power is defined as (scale_factor / 284826) * 3600 / ((t1 * t2 * ... * tn) ^ (1 / n)) The Throughput is defined as (scale_factor / 284826) * 3600 / ((t1 + t2 + ... + tn) / n) The magic number 284826 is the scale that generates approximately 100 million triples (100 Mt). 
We consider this &quot;scale one.&quot; The reason for the multiplication is that scores at different scales should get similar numbers; otherwise a 10x larger scale would result in roughly 10x lower throughput with the BI queries. We also show the percentage each query represents of the total time the test driver waits for responses. Deadlock retry - When running update mixes, it is possible that a transaction gets aborted by a deadlock. We have added retry logic for this. Cluster mode - Cluster databases may have multiple interchangeable HTTP listeners. With this mode, one can specify multiple end-points so a multi-user workload can divide itself evenly over these. Identifying matter - A version number was added to test driver output. Use of the new switches is also indicated in the test driver output. SUT CPU - In comparing results it is crucial to differentiate between in-memory runs and IO-bound runs. To make this easier, we have added an option to report server CPU times over the timed portion (excluding warm-ups). A pluggable self-script determines the CPU times for the system; thus clusters can be handled, too. The time is given as a sum of the CPU time the server processes have accumulated during the run and as a percentage over the wall-clock time. These changes will soon be available as a diff and as a source tree. This version is labeled BSBM Test Driver 1.1-opl; the -opl signifies OpenLink additions. We invite FU Berlin to include these enhancements into their SourceForge repository of the BSBM test driver. There is more precise documentation of these options in the README file in the above distribution. The next planned upgrade of the test driver concerns adding support for &quot;RDF-H&quot;, the RDF adaptation of the industry standard TPC-H decision support benchmark for RDBMS. 
Benchmarks, Redux Series Benchmarks, Redux (part 1): On RDF Benchmarks Benchmarks, Redux (part 2): A Benchmarking Story Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs Benchmarks, Redux (part 6): BSBM and I/O, continued Benchmarks, Redux (part 7): What Does BSBM Explore Measure? Benchmarks, Redux (part 8): BSBM Explore and Update Benchmarks, Redux (part 9): BSBM With Cluster Benchmarks, Redux (part 10): LOD2 and the Benchmark Process Benchmarks, Redux (part 11): The Substance of Benchmarks Benchmarks, Redux (part 12): Our Own BSBM Results Report Benchmarks, Redux (part 13): BSBM BI Modifications Benchmarks, Redux (part 14): BSBM BI Mix Benchmarks, Redux (part 15): BSBM Test Driver Enhancements (this post)</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>This article covers the changes we have made to the <a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x2361bf18">BSBM</a> test driver during our series of experiments.</p>

<ul>
 <li>
  <p>
    <b>Drill-down mode</b> - For queries that have a product type as parameter, the test driver will invoke the query multiple times, each time with a random subtype of the product type of the previous invocation. The starting point of the drill-down is a random type from a settable level in the hierarchy.  The rationale for the drill-down mode is that depending on the parameter choice, there can be 1000x differences in query run time.  Thus run times of consecutive query mixes will be incomparable unless we guarantee that each mix has a predictable number of queries with a product type from each level in the hierarchy.</p>
 </li>

<li>
  <b>Permutation of query mix</b> - In the BI workload, the queries are run in a random order on each thread in multiuser mode.  Doing exactly the same thing on many threads is not realistic for large queries. The <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x2834cec8">data</a> access patterns must be spread out in order to evaluate how bulk IO is organized with differing concurrent demands. The permutations are deterministic on consecutive runs and do not depend on the non-deterministic timing of concurrent activities.  For queries with a drill-down, the individual executions that make up the drill-down are still consecutive.</li>

<li>
  <p>
    <b>New metrics</b> - The BI Power is the geometric mean of query run times scaled to queries per hour and multiplied by the scale factor, where 100 Mt is considered the unit scale. The BI Throughput is the arithmetic mean of the run times scaled to QPH and adjusted to scale as with the Power metric. These are analogous to the <a class="auto-href" href="http://www.tpc.org/" id="link-id0x236c5158">TPC</a>-<a class="auto-href" href="http://dbpedia.org/resource/TPC-H" id="link-id0x28814950">H</a> Power and Throughput metrics. </p>
<p>The <i>Power</i> is defined as</p> 
<blockquote>(scale_factor / 284826) *  3600 / ((t1 * t2 * ... * tn) ^ (1 / n)) </blockquote>
<p>The <i>Throughput</i> is defined as</p> 
<blockquote>(scale_factor / 284826) *  3600 / ((t1 + t2 + ... +  tn) / n)</blockquote>
<p>The magic number 284826 is the scale that generates approximately 100 million triples (100 Mt).  We consider this &quot;scale one.&quot;  The reason for the multiplication is that scores at different scales should get similar numbers; otherwise a 10x larger scale would result in roughly 10x lower throughput with the BI queries.</p>

<p>We also show the percentage each query represents of the total time the test driver waits for responses. </p>
</li>

<li>
  <p>
    <b>Deadlock retry</b> - When running update mixes, it is possible that a transaction gets aborted by a deadlock.  We have added retry logic for this.</p>
</li>

<li>
  <p>
    <b>Cluster mode</b> - Cluster databases may have multiple interchangeable <a class="auto-href" href="http://dbpedia.org/resource/Hypertext_Transfer_Protocol" id="link-id0x240f9008">HTTP</a> listeners.  With this mode, one can specify multiple end-points so a multi-user workload can divide itself evenly over these.</p>
</li>

<li>
  <p>
    <b>Identifying matter</b> - A version number was added to test driver output.  Use of the new switches is also indicated in the test driver output.</p>
</li>

<li>
  <p>
    <b>SUT <a class="auto-href" href="http://dbpedia.org/resource/Central_processing_unit" id="link-id0x249b7208">CPU</a></b> - In comparing results it is crucial to differentiate between in-memory runs and IO-bound runs.  To make this easier, we have added an option to report server CPU times over the timed portion (excluding warm-ups).  A pluggable self-script determines the CPU times for the system; thus clusters can be handled, too.  The time is given as a sum of the CPU time the server processes have accumulated during the run and as a percentage over the wall-clock time.</p>
</li>
</ul>
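
<p>The Power and Throughput definitions given under <i>New metrics</i> can be written out as a short Python sketch.  It assumes run times are expressed in seconds, which the factor 3600 suggests; the function names are ours and are not taken from the test driver source.</p>

```python
import math

UNIT_SCALE = 284826  # scale factor that generates roughly 100 Mt, per the text


def bi_power(scale_factor, run_times):
    """Scale-adjusted queries per hour based on the geometric mean run time."""
    n = len(run_times)
    # exp(mean(log t)) is the geometric mean; it avoids overflow of the raw product.
    geo_mean = math.exp(sum(math.log(t) for t in run_times) / n)
    return (scale_factor / UNIT_SCALE) * 3600 / geo_mean


def bi_throughput(scale_factor, run_times):
    """Scale-adjusted queries per hour based on the arithmetic mean run time."""
    mean = sum(run_times) / len(run_times)
    return (scale_factor / UNIT_SCALE) * 3600 / mean
```

<p>For instance, run times of 1 s and 4 s at scale 284826 have a geometric mean of 2 s and an arithmetic mean of 2.5 s.  If the same queries take ten times longer at ten times the scale, the scale adjustment leaves both scores unchanged, which is the stated purpose of the multiplication.</p>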

<p>These changes will soon be available <a href="http://blogs.usnet.private:8893/RPC2" id="link-id0x1f9a57c0">as a diff</a> and <a href="http://blogs.usnet.private:8893/RPC2" id="link-id0x1f2fea08">as a source tree</a>. This version is labeled <b><code>BSBM Test Driver 1.1-opl</code></b>; the <b><code>-opl</code></b> signifies OpenLink additions.  </p>

<p>We invite FU Berlin to include these enhancements into their SourceForge repository of the BSBM test driver.  There is more precise documentation of these options in the README file in the above distribution.</p>

<p>The next planned upgrade of the test driver concerns adding support for &quot;<a class="auto-href" href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x2865ac68">RDF</a>-H&quot;, the RDF adaptation of the industry standard TPC-H decision support benchmark for <a class="auto-href" href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x23597bb0">RDBMS</a>.</p>



<h3>
<i>Benchmarks, Redux</i> Series</h3>
<ul>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1658" id="link-id0x1db2be00">Benchmarks, Redux (part 1): On RDF Benchmarks</a>
</li>

<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x1dfcc038">Benchmarks, Redux (part 2): A Benchmarking Story</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1663" id="link-id0x197c26d0">Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1665" id="link-id0x1d149cf0">Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1667" id="link-id0x1ab69450">Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1669" id="link-id0x1e67d688">Benchmarks, Redux (part 6): BSBM and I/O, continued</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1671" id="link-id0x1dad87c8">Benchmarks, Redux (part 7): What Does BSBM Explore Measure?</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1673" id="link-id0x1cc73830">Benchmarks, Redux (part 8): BSBM Explore and Update </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1675" id="link-id0x1d6879a8">Benchmarks, Redux (part 9): BSBM With Cluster</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1677" id="link-id0x1dfae510">Benchmarks, Redux (part 10): LOD2 and the Benchmark Process</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1678" id="link-id0x1ef052a0">Benchmarks, Redux (part 11): The Substance of Benchmarks</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1dadddb0">Benchmarks, Redux (part 12): Our Own BSBM Results Report</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1e662ef0">Benchmarks, Redux (part 13): BSBM BI Modifications </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1df6fa70">Benchmarks, Redux (part 14): BSBM BI Mix </a>
</li>
<li>
Benchmarks, Redux (part 15): BSBM Test Driver Enhancements <i>(this post)</i>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-03-22#1687">
  <rss:title>Benchmarks, Redux (part 14): BSBM BI Mix</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-03-22T22:31:32Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">In this post, we look at how we run the BSBM-BI mix. We consider the 100 Mt and 1000 Mt scales with Virtuoso 7 using the same hardware and software as in the previous posts. The changes to workload and metric are given in the previous post. Our intent here is to look at whether the metric works, and to see what results will look like in general. We are as much testing the benchmark as we are testing the system-under-test (SUT). The results shown here will likely not be comparable with future ones because we will most likely change the composition of the workload since it seems a bit out of balance. Anyway, for the sake of disclosure, we attach the query templates. The test driver we used will be made available soon, so the interested may still try a comparison with their systems. If you practice with this workload for the coming races, the effort will surely not be wasted. Once we have come up with a rules document, we will redo all that we have published so far by-the-book, and have it audited as part of the LOD2 service we plan for this (see previous posts in this series). This will introduce comparability; but before we get that far with the BI workload, the workload needs to evolve a bit. Below we show samples of test driver output; the whole output is downloadable. 
100 Mt Single User bsbm/testdriver -runs 1 -w 0 -idir /bs/1 -drill \ -ucf bsbm/usecases/businessIntelligence/sparql.txt \ -dg http://bsbm.org http://localhost:8604/sparql 0: 43348.14ms, total: 43440ms Scale factor: 284826 Explore Endpoints: 1 Update Endpoints: 1 Drilldown: on Number of warmup runs: 0 Seed: 808080 Number of query mix runs (without warmups): 1 times min/max Querymix runtime: 43.3481s / 43.3481s Elapsed runtime: 43.348 seconds QMpH: 83.049 query mixes per hour CQET: 43.348 seconds average runtime of query mix CQET (geom.): 43.348 seconds geometric mean runtime of query mix AQET (geom.): 0.492 seconds geometric mean runtime of query Throughput: 1494.874 BSBM-BI throughput: qph*scale BI Power: 7309.820 BSBM-BI Power: qph*scale (geom) 100 Mt 8 User Thread 6: query mix 3: 195793.09ms, total: 196086.18ms Thread 8: query mix 0: 197843.84ms, total: 198010.50ms Thread 7: query mix 4: 201806.28ms, total: 201996.26ms Thread 2: query mix 5: 221983.93ms, total: 222105.96ms Thread 4: query mix 7: 225127.55ms, total: 225317.49ms Thread 3: query mix 6: 225860.49ms, total: 226050.17ms Thread 5: query mix 2: 230884.93ms, total: 231067.61ms Thread 1: query mix 1: 237836.61ms, total: 237959.11ms Benchmark run completed in 237.985427s Scale factor: 284826 Explore Endpoints: 1 Update Endpoints: 1 Drilldown: on Number of warmup runs: 0 Number of clients: 8 Seed: 808080 Number of query mix runs (without warmups): 8 times min/max Querymix runtime: 195.7931s / 237.8366s Total runtime (sum): 1737.137 seconds Elapsed runtime: 1737.137 seconds QMpH: 121.016 query mixes per hour CQET: 217.142 seconds average runtime of query mix CQET (geom.): 216.603 seconds geometric mean runtime of query mix AQET (geom.): 2.156 seconds geometric mean runtime of query Throughput: 2178.285 BSBM-BI throughput: qph*scale BI Power: 1669.745 BSBM-BI Power: qph*scale (geom) 1000 Mt Single User 0: 608707.03ms, total: 608768ms Scale factor: 2848260 Explore Endpoints: 1 Update Endpoints: 1 Drilldown: on 
Number of warmup runs: 0 Seed: 808080 Number of query mix runs (without warmups): 1 times min/max Querymix runtime: 608.7070s / 608.7070s Elapsed runtime: 608.707 seconds QMpH: 5.914 query mixes per hour CQET: 608.707 seconds average runtime of query mix CQET (geom.): 608.707 seconds geometric mean runtime of query mix AQET (geom.): 5.167 seconds geometric mean runtime of query Throughput: 1064.552 BSBM-BI throughput: qph*scale BI Power: 6967.325 BSBM-BI Power: qph*scale (geom) 1000 Mt 8 User bsbm/testdriver -runs 8 -mt 8 -w 0 -idir /bs/10 -drill \ -ucf bsbm/usecases/businessIntelligence/sparql.txt \ -dg http://bsbm.org http://localhost:8604/sparql Thread 3: query mix 4: 2211275.25ms, total: 2211371.60ms Thread 4: query mix 0: 2212316.87ms, total: 2212417.99ms Thread 8: query mix 3: 2275942.63ms, total: 2276058.03ms Thread 5: query mix 5: 2441378.35ms, total: 2441448.66ms Thread 6: query mix 7: 2804001.05ms, total: 2804098.81ms Thread 2: query mix 2: 2808374.66ms, total: 2808473.71ms Thread 1: query mix 6: 2839407.12ms, total: 2839510.63ms Thread 7: query mix 1: 2889199.23ms, total: 2889263.17ms Benchmark run completed in 2889.302566s Scale factor: 2848260 Explore Endpoints: 1 Update Endpoints: 1 Drilldown: on Number of warmup runs: 0 Number of clients: 8 Seed: 808080 Number of query mix runs (without warmups): 8 times min/max Querymix runtime: 2211.2753s / 2889.1992s Total runtime (sum): 20481.895 seconds Elapsed runtime: 20481.895 seconds QMpH: 9.968 query mixes per hour CQET: 2560.237 seconds average runtime of query mix CQET (geom.): 2544.284 seconds geometric mean runtime of query mix AQET (geom.): 13.556 seconds geometric mean runtime of query Throughput: 1794.205 BSBM-BI throughput: qph*scale BI Power: 2655.678 BSBM-BI Power: qph*scale (geom) Metrics for Query: 1 Count: 8 times executed in whole run Time share 2.120884% of total execution time AQET: 54.299656 seconds (arithmetic mean) AQET(geom.): 34.607302 seconds (geometric mean) QPS: 0.13 Queries per 
second minQET/maxQET: 11.71547600s / 148.65379700s Metrics for Query: 2 Count: 8 times executed in whole run Time share 0.207382% of total execution time AQET: 5.309462 seconds (arithmetic mean) AQET(geom.): 2.737696 seconds (geometric mean) QPS: 1.34 Queries per second minQET/maxQET: 0.78729800s / 25.80948200s Metrics for Query: 3 Count: 8 times executed in whole run Time share 17.650472% of total execution time AQET: 451.893890 seconds (arithmetic mean) AQET(geom.): 410.481088 seconds (geometric mean) QPS: 0.02 Queries per second minQET/maxQET: 171.07262500s / 721.72939200s Metrics for Query: 5 Count: 32 times executed in whole run Time share 6.196565% of total execution time AQET: 39.661685 seconds (arithmetic mean) AQET(geom.): 6.849882 seconds (geometric mean) QPS: 0.18 Queries per second minQET/maxQET: 0.15696500s / 189.00906200s Metrics for Query: 6 Count: 8 times executed in whole run Time share 0.119916% of total execution time AQET: 3.070136 seconds (arithmetic mean) AQET(geom.): 2.056059 seconds (geometric mean) QPS: 2.31 Queries per second minQET/maxQET: 0.41524400s / 7.55655300s Metrics for Query: 7 Count: 40 times executed in whole run Time share 1.577963% of total execution time AQET: 8.079921 seconds (arithmetic mean) AQET(geom.): 1.342079 seconds (geometric mean) QPS: 0.88 Queries per second minQET/maxQET: 0.02205800s / 40.27761500s Metrics for Query: 8 Count: 40 times executed in whole run Time share 72.126818% of total execution time AQET: 369.323481 seconds (arithmetic mean) AQET(geom.): 114.431863 seconds (geometric mean) QPS: 0.02 Queries per second minQET/maxQET: 5.94377300s / 1824.57867400s The CPU for the multiuser runs stays above 1500% for the whole run. The CPU for the single user 100 Mt run is 630%; for the 1000 Mt run, this is 574%. This can be improved since the queries usually have a lot of data to work on. But final optimization is not our goal yet; we are just surveying the race track. 
The difference between a warm single user run and a cold single user run is about 15% with data on SSD; with data on disk, this would be more. The numbers shown are with warm cache. The single-user and multi-user Throughput difference, 1064 single-user vs. 1794 multi-user, is about what one would expect from the CPU utilization. With these numbers, the CPU does not appear badly memory-bound, else the increase would be less; also core multi-threading seems to bring some benefit. If the single-user run was at 800%, the Throughput would be 1488. The speed in excess of this may be attributed to core multi-threading, although we must remember that not every query mix is exactly the same length, so the figure is not exact. Core multi-threading does not seem to hurt, at the very least. Comparison of the same numbers with the column store will be interesting since it misses the cache a lot less and accordingly has better SMP scaling. The Intel Nehalem memory subsystem is really pretty good. For reference, we show a run with Virtuoso 6 at 100Mt. 
0: 424754.40ms, total: 424829ms Scale factor: 284826 Explore Endpoints: 1 Update Endpoints: 1 Drilldown: on Number of warmup runs: 0 Seed: 808080 Number of query mix runs (without warmups): 1 times min/max Querymix runtime: 424.7544s / 424.7544s Elapsed runtime: 424.754 seconds QMpH: 8.475 query mixes per hour CQET: 424.754 seconds average runtime of query mix CQET (geom.): 424.754 seconds geometric mean runtime of query mix AQET (geom.): 1.097 seconds geometric mean runtime of query Throughput: 152.559 BSBM-BI throughput: qph*scale BI Power: 3281.150 BSBM-BI Power: qph*scale (geom) and 8 user Thread 5: query mix 3: 616997.86ms, total: 617042.83ms Thread 7: query mix 4: 625522.18ms, total: 625559.09ms Thread 3: query mix 7: 626247.62ms, total: 626304.96ms Thread 1: query mix 0: 629675.17ms, total: 629724.98ms Thread 4: query mix 6: 667633.36ms, total: 667670.07ms Thread 8: query mix 2: 674206.07ms, total: 674256.72ms Thread 6: query mix 5: 695020.21ms, total: 695052.29ms Thread 2: query mix 1: 701824.67ms, total: 701864.91ms Benchmark run completed in 701.909341s Scale factor: 284826 Explore Endpoints: 1 Update Endpoints: 1 Drilldown: on Number of warmup runs: 0 Number of clients: 8 Seed: 808080 Number of query mix runs (without warmups): 8 times min/max Querymix runtime: 616.9979s / 701.8247s Total runtime (sum): 5237.127 seconds Elapsed runtime: 5237.127 seconds QMpH: 41.031 query mixes per hour CQET: 654.641 seconds average runtime of query mix CQET (geom.): 653.873 seconds geometric mean runtime of query mix AQET (geom.): 2.557 seconds geometric mean runtime of query Throughput: 738.557 BSBM-BI throughput: qph*scale BI Power: 1408.133 BSBM-BI Power: qph*scale (geom) Having the numbers, let us look at the metric and its scaling. We take the geometric mean of the single-user Power and the multiuser Throughput. 100 Mt: sqrt ( 7771 * 2178 ); = 4114 1000 Mt: sqrt ( 6967 * 1794 ); = 3535 Scaling seems to work; the results are in the same general ballpark. 
The real times for the 1000 Mt run are a bit over 10x the times for the 100Mt run, as expected. The relative percentages of the queries are about the same on both scales, with the drill-down in Q8 alone being 77% and 72% respectively. The Q8 drill-down starts at the root of the product hierarchy. If we made this start one level from the top, its share would drop. This seems reasonable. Conversely, Q2 is out of place, with far too little share of the time. It takes a product as a starting point and shows a list of products with common features, sorted by descending count of common features. This would more appropriately be applied to a leaf product category instead, measuring how many of the products in the category have the top 20 features found in this category, to name an example. Also there should be more queries. At present it appears that BSBM-BI is definitely runnable, but a cursory look suffices to show that the workload needs more development and variety. We remember that I dreamt up the business questions last fall without much analysis, and that these questions were subsequently translated to SPARQL by FU Berlin. So, on one hand, BSBM-BI is of crucial importance because it is the first attempt at doing a benchmark with long running queries in SPARQL. On the other hand, BSBM-BI is not very good as a benchmark; TPC-H is a lot better. This stands to reason, as TPC-H has had years and years of development and participation by many people. Benchmark queries are trick questions: For example, TPC-H Q18 cannot be done without changing an IN into a JOIN with the IN subquery in the outer loop and doing streaming aggregation. Q13 cannot be done without a well-optimized HASH JOIN which besides must be partitioned at the larger scales. Having such trick questions in an important benchmark eventually results in everybody doing the optimizations that the benchmark clearly calls for. 
Making benchmarks thus entails a responsibility ultimately to the end user, because an irrelevant benchmark might in the worst case send developers chasing things that are beside the point. In the following, we will look at what BSBM-BI requires from the database and how these requirements can be further developed and extended. BSBM-BI does not have any clear trick questions, at least not premeditatedly. BSBM-BI just requires a cost model that can guess the fanout of a JOIN and the cardinality of a GROUP BY; it is enough to distinguish smaller from greater; the guess does not otherwise have to be very good. Further, the queries are written in the benchmark text so that joining from left to right would work, so not even a cost-based optimizer is strictly needed. I did however have to add some cardinality statistics to get reasonable JOIN order since we always reorder the query regardless of the source formulation. BSBM-BI does have variable selectivity from the drill-downs; thus these may call for different JOIN orders for different parameter values. I have not looked into whether this really makes a difference, though. There are places in BSBM-BI where using a HASH JOIN makes sense. We do not use HASH JOINs with RDF because there is an index for everything and making a HASH JOIN in the wrong place can have a large up-front cost, so one is more robust against cost model errors if one does not do HASH JOINs. This said, a HASH JOIN in the right place is a lot better than an index lookup. With TPC-H Q13, our best HASH JOIN is over 2x better than the best INDEX-based JOIN, both being well tuned. For questions like &quot;count the hairballs made in Germany reviewed by Japanese Hello Kitty fans,&quot; where two ends of a JOIN path are fairly selective doing the other as a HASH JOIN is good. This can, if the JOIN is always cardinality-reducing, even be merged inside an INDEX lookup. 
We have such capabilities since we have been for a while gearing up for the relational races, but are not using any of these with BSBM-BI, although they would be useful. Let us see the profile for a single user 100 Mt run. The database activity summary is -- select db_activity (0, &#39;http&#39;); 161.3M rnd  210.2M seq      0 same seg   104.5M same pg  45.08M same par      0 disk      0 spec disk      0B /      0 messages  2.393K fork See the post &quot;What Does BSBM Explore Measure&quot; for an explanation of the numbers. We see that there is more sequential access than random and the random has fair locality with over half on the same page as the previous and a lot of the rest falling under the same parent. Funnily enough, the explore mix has more locality. Running with a longer vector size would probably increase performance by getting better locality. There is an optimization that adjusts vector size on the fly if locality is not sufficient but this is not being used here. So we manually set vector size to 100000 instead of the default 10000. We get -- 172.4M rnd  220.8M seq      0 same seg   149.6M same pg  10.99M same par     21 disk    861 spec disk      0B /      0 messages     754 fork The throughput goes from 1494 to 1779. We see more hits on the same page, as expected. We do not make this setting a default since it raises the cost for small queries; therefore the vector size must be self-adjusting -- besides, expecting a DBA to tune this is not reasonable. We will just have to correctly tune the self-adjust logic, and we have again clear gains. Let us now go back to the first run with vector size 10000. 
The top of the CPU oprofile is as follows: 722309 15.4507 cmpf_iri64n_iri64n 434791 9.3005 cmpf_iri64n_iri64n_anyn_iri64n 294712 6.3041 itc_next_set 273488 5.8501 itc_vec_split_search 203970 4.3631 itc_dive_transit 199687 4.2714 itc_page_rcf_search 181614 3.8848 dc_itc_append_any 173043 3.7015 itc_bm_vec_row_check 146727 3.1386 cmpf_int64n 128224 2.7428 itc_vec_row_check 113515 2.4282 dk_alloc 97296 2.0812 page_wait_access 62523 1.3374 qst_vec_get_int64 59014 1.2623 itc_next_set_parent 53589 1.1463 sslr_qst_get 48003 1.0268 ds_add 46641 0.9977 dk_free_tree 44551 0.9530 kc_var_col 43650 0.9337 page_col_cmp_1 35297 0.7550 cmpf_iri64n_iri64n_anyn_gt_lt 34589 0.7399 dv_compare 25864 0.5532 cmpf_iri64n_anyn_iri64n_iri64n_lte 23088 0.4939 dk_free The top 10 are all index traversal, with the key compare for two leading IRI keys in the lead, corresponding to a lookup with P and S given. The one after that is with all parts given, corresponding to an existence test. The existence tests could probably be converted to HASH JOIN lookups to good advantage. Aggregation and arithmetic are absent. We should probably add a query like TPC-H Q1 that does nothing but these two. Considering the overall profile, GROUP BY seems to be around 3%. We should probably put in a query that makes a very large number of groups and could make use of streaming aggregation, i.e., take advantage of a situation where aggregation input comes already grouped by the grouping columns. A BI use case should offer no problem with including arithmetic, but there are not that many numbers in the BSBM set. Some code sections in the queries with conditional execution and costly tests inside ANDs and ORs would be good. TPC-H has such in Q21 and Q19. An OR with existences where there would be gain from good guesses of a subquery&#39;s selectivity would be appropriate. Also, there should be conditional expressions somewhere with a lot of data, like the CASE-WHEN in TPC-H Q12. 
We can make BSBM-BI more interesting by putting in the above. Also we will have to see where we can profit from HASH JOIN, both small and large. There should be such places in the workload already so this is a matter of just playing a bit more. This post amounts to a cheat sheet for the BSBM-BI runs a bit farther down the road. By then we should be operational with the column store and Virtuoso 7 Cluster, though, so not everything is yet on the table. Benchmarks, Redux Series Benchmarks, Redux (part 1): On RDF Benchmarks Benchmarks, Redux (part 2): A Benchmarking Story Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs Benchmarks, Redux (part 6): BSBM and I/O, continued Benchmarks, Redux (part 7): What Does BSBM Explore Measure? Benchmarks, Redux (part 8): BSBM Explore and Update Benchmarks, Redux (part 9): BSBM With Cluster Benchmarks, Redux (part 10): LOD2 and the Benchmark Process Benchmarks, Redux (part 11): The Substance of Benchmarks Benchmarks, Redux (part 12): Our Own BSBM Results Report Benchmarks, Redux (part 13): BSBM-BI Modifications Benchmarks, Redux (part 14): BSBM-BI Mix (this post) Benchmarks, Redux (part 15): BSBM Test Driver Enhancements</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>In this post, we look at how we run the <a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x236dcda8">BSBM</a>-BI mix.  We consider the 100 Mt and 1000 Mt scales with <a class="auto-href" href="http://virtuoso.openlinksw.com" id="link-id0x284893c0">Virtuoso</a> 7 using the same hardware and software as in the previous posts.  The changes to workload and metric are given in the previous post.</p>

<p>Our intent here is to look at whether the metric works, and to see what results will look like in general.  We are as much testing the benchmark as we are testing the system-under-test (SUT).  The results shown here will likely not be comparable with future ones, because we will most likely change the composition of the workload, which seems a bit out of balance.  Still, for the sake of disclosure, we attach the query templates.  The test driver we used will be made available soon, so interested readers may still try a comparison with their own systems. If you practice with this workload for the coming races, the effort will surely not be wasted.</p>


<p>Once we have come up with a rules document, we will redo all that we have published so far by the book, and have it audited as part of the <a class="auto-href" href="http://lod2.eu/" id="link-id0x23724860">LOD2</a> service we plan for this (see previous posts in this series).  This will introduce comparability; but before we get that far with the BI workload, the workload needs to evolve a bit.</p>

<p>Below we show samples of test driver output; the whole output is <a href="http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/BenchmarksReduxSupportingFiles/br.tar.gz" id="link-id0x1b703ad8">downloadable</a>.</p>

<p>100 Mt Single User</p>

<blockquote>
 <code><pre>
bsbm/testdriver   -runs 1   -w 0 -idir /bs/1  -drill  \
   -ucf bsbm/usecases/businessIntelligence/sparql.txt  \
   -dg http://bsbm.org http://localhost:8604/sparql
</pre>
 </code>
</blockquote>

<blockquote>
 <code><pre>
0: 43348.14ms, total: 43440ms

Scale factor:           284826
Explore Endpoints:      1
Update Endpoints:       1
Drilldown:              on
Number of warmup runs:  0
Seed:                   808080
Number of query mix runs (without warmups): 1 times
min/max Querymix runtime:    43.3481s / 43.3481s
Elapsed runtime:        43.348 seconds
QMpH:                   83.049 query mixes per hour
CQET:                   43.348 seconds average runtime of query mix
CQET (geom.):           43.348 seconds geometric mean runtime of query mix
AQET (geom.):           0.492 seconds geometric mean runtime of query
Throughput:             1494.874 BSBM-BI throughput: qph*scale
BI Power:               7309.820 BSBM-BI Power: qph*scale (geom)
</pre>
 </code>
</blockquote>
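<p>The summary figures in the listing above can be re-derived from the raw timings.  The following is only a sketch: the 18 queries per mix and the choice of scale factor 284826 as scale 1 are inferred from the listings in this post (144 query executions over 8 mixes), not taken from the benchmark rules, and the function names are made up for illustration.</p>

```python
# Sketch: re-derive the summary metrics of the 100 Mt single user listing.
# ASSUMPTIONS (inferred from the listings, not stated in this post):
#   - one query mix holds 18 queries (144 executions over 8 mixes),
#   - scale 1 corresponds to scale factor 284826, i.e. 100 Mt.
QUERIES_PER_MIX = 18
BASE_SCALE_FACTOR = 284826


def scale_of(scale_factor):
    """Dataset scale relative to the 100 Mt dataset."""
    return scale_factor / BASE_SCALE_FACTOR


def qmph(mixes, elapsed_seconds):
    """Query mixes per hour."""
    return mixes * 3600.0 / elapsed_seconds


def throughput(mixes, elapsed_seconds, scale_factor):
    """Queries per hour times scale."""
    return qmph(mixes, elapsed_seconds) * QUERIES_PER_MIX * scale_of(scale_factor)


def power(geom_mean_query_seconds, scale_factor):
    """Queries per hour from the geometric mean query time, times scale."""
    return 3600.0 / geom_mean_query_seconds * scale_of(scale_factor)


single_qmph = qmph(1, 43.34814)
single_throughput = throughput(1, 43.34814, 284826)
single_power = power(0.492, 284826)  # 0.492 is already rounded in the listing

print(f"QMpH:       {single_qmph:.3f}")
print(f"Throughput: {single_throughput:.3f}")
print(f"BI Power:   {single_power:.1f}")
```

<p>QMpH and Throughput agree with the listing.  The Power comes out slightly above the listed 7309.820 because the listing prints AQET (geom.) rounded to three decimals.</p>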



<p>100 Mt 8 User </p>

<blockquote>
 <code><pre>
Thread 6: query mix 3: 195793.09ms, total: 196086.18ms
Thread 8: query mix 0: 197843.84ms, total: 198010.50ms
Thread 7: query mix 4: 201806.28ms, total: 201996.26ms
Thread 2: query mix 5: 221983.93ms, total: 222105.96ms
Thread 4: query mix 7: 225127.55ms, total: 225317.49ms
Thread 3: query mix 6: 225860.49ms, total: 226050.17ms
Thread 5: query mix 2: 230884.93ms, total: 231067.61ms
Thread 1: query mix 1: 237836.61ms, total: 237959.11ms
Benchmark run completed in 237.985427s

Scale factor:           284826
Explore Endpoints:      1
Update Endpoints:       1
Drilldown:              on
Number of warmup runs:  0
Number of clients:      8
Seed:                   808080
Number of query mix runs (without warmups): 8 times
min/max Querymix runtime:    195.7931s / 237.8366s
Total runtime (sum):    1737.137 seconds
Elapsed runtime:        1737.137 seconds
QMpH:                   121.016 query mixes per hour
CQET:                   217.142 seconds average runtime of query mix
CQET (geom.):           216.603 seconds geometric mean runtime of query mix
AQET (geom.):           2.156 seconds geometric mean runtime of query
Throughput:             2178.285 BSBM-BI throughput: qph*scale
BI Power:               1669.745 BSBM-BI Power: qph*scale (geom)
</pre>
 </code>
</blockquote>
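<p>For the multiuser run, the per-mix figures are plain means over the eight mix runtimes, while QMpH is consistent with the wall-clock time of the whole run (237.985 s), not with the summed runtime that the listing also prints as "Elapsed runtime".  A small sketch that checks this from the numbers above; the 18 queries per mix is again an inference from the listings, not a documented constant.</p>

```python
import math

# Query mix runtimes in ms, copied from the 100 Mt 8 User listing.
mix_ms = [195793.09, 197843.84, 201806.28, 221983.93,
          225127.55, 225860.49, 230884.93, 237836.61]
wall_clock_s = 237.985427   # "Benchmark run completed in"
QUERIES_PER_MIX = 18        # ASSUMPTION: inferred, 144 executions / 8 mixes

mix_s = [ms / 1000.0 for ms in mix_ms]

total_s = math.fsum(mix_s)                 # Total runtime (sum)
cqet = total_s / len(mix_s)                # arithmetic mean mix runtime
cqet_geom = math.exp(                      # geometric mean mix runtime
    math.fsum(math.log(s) for s in mix_s) / len(mix_s))
qmph = len(mix_s) * 3600.0 / wall_clock_s  # mixes per wall-clock hour
throughput = qmph * QUERIES_PER_MIX        # scale taken as 1 at 100 Mt

print(f"Total runtime (sum): {total_s:.3f} s")
print(f"CQET:                {cqet:.3f} s")
print(f"CQET (geom.):        {cqet_geom:.3f} s")
print(f"QMpH:                {qmph:.3f}")
print(f"Throughput:          {throughput:.3f}")
```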


<p>1000 Mt Single User</p>

<blockquote>
 <code><pre>
0: 608707.03ms, total: 608768ms

Scale factor:           2848260
Explore Endpoints:      1
Update Endpoints:       1
Drilldown:              on
Number of warmup runs:  0
Seed:                   808080
Number of query mix runs (without warmups): 1 times
min/max Querymix runtime:    608.7070s / 608.7070s
Elapsed runtime:        608.707 seconds
QMpH:                   5.914 query mixes per hour
CQET:                   608.707 seconds average runtime of query mix
CQET (geom.):           608.707 seconds geometric mean runtime of query mix
AQET (geom.):           5.167 seconds geometric mean runtime of query
Throughput:             1064.552 BSBM-BI throughput: qph*scale
BI Power:               6967.325 BSBM-BI Power: qph*scale (geom)
</pre>
 </code>
</blockquote>


<p>1000 Mt 8 User </p>

<blockquote>
 <code><pre>
bsbm/testdriver   -runs 8 -mt 8  -w 0 -idir /bs/10  -drill  \
   -ucf bsbm/usecases/businessIntelligence/sparql.txt   \
   -dg http://bsbm.org http://localhost:8604/sparql
</pre>
 </code>
</blockquote>

<blockquote>
 <code><pre>
Thread 3: query mix 4: 2211275.25ms, total: 2211371.60ms
Thread 4: query mix 0: 2212316.87ms, total: 2212417.99ms
Thread 8: query mix 3: 2275942.63ms, total: 2276058.03ms
Thread 5: query mix 5: 2441378.35ms, total: 2441448.66ms
Thread 6: query mix 7: 2804001.05ms, total: 2804098.81ms
Thread 2: query mix 2: 2808374.66ms, total: 2808473.71ms
Thread 1: query mix 6: 2839407.12ms, total: 2839510.63ms
Thread 7: query mix 1: 2889199.23ms, total: 2889263.17ms
Benchmark run completed in 2889.302566s

Scale factor:           2848260
Explore Endpoints:      1
Update Endpoints:       1
Drilldown:              on
Number of warmup runs:  0
Number of clients:      8
Seed:                   808080
Number of query mix runs (without warmups): 8 times
min/max Querymix runtime:    2211.2753s / 2889.1992s
Total runtime (sum):    20481.895 seconds
Elapsed runtime:        20481.895 seconds
QMpH:                   9.968 query mixes per hour
CQET:                   2560.237 seconds average runtime of query mix
CQET (geom.):           2544.284 seconds geometric mean runtime of query mix
AQET (geom.):           13.556 seconds geometric mean runtime of query
Throughput:             1794.205 BSBM-BI throughput: qph*scale
BI Power:               2655.678 BSBM-BI Power: qph*scale (geom)

Metrics for Query:      1
Count:                  8 times executed in whole run
Time share              2.120884% of total execution time
AQET:                   54.299656 seconds (arithmetic mean)
AQET(geom.):            34.607302 seconds (geometric mean)
QPS:                    0.13 Queries per second
minQET/maxQET:          11.71547600s / 148.65379700s

Metrics for Query:      2
Count:                  8 times executed in whole run
Time share              0.207382% of total execution time
AQET:                   5.309462 seconds (arithmetic mean)
AQET(geom.):            2.737696 seconds (geometric mean)
QPS:                    1.34 Queries per second
minQET/maxQET:          0.78729800s / 25.80948200s

Metrics for Query:      3
Count:                  8 times executed in whole run
Time share              17.650472% of total execution time
AQET:                   451.893890 seconds (arithmetic mean)
AQET(geom.):            410.481088 seconds (geometric mean)
QPS:                    0.02 Queries per second
minQET/maxQET:          171.07262500s / 721.72939200s

Metrics for Query:      5
Count:                  32 times executed in whole run
Time share              6.196565% of total execution time
AQET:                   39.661685 seconds (arithmetic mean)
AQET(geom.):            6.849882 seconds (geometric mean)
QPS:                    0.18 Queries per second
minQET/maxQET:          0.15696500s / 189.00906200s

Metrics for Query:      6
Count:                  8 times executed in whole run
Time share              0.119916% of total execution time
AQET:                   3.070136 seconds (arithmetic mean)
AQET(geom.):            2.056059 seconds (geometric mean)
QPS:                    2.31 Queries per second
minQET/maxQET:          0.41524400s / 7.55655300s

Metrics for Query:      7
Count:                  40 times executed in whole run
Time share              1.577963% of total execution time
AQET:                   8.079921 seconds (arithmetic mean)
AQET(geom.):            1.342079 seconds (geometric mean)
QPS:                    0.88 Queries per second
minQET/maxQET:          0.02205800s / 40.27761500s

Metrics for Query:      8
Count:                  40 times executed in whole run
Time share              72.126818% of total execution time
AQET:                   369.323481 seconds (arithmetic mean)
AQET(geom.):            114.431863 seconds (geometric mean)
QPS:                    0.02 Queries per second
minQET/maxQET:          5.94377300s / 1824.57867400s
</pre>
 </code>
</blockquote>
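<p>The "Time share" lines follow from the execution counts and arithmetic mean times: a query's share is its count times its AQET, divided by the summed runtime of all queries.  The sketch below recomputes the shares from the per-query figures listed above; the variable names are illustrative only.</p>

```python
# (executions, arithmetic mean AQET in seconds) per query,
# copied from the 1000 Mt 8 User listing.
per_query = {
    1: (8, 54.299656),
    2: (8, 5.309462),
    3: (8, 451.893890),
    5: (32, 39.661685),
    6: (8, 3.070136),
    7: (40, 8.079921),
    8: (40, 369.323481),
}
MIXES = 8  # number of query mix runs in this listing

# Summed execution time over all queries; matches "Total runtime (sum)".
total_s = sum(count * aqet for count, aqet in per_query.values())

# Percentage of total execution time spent in each query.
share_pct = {q: 100.0 * count * aqet / total_s
             for q, (count, aqet) in per_query.items()}

executions = sum(count for count, _ in per_query.values())
queries_per_mix = executions / MIXES

for q in sorted(share_pct):
    print(f"Q{q}: {share_pct[q]:9.6f}% of total execution time")
print(f"total: {total_s:.3f} s, queries per mix: {queries_per_mix:g}")
```

<p>The same counts give 144 executions over 8 mixes, that is, 18 queries per mix, with the drill-down queries Q5, Q7, and Q8 run several times per mix.</p>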



<p><a class="auto-href" href="http://dbpedia.org/resource/Central_processing_unit" id="link-id0x2809d998">CPU</a> utilization for the multiuser runs stays above 1500% for the whole run. For the single user 100 Mt run it is 630%; for the single user 1000 Mt run, 574%. This can be improved since the queries usually have a lot of <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x22cf75b8">data</a> to work on.  But final <a class="auto-href" href="http://dbpedia.org/resource/Program_optimization" id="link-id0x238b94c8">optimization</a> is not our goal yet; we are just surveying the race track. The difference between a warm single user run and a cold single user run is about 15% with data on SSD; with data on disk, the difference would be larger.  The numbers shown are with warm <a class="auto-href" href="http://dbpedia.org/resource/Cache" id="link-id0x23ad8c08">cache</a>.  The difference between single-user and multi-user Throughput at 1000 Mt, 1064 vs. 1794, is about what one would expect from the CPU utilization.</p>

<p>With these numbers, the CPU does not appear badly memory-bound, else the increase would be less; also core multi-threading seems to bring some benefit.  If the single-user run were at 800%, the Throughput would be 1488.  The speed in excess of this may be attributed to core multi-threading, although we must remember that not every query mix is exactly the same length, so the figure is not exact.  Core multi-threading does not seem to hurt, at the very least.  Comparison of the same numbers with the column store will be interesting since it misses the cache a lot less and accordingly has better SMP scaling. The <a class="auto-href" href="http://dbpedia.org/resource/Intel_Corporation" id="link-id0x23568308">Intel</a> Nehalem memory subsystem is really pretty good.</p>
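<p>The extrapolation to 800% is a simple linear scaling of the single-user Throughput by CPU utilization.  A rough sketch with the rounded figures quoted in the text; it lands a little below the 1488 given above, presumably because the CPU percentage used here is rounded, though that is a guess.</p>

```python
# Figures quoted in the post for the 1000 Mt runs.
single_throughput = 1064.552   # single user Throughput
multi_throughput = 1794.205    # 8 user Throughput
single_cpu_pct = 574.0         # CPU utilization of the single user run
target_cpu_pct = 800.0         # the 800% case considered in the text

# Linear extrapolation: Throughput if the single user run used 800% CPU.
projected = single_throughput * target_cpu_pct / single_cpu_pct

# How far the measured multiuser Throughput exceeds that projection.
excess = multi_throughput / projected

print(f"projected Throughput at 800%: {projected:.0f}")
print(f"multiuser / projected:        {excess:.2f}")
```

<p>The multiuser run comes out roughly a fifth above the projection, which is the margin the text tentatively attributes to core multi-threading.</p>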
<p>For reference, we show a single user run with Virtuoso 6 at 100 Mt:</p>

<blockquote>
 <code><pre>
0: 424754.40ms, total: 424829ms

Scale factor:           284826
Explore Endpoints:      1
Update Endpoints:       1
Drilldown:              on
Number of warmup runs:  0
Seed:                   808080
Number of query mix runs (without warmups): 1 times
min/max Querymix runtime:    424.7544s / 424.7544s
Elapsed runtime:        424.754 seconds
QMpH:                   8.475 query mixes per hour
CQET:                   424.754 seconds average runtime of query mix
CQET (geom.):           424.754 seconds geometric mean runtime of query mix
AQET (geom.):           1.097 seconds geometric mean runtime of query
Throughput:             152.559 BSBM-BI throughput: qph*scale
BI Power:               3281.150 BSBM-BI Power: qph*scale (geom)
</pre>
 </code>
</blockquote>


<p>and the 8 user run with Virtuoso 6 at 100 Mt:</p>

<blockquote>
 <code><pre>
Thread 5: query mix 3: 616997.86ms, total: 617042.83ms
Thread 7: query mix 4: 625522.18ms, total: 625559.09ms
Thread 3: query mix 7: 626247.62ms, total: 626304.96ms
Thread 1: query mix 0: 629675.17ms, total: 629724.98ms
Thread 4: query mix 6: 667633.36ms, total: 667670.07ms
Thread 8: query mix 2: 674206.07ms, total: 674256.72ms
Thread 6: query mix 5: 695020.21ms, total: 695052.29ms
Thread 2: query mix 1: 701824.67ms, total: 701864.91ms
Benchmark run completed in 701.909341s

Scale factor:           284826
Explore Endpoints:      1
Update Endpoints:       1
Drilldown:              on
Number of warmup runs:  0
Number of clients:      8
Seed:                   808080
Number of query mix runs (without warmups): 8 times
min/max Querymix runtime:    616.9979s / 701.8247s
Total runtime (sum):    5237.127 seconds
Elapsed runtime:        5237.127 seconds
QMpH:                   41.031 query mixes per hour
CQET:                   654.641 seconds average runtime of query mix
CQET (geom.):           653.873 seconds geometric mean runtime of query mix
AQET (geom.):           2.557 seconds geometric mean runtime of query
Throughput:             738.557 BSBM-BI throughput: qph*scale
BI Power:               1408.133 BSBM-BI Power: qph*scale (geom)
</pre>
 </code>
</blockquote>




<p>Having the numbers, let us look at the metric and its scaling.  We take the geometric mean of the single-user Power and the multiuser Throughput.</p>


<blockquote>
 <code><pre>
 100 Mt: sqrt ( 7771 * 2178 ); = 4114

1000 Mt: sqrt ( 6967 * 1794 ); = 3535
</pre>
 </code>
</blockquote>
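<p>The composite figure is just the geometric mean of two numbers.  A minimal sketch that reproduces the two results above; the function name is made up for illustration.  Note that the 100 Mt single user listing earlier in this post shows a Power of 7309.820 rather than 7771; with that value the composite would be about 3990, which is still in the same ballpark.</p>

```python
import math


def composite(single_user_power, multi_user_throughput):
    """Geometric mean of single-user Power and multiuser Throughput."""
    return math.sqrt(single_user_power * multi_user_throughput)


# Inputs as used in the calculation in the post.
metric_100mt = composite(7771, 2178)
metric_1000mt = composite(6967, 1794)

# Same calculation with the Power printed in the 100 Mt listing.
metric_100mt_from_listing = composite(7309.820, 2178.285)

print(f" 100 Mt: {metric_100mt:.0f}")
print(f"1000 Mt: {metric_1000mt:.0f}")
print(f" 100 Mt (Power from listing): {metric_100mt_from_listing:.0f}")
```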


<p>Scaling seems to work; the results are in the same general ballpark.  The real times for the 1000 Mt run are a bit over 10x the times for the 100 Mt run, as expected. The relative percentages of the queries are about the same on both scales, with the drill-down in Q8 alone taking 77% of the time at 100 Mt and 72% at 1000 Mt. The Q8 drill-down starts at the root of the product hierarchy.  If we made it start one level from the top, its share would drop, which seems a reasonable change.</p>

<p>Conversely, Q2 is out of place, with far too little share of the time. It takes a product as a starting point and shows a list of products with common features, sorted by descending count of common features. This would more appropriately be applied to a leaf product category instead, measuring how many of the products in the category have the top 20 features found in this category, to name an example.</p>

<p>Also there should be more queries.</p>

<p>At present it appears that BSBM-BI is definitely runnable, but a cursory look suffices to show that the workload needs more development and variety.  Recall that I dreamt up the business questions last fall without much analysis, and that these questions were subsequently translated to SPARQL by FU Berlin.  So, on one hand, BSBM-BI is of crucial importance because it is the first attempt at doing a benchmark with long-running queries in SPARQL.  On the other hand, BSBM-BI is not very good as a benchmark; <a class="auto-href" href="http://www.tpc.org/" id="link-id0x23872a10">TPC</a>-<a class="auto-href" href="http://dbpedia.org/resource/TPC-H" id="link-id0x28487d98">H</a> is a lot better.  This stands to reason, as TPC-H has had years and years of development and participation by many people.</p>

<p>Benchmark queries are trick questions: For example, TPC-H Q18 cannot be done without changing an <code>IN</code> into a <code>JOIN</code> with the <code>IN</code> subquery in the outer loop and doing streaming aggregation.  Q13 cannot be done without a well-optimized <code><a class="auto-href" href="http://dbpedia.org/resource/Hash_join" id="link-id0x24974830">HASH JOIN</a></code>, which, moreover, must be partitioned at the larger scales.</p>

<p>Having such trick questions in an important benchmark eventually results in everybody doing the optimizations that the benchmark clearly calls for.  Making benchmarks thus entails a responsibility ultimately to the end user, because an irrelevant benchmark might in the worst case send developers chasing things that are beside the point.</p>


<p>In the following, we will look at what BSBM-BI requires from the database and how these requirements can be further developed and extended.</p>

<p>BSBM-BI does not have any clear trick questions, at least not premeditatedly. BSBM-BI just requires a cost model that can guess the fanout of a <code>JOIN</code> and the cardinality of a <code>GROUP BY</code>; it is enough to distinguish smaller from greater; the guess does not otherwise have to be very good. Further, the queries are written in the benchmark text so that joining from left to right would work, so not even a cost-based optimizer is strictly needed.  I did however have to add some cardinality statistics to get reasonable <code>JOIN</code> order since we always reorder the query regardless of the source formulation.</p>

<p>BSBM-BI does have variable selectivity from the drill-downs; thus these may call for different <code>JOIN</code> orders for different parameter values.  I have not looked into whether this really makes a difference, though.</p>

<p>There are places in BSBM-BI where using a <code>HASH JOIN</code> makes sense.  We do not use <code>HASH JOINs</code> with <a class="auto-href" href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x23cbf908">RDF</a> because there is an index for everything and making a <code>HASH JOIN</code> in the wrong place can have a large up-front cost, so one is more robust against cost model errors if one does not do <code>HASH JOINs</code>.  This said, a <code>HASH JOIN</code> in the right place is a lot better than an index lookup.  With TPC-H Q13, our best <code>HASH JOIN</code> is over 2x better than the best <code>INDEX</code>-based <code>JOIN</code>, both being well tuned.  For questions like &quot;count the hairballs made in <a class="auto-href" href="http://dbpedia.org/resource/Germany" id="link-id0x249d3e28">Germany</a> reviewed by Japanese Hello Kitty fans,&quot; where the two ends of a <code>JOIN</code> path are both fairly selective, it is good to do one of them as a <code>HASH JOIN</code>.  This can, if the <code>JOIN</code> is always cardinality-reducing, even be merged inside an <code>INDEX</code> lookup.  We have such capabilities, since we have for a while been gearing up for the relational races, but we are not using any of them with BSBM-BI, although they would be useful.</p>
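<p>To make the trade-off concrete, here is a toy hash join in Python. It is a sketch of the general build-and-probe technique only, not of Virtuoso's implementation; the row layout and the names <code>hash_join</code>, <code>made_in</code>, and <code>reviewer_country</code> are invented for the illustration.</p>

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, key):
    """Join two lists of dict rows on `key` by building a hash table first."""
    # Build phase: the whole build side is read before any result comes out.
    # This is the up-front cost that hurts if the wrong side is chosen.
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)
    # Probe phase: one hash lookup per probe row instead of an index lookup.
    for row in probe_rows:
        for match in table.get(row[key], ()):
            yield {**match, **row}

# Toy data (hypothetical): products with country of origin, reviews with reviewer country.
products = [
    {"product": 1, "made_in": "DE"},
    {"product": 2, "made_in": "FR"},
    {"product": 3, "made_in": "DE"},
]
reviews = [
    {"product": 1, "reviewer_country": "JP"},
    {"product": 2, "reviewer_country": "JP"},
    {"product": 3, "reviewer_country": "US"},
    {"product": 1, "reviewer_country": "JP"},
]

# Both ends are filtered first, then the smaller end is used as the build side.
german = [p for p in products if p["made_in"] == "DE"]
japanese = [r for r in reviews if r["reviewer_country"] == "JP"]
count = sum(1 for _ in hash_join(german, japanese, "product"))
print(count)
```

<p>The point of the sketch is the shape of the cost: the build side is paid in full before the first row is produced, which is cheap when that side is selective and expensive when the cost model has guessed wrong.</p>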
 

<p>Let us see the profile for a single user 100 Mt run.</p>

<p>The database activity summary is --</p>

<p>
<code>select db_activity (0, &#39;http&#39;);</code>
</p>

<p>
<code> 161.3M rnd  210.2M seq      0 same seg   104.5M same pg  45.08M same par      0 disk      0 spec disk      0B /      0 messages  2.393K fork</code>
</p>


<p>See the post &quot;<a href="http://www.openlinksw.com/weblog/oerling/?id=1671" id="link-id0x1b1f3068">What Does BSBM Explore Measure</a>&quot; for an explanation of the numbers.  We see that there is more sequential access than random and the random has fair locality with over half on the same page as the previous and a lot of the rest falling under the same parent. Funnily enough, the explore mix has more locality.  Running with a longer vector size would probably increase performance by getting better locality.  There is an optimization that adjusts vector size on the fly if locality is not sufficient but this is not being used here. So we manually set vector size to 100000 instead of the default 10000. We get --</p>

<p>
<code> 172.4M rnd  220.8M seq      0 same seg   149.6M same pg  10.99M same par     21 disk    861 spec disk      0B /      0 messages     754 fork</code>
</p>


<p>The throughput goes from 1494 to 1779.  We see more hits on the same page, as expected.  We do not make this setting a default since it raises the cost for small queries; therefore the vector size must be self-adjusting -- besides, expecting a DBA to tune this is not reasonable. We will just have to tune the self-adjust logic correctly, and we will again have clear gains.</p>
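<p>The locality figures can be turned into shares with a short calculation. The sketch below assumes that the <code>same pg</code> and <code>same par</code> counters are subsets of the <code>rnd</code> counter, which is how we read them above; the function name <code>locality_shares</code> is ours.</p>

```python
def locality_shares(rnd, same_pg, same_par):
    """Fractions of random lookups that hit the same page or the same parent."""
    return same_pg / rnd, same_par / rnd

# Counts in millions, copied from the two db_activity lines in this post.
runs = {
    "vector size 10000": (161.3, 104.5, 45.08),
    "vector size 100000": (172.4, 149.6, 10.99),
}
for label, counts in runs.items():
    same_page, same_parent = locality_shares(*counts)
    print(f"{label}: same page {same_page:.1%}, same parent {same_parent:.1%}")

# Relative throughput gain from the longer vector size.
gain = 1779 / 1494 - 1
print(f"throughput gain: {gain:.1%}")
```

<p>Under that reading, roughly 65% of random lookups hit the same page with the default vector size and roughly 87% with the longer one, for a throughput gain of about 19%.</p>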

<p>Let us now go back to the first run with vector size 10000.</p>

<p>The top of the CPU <code>oprofile</code> is as follows:</p>

<blockquote>
 <code><pre>
722309   15.4507  cmpf_iri64n_iri64n
434791    9.3005  cmpf_iri64n_iri64n_anyn_iri64n
294712    6.3041  itc_next_set
273488    5.8501  itc_vec_split_search
203970    4.3631  itc_dive_transit
199687    4.2714  itc_page_rcf_search
181614    3.8848  dc_itc_append_any
173043    3.7015  itc_bm_vec_row_check
146727    3.1386  cmpf_int64n
128224    2.7428  itc_vec_row_check
113515    2.4282  dk_alloc
97296     2.0812  page_wait_access
62523     1.3374  qst_vec_get_int64
59014     1.2623  itc_next_set_parent
53589     1.1463  sslr_qst_get
48003     1.0268  ds_add
46641     0.9977  dk_free_tree
44551     0.9530  kc_var_col
43650     0.9337  page_col_cmp_1
35297     0.7550  cmpf_iri64n_iri64n_anyn_gt_lt
34589     0.7399  dv_compare
25864     0.5532  cmpf_iri64n_anyn_iri64n_iri64n_lte
23088     0.4939  dk_free
</pre>
 </code>
</blockquote>

<p>The top 10 are all index traversal, with the key compare for two leading IRI keys in the lead, corresponding to a lookup with <code>P</code> and <code>S</code> given.  The one after that is with all parts given, corresponding to an existence test.  The existence tests could probably be converted to <code>HASH JOIN</code> lookups to good advantage.  Aggregation and arithmetic are absent.  We should probably add a query like TPC-H Q1 that does nothing but these two.  Considering the overall profile, <code>GROUP BY</code> seems to be around 3%.  We should probably put in a query that makes a very large number of groups and could make use of streaming aggregation, i.e., take advantage of a situation where aggregation input comes already grouped by the grouping columns.</p>
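<p>For readers unfamiliar with streaming aggregation, the following Python sketch contrasts it with hash-based aggregation. It illustrates the general technique only and says nothing about how Virtuoso implements either; the function names and the sample rows are made up.</p>

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def hash_sum(rows):
    """Hash aggregation: keeps one entry per group in memory until the end."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)

def streaming_sum(rows):
    """Streaming aggregation: requires input already ordered by the grouping key.

    Only the current group is held in memory, and each group is emitted
    as soon as the key changes.
    """
    for key, group in groupby(rows, key=itemgetter(0)):
        yield key, sum(value for _, value in group)

# Input that arrives grouped by key, e.g. straight out of an index scan.
rows = [("p1", 3), ("p1", 4), ("p2", 1), ("p3", 2), ("p3", 2)]
print(list(streaming_sum(rows)))
print(hash_sum(rows))
```

<p>With a very large number of groups, the streaming variant avoids building the hash table altogether, which is the situation a new query would have to create to exercise this feature.</p>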

<p>A BI use case should offer no problem with including arithmetic, but there are not that many numbers in the BSBM set.  Some code sections in the queries with conditional execution and costly tests inside <code>ANDs</code> and <code>ORs</code> would be good.  TPC-H has such in Q21 and Q19.  An <code>OR</code> with existences where there would be gain from good guesses of a subquery&#39;s selectivity would be appropriate.  Also, there should be conditional expressions somewhere with a lot of data, like the <code>CASE-WHEN</code> in TPC-H Q12.</p>

<p>We can make BSBM-BI more interesting by putting in the above.  Also we will have to see where we can profit from <code>HASH JOIN</code>, both small and large.  There should be such places in the workload already so this is a matter of just playing a bit more.</p>

<p>This post amounts to a cheat sheet for the BSBM-BI runs a bit farther down the road. By then we should be operational with the column store and Virtuoso 7 Cluster, though, so not everything is yet on the table.</p>



<h3>
<i>Benchmarks, Redux</i> Series</h3>
<ul>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1658" id="link-id0x1fd1d4e0">Benchmarks, Redux (part 1): On RDF Benchmarks</a>
</li>

<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x1d5b07d8">Benchmarks, Redux (part 2): A Benchmarking Story</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1663" id="link-id0x1dfe6c48">Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1665" id="link-id0x197fce30">Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1667" id="link-id0x1fbf4210">Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1669" id="link-id0x1beeb1e0">Benchmarks, Redux (part 6): BSBM and I/O, continued</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1671" id="link-id0x1d7e1818">Benchmarks, Redux (part 7): What Does BSBM Explore Measure?</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1673" id="link-id0x1dfc1730">Benchmarks, Redux (part 8): BSBM Explore and Update </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1675" id="link-id0x1ea819a8">Benchmarks, Redux (part 9): BSBM With Cluster</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1677" id="link-id0x1ec73da0">Benchmarks, Redux (part 10): LOD2 and the Benchmark Process</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1678" id="link-id0x1fbdce90">Benchmarks, Redux (part 11): The Substance of Benchmarks</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x19928618">Benchmarks, Redux (part 12): Our Own BSBM Results Report</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1f3d8710">Benchmarks, Redux (part 13): BSBM-BI Modifications </a>
</li>
<li>
Benchmarks, Redux (part 14): BSBM-BI Mix  <i>(this post)</i>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1e627400">Benchmarks, Redux (part 15): BSBM Test Driver Enhancements </a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-03-22#1686">
  <rss:title>Benchmarks, Redux (part 13): BSBM BI Modifications</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-03-22T22:30:44Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">In this post we introduce changes to the BSBM BI queries and metric. These changes are motivated by prevailing benchmark practice and by our experiences in optimizing for the BSBM BI workload. We will publish results according to the definitions given here and recommend that any interested parties do likewise. The rationales are given in the text. Query Mix We have removed Q4 from the mix because it is quadratic to the scale factor. The other queries are roughly n * log (n). Parameter Substitution All queries that take a product type as parameter are run in flights of several query invocations where the product type goes from broader to more specific. The initial product type specifies either the root product type or an immediate subtype of this, and the last in the drill-down is a leaf type. The rationale for this is that the choice of product type may make several orders of magnitude difference in the run time of a query. In order to make consecutive query mixes roughly comparable in execution time, all mixes should have a predictable number of query invocations with product types of each level. Query Order In the BI mix, when running multiple concurrent clients, each query mix is submitted in a random order. Queries which do drill-downs always have the steps of the drill-down as consecutive in the session, but the query templates are permuted. This is done so as to make less likely that there were two concurrent queries accessing exactly the same data. In this way, scans cannot be trivially shared between queries -- but there are still opportunities for reuse of results and adapting execution to working set, e.g., starting with what is in memory. Metrics We use a TPC-H-like metric. This metric consists of a single-user part and a multi-user part, called respectively Power and Throughput. The Power metric is a geometric mean of query run-time. 
The Throughput is the total run-time divided by the number of queries completed. After taking the mean, the time is converted into queries-per-hour. This time is then multiplied by the scale factor divided by the scale factor for 100 Mt. In other words, we consider the 100 Mt data set as the unit scale. The Power is defined as ( scale_factor / 284826 ) * 3600 / ( ( t1 * t2 * ... * tn ) ^ ( 1 / n ) ) The Throughput is defined as ( scale_factor / 284826 ) * 3600 / ( ( t1 + t2 + ... + tn ) / n ) The magic number 284826 is the scale that generates approximately 100 million triples (100 Mt). We consider this scale &quot;one&quot;. The reason for the multiplication is that scores at different scales should get similar numbers; otherwise 10x larger scale would result roughly in 10x lower throughput with the BI queries. The Composite metric is the geometric mean of the Power and Throughput metrics. A complete report shows both Power and Throughput metrics, as well as individual query times for all queries. The rationale for using a geometric mean is to give an equal importance to long and short queries. Halving the execution time of either a long query or a short query will have the same effect on the metric. This is good for encouraging research into all aspects of query processing. On the other hand, real-life users are more interested in halving the time of queries that take one hour than of queries that take one second; therefore, the throughput metric considers run times. Taking the geometric mean of the two metrics gives more weight to the lower of the two than an arithmetic mean, hence we pay more attention to the worse of the two. Single-user and multi-user metrics are separate because of the relative importance of intra-query parallelization in BI workloads: There may not be large numbers of concurrent users, yet queries are still complex, and it is important to have maximum parallelization. Therefore the metric rewards single-user performance. 
In the next post we will look at the use of this metric and the actual content of BSBM BI. Benchmarks, Redux Series Benchmarks, Redux (part 1): On RDF Benchmarks Benchmarks, Redux (part 2): A Benchmarking Story Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs Benchmarks, Redux (part 6): BSBM and I/O, continued Benchmarks, Redux (part 7): What Does BSBM Explore Measure? Benchmarks, Redux (part 8): BSBM Explore and Update Benchmarks, Redux (part 9): BSBM With Cluster Benchmarks, Redux (part 10): LOD2 and the Benchmark Process Benchmarks, Redux (part 11): The Substance of Benchmarks Benchmarks, Redux (part 12): Our Own BSBM Results Report Benchmarks, Redux (part 13): BSBM BI Modifications (this post) Benchmarks, Redux (part 14): BSBM BI Mix Benchmarks, Redux (part 15): BSBM Test Driver Enhancements</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[
<p>In this post we introduce changes to the <a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x234e0ca0">BSBM</a> BI queries and metric. These changes are motivated by prevailing benchmark practice and by our experiences in optimizing for the BSBM BI workload.</p>

<p>We will publish results according to the definitions given here and recommend that any interested parties do likewise.  The rationales are given in the text.</p>


<h3>Query Mix</h3>

<p>We have removed Q4 from the mix because it is quadratic to the scale factor.  The other queries are roughly <code>n * log (n)</code>.  </p>
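<p>A quick calculation shows why a quadratic query does not fit in a mix meant to scale. The sketch below compares how the two cost shapes grow under a 10x increase in scale; using the scale factor itself as <code>n</code> is an assumption made only to have a concrete number, and the helper names are ours.</p>

```python
import math

def growth(cost, n, factor=10):
    """How many times more work `cost` predicts when n grows by `factor`."""
    return cost(factor * n) / cost(n)

def quadratic(n):
    return n * n

def n_log_n(n):
    return n * math.log(n)

n = 284826  # assumed stand-in for the data size at the 100 Mt scale
print(f"quadratic: {growth(quadratic, n):.1f}x")
print(f"n log n:   {growth(n_log_n, n):.1f}x")
```

<p>The quadratic cost grows 100x for 10x the data, while <code>n * log (n)</code> grows only a little more than 10x, so a quadratic query would come to dominate the mix as the scale increases.</p>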


<h3>Parameter Substitution </h3>

<p>All queries that take a product type as parameter are run in flights of several query invocations where the product type goes from broader to more specific.  The initial product type specifies either the root product type or an immediate subtype of this, and the last in the drill-down is a leaf type.</p>

<p>The rationale for this is that the choice of product type may make several orders of magnitude difference in the run time of a query.  In order to make consecutive query mixes roughly comparable in execution time, all mixes should have a predictable number of query invocations with product types of each level.</p>
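<p>As an illustration, a flight of drill-down parameters could be generated along the following lines. This is a sketch, not the actual test driver code; the hierarchy, the type names, and the function <code>drill_down</code> are hypothetical.</p>

```python
import random

def drill_down(children, start, rng):
    """Return one product type per level, from `start` down to a leaf type."""
    path = [start]
    # Keep descending while the current type still has subtypes.
    while children.get(path[-1]):
        path.append(rng.choice(children[path[-1]]))
    return path

# Hypothetical product type hierarchy: parent -> list of immediate subtypes.
children = {
    "Root": ["Type1", "Type2"],
    "Type1": ["Type1a", "Type1b"],
    "Type2": ["Type2a"],
    "Type1a": ["Leaf1", "Leaf2"],
}

rng = random.Random(808080)  # arbitrary fixed seed, for repeatable runs
flight = drill_down(children, "Root", rng)
print(flight)
```

<p>Each element of <code>flight</code> would parameterize one invocation of the query, so every mix gets the same number of invocations at each level of the hierarchy only if the paths are of predictable depth.</p>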


<h3>Query Order </h3>

<p>In the BI mix, when running multiple concurrent clients, each query mix is submitted in a random order.  Queries which do drill-downs always have the steps of the drill-down as consecutive in the session, but the query templates are permuted.  This is done to make it less likely that two concurrent queries access exactly the same <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x23be8d28">data</a>.  In this way, scans cannot be trivially shared between queries -- but there are still opportunities for reuse of results and adapting execution to working set, e.g., starting with what is in memory.</p>
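<p>The ordering rule can be sketched as follows: permute the templates, but keep the steps of each drill-down together and in order. The template names, the <code>flights</code> mapping, and the function <code>session_order</code> are invented for this example and do not come from the test driver.</p>

```python
import random

def session_order(flights, rng):
    """Shuffle the query templates; keep each template's steps consecutive."""
    templates = list(flights)
    rng.shuffle(templates)
    return [(name, step) for name in templates for step in flights[name]]

# Hypothetical mix: template -> parameters of its invocations, in drill-down order.
flights = {
    "A": ["root", "mid", "leaf"],
    "B": ["single"],
    "C": ["root", "leaf"],
}

order = session_order(flights, random.Random(1))  # one client's session
for name, step in order:
    print(name, step)
```

<p>Giving each client its own random generator state yields a different permutation per client, which is what makes two clients unlikely to scan the same data at the same moment.</p>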


<h3>Metrics </h3>

<p>We use a <a class="auto-href" href="http://www.tpc.org/" id="link-id0x238c81a0">TPC</a>-<a class="auto-href" href="http://dbpedia.org/resource/TPC-H" id="link-id0x28c6bbd8">H</a>-like metric.  This metric consists of a single-user part and a multi-user part, called respectively <i>Power</i> and <i>Throughput.</i>  The <i>Power</i> metric is based on the geometric mean of the query run-times.  The <i>Throughput</i> metric is based on the arithmetic mean, i.e., the total run-time divided by the number of queries completed.  After taking the mean, the time is converted into queries per hour.  This rate is then multiplied by the ratio of the scale factor to the scale factor for 100 Mt. In other words, we consider the 100 Mt data set as the unit scale.</p>

<p>The <i>Power</i> is defined as</p>
<blockquote>( scale_factor / 284826 ) *  3600 / ( ( t1 * t2 * ... * tn ) ^ ( 1 / n ) ) </blockquote>
<p>The <i>Throughput</i> is defined as</p>
<blockquote>( scale_factor / 284826 ) *  3600 / ( ( t1 + t2 + ... + tn ) / n ) </blockquote>
<p>The magic number <b><code>284826</code></b> is the scale that generates approximately 100 million triples (100 Mt).  We consider this scale &quot;one&quot;.  The reason for the multiplication is that scores at different scales should get similar numbers; otherwise 10x larger scale would result roughly in 10x lower throughput with the BI queries.</p>
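<p>The two definitions translate directly into code. The following Python sketch follows the formulas exactly as written above, with run times in seconds; the names <code>UNIT_SCALE</code>, <code>power</code>, and <code>throughput</code> are ours, and how the test driver accounts for concurrent clients is not modeled here.</p>

```python
import math

UNIT_SCALE = 284826  # scale factor that generates approximately 100 Mt

def power(times, scale_factor):
    """Queries per hour from the geometric mean of run times, scaled."""
    geo_mean = math.exp(sum(math.log(t) for t in times) / len(times))
    return (scale_factor / UNIT_SCALE) * 3600 / geo_mean

def throughput(times, scale_factor):
    """Queries per hour from the arithmetic mean of run times, scaled."""
    mean = sum(times) / len(times)
    return (scale_factor / UNIT_SCALE) * 3600 / mean

# Made-up run times: one query of 1 second and one of 4 seconds.
times = [1.0, 4.0]
print(power(times, UNIT_SCALE), throughput(times, UNIT_SCALE))
```

<p>For these two made-up run times at the unit scale, the geometric mean is 2 seconds and the arithmetic mean is 2.5 seconds, so the Power comes out near 1800 and the Throughput near 1440. At ten times the scale factor, the same run times would score ten times higher.</p>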


<p>The <i>Composite</i> metric is the geometric mean of the <i>Power</i> and <i>Throughput</i> metrics.  A complete report shows both <i>Power</i> and <i>Throughput</i> metrics, as well as individual query times for all queries.  The rationale for using a geometric mean in the <i>Power</i> metric is to give equal importance to long and short queries.  Halving the execution time of either a long query or a short query will have the same effect on the metric.  This is good for encouraging research into all aspects of query processing.  On the other hand, real-life users are more interested in halving the time of queries that take one hour than of queries that take one second; therefore, the <i>Throughput</i> metric uses the arithmetic mean of the run times, in which long queries weigh more.</p>
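<p>A small numeric experiment makes the difference between the two means visible. The run times below are invented: one query of one second and one of one hour. The helper names are ours.</p>

```python
import math

def geo_mean(times):
    return math.exp(sum(math.log(t) for t in times) / len(times))

def arith_mean(times):
    return sum(times) / len(times)

base = [1.0, 3600.0]         # a one-second query and a one-hour query
halve_short = [0.5, 3600.0]  # the short query made twice as fast
halve_long = [1.0, 1800.0]   # the long query made twice as fast

for label, times in [("base", base), ("halve short", halve_short), ("halve long", halve_long)]:
    print(f"{label}: geometric {geo_mean(times):.2f} s, arithmetic {arith_mean(times):.2f} s")
```

<p>The geometric mean drops by the same factor whichever query is halved, whereas the arithmetic mean barely moves when the short query is halved and drops by about half when the long one is.</p>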

<p>Taking the geometric mean of the two metrics gives more weight to the lower of the two than an arithmetic mean would; hence we pay more attention to the worse of the two.</p>

<p>Single-user and multi-user metrics are separate because of the relative importance of intra-query parallelization in BI workloads: There may not be large numbers of concurrent users, yet queries are still complex, and it is important to have maximum parallelization. Therefore the metric rewards single-user performance.</p>


<p>In the next post we will look at the use of this metric and the actual content of BSBM BI.</p>



<h3>
<i>Benchmarks, Redux</i> Series</h3>
<ul>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1658" id="link-id0x1b02d528">Benchmarks, Redux (part 1): On RDF Benchmarks</a>
</li>

<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x1d65f740">Benchmarks, Redux (part 2): A Benchmarking Story</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1663" id="link-id0x1a797860">Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1665" id="link-id0x1d3538e0">Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1667" id="link-id0x1e566f60">Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1669" id="link-id0x1dedffd8">Benchmarks, Redux (part 6): BSBM and I/O, continued</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1671" id="link-id0x1eb11528">Benchmarks, Redux (part 7): What Does BSBM Explore Measure?</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1673" id="link-id0x1db46c38">Benchmarks, Redux (part 8): BSBM Explore and Update </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1675" id="link-id0x1c8174e8">Benchmarks, Redux (part 9): BSBM With Cluster</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1677" id="link-id0x1dfa9338">Benchmarks, Redux (part 10): LOD2 and the Benchmark Process</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1678" id="link-id0x1e6dd7b0">Benchmarks, Redux (part 11): The Substance of Benchmarks</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1d154bb0">Benchmarks, Redux (part 12): Our Own BSBM Results Report</a>
</li>
<li>
Benchmarks, Redux (part 13): BSBM BI Modifications <i>(this post)</i>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1f242ae0">Benchmarks, Redux (part 14): BSBM BI Mix </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1ebf2f98">Benchmarks, Redux (part 15): BSBM Test Driver Enhancements </a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-03-22#1685">
  <rss:title>Benchmarks, Redux (part 12): Our Own BSBM Results Report</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-03-22T22:29:56Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">This is a placeholder; it will be replaced with a complete report in the very near future.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>
<i>This is a placeholder; it will be replaced with a complete report in the very near future.</i>
</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-03-10#1680">
  <rss:title>Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-03-10T23:30:11Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Let us talk about what ought to be benchmarked in the context of RDF. A point that often gets brought up by RDF-ers when talking about benchmarks is that there already exist systems which perform very well at TPC-H and similar workloads, and therefore there is no need for RDF to go there. It is, as it were, somebody else&#39;s problem; besides, it is a solved one. On the other hand, being able to express what is generally expected of a query language might not be a core competence or a competitive edge, but it certainly is a checklist item. BSBM seems to be adopted as a de facto RDF benchmark, as there indeed is almost nothing else. But we should not lose sight of the fact that this is in fact a relational schema and workload that has just been straightforwardly transformed to RDF. BSBM was made, after all, in part for measuring RDB to RDF mapping. Thus BSBM is no more RDF-ish than a trivially RDF-ized TPC-H would be. TPC-H is however a bit more difficult if also a better thought out benchmark than the BSBM BI Mix proposal. But I do not expect an RDF audience to have any enthusiasm for this as this is indeed a very tough race by now, and besides one in which RDB and SQL will keep some advantage. However, using this as a validation test is meaningful, as there exists a validation dataset and queries that we already have RDF-ized. We could publish these and call this &quot;RDF-H&quot;. In the following I will outline what would constitute an RDF-friendly, scientifically interesting benchmark. The points are in part based on discussions with Peter Boncz of CWI. The Social Network Intelligence Benchmark (SNIB) takes the social web Facebook-style schema Ivan Mikhailov and I made last year under the name of Botnet BM. In LOD2, CWI is presently working on this. The data includes DBpedia as a base component used for providing conversation topics, information about geographical locales of simulated users, etc. 
DBpedia is not very large, around 200M-300M triples, but it is diverse enough. The data will have correlations, e.g., people who talk about sports tend to know other people who talk about the same sport, and they are more likely to know people from their geographical area than from elsewhere. The bulk of the data consists of a rich history of interactions including messages to individuals and groups, linking to people, dropping links, joining and leaving groups, and so forth. The messages are tagged using real-world concepts from DBpedia, and there is correlation between tagging and textual content since both are generated from DBpedia articles. Since there is such correlation, NLP techniques like entity and relationship extraction can be used with the data even though this is not the primary thrust of SNIB. There is variation in frequency of online interaction, and this interaction consists of sessions. For example, one could analyze user behavior per time of day for online ad placement. The data probably should include propagating memes, fashions, and trends that travel on the social network. With this, one could query about their origin and speed of propagation. There should probably be cases of duplicate identities in the data, i.e., one real person using many online accounts to push an agenda. Resolving duplicate identities makes for nice queries. Ragged data with half-filled profiles and misspelled identifiers like person and place names is a natural part of the social web use case. The data generator should take this into account. Distribution of popularity and activity should follow a power-law-like pattern; actual measures of popularity can be sampled from existing social networks even though large quantities of data cannot easily be extracted. The dataset should be predictably scalable. For the workload considered, the relative importance of the queries or other measured tasks should not change dramatically with the scale. 
For example, some queries are logarithmic to data size (e.g., find connections to a person), some are linear (e.g., find average online time of sports fans on Sundays), and some are quadratic or worse (e.g., find two extremists of the same ideology that are otherwise unrelated). Making a single metric from such parts may not be meaningful. Therefore, SNIB might be structured into different workloads. The first would be an online mix with typically short lookups and updates, around O ( log ( n ) ). The Business Intelligence Mix would be composed of queries around O ( n log ( n ) ). Even so, with real data, choice of parameters will provide dramatic changes in query run-time. Therefore a run should be specified to have a predictable distribution of &quot;hard&quot; and &quot;easy&quot; parameter choices. In the BSBM BI mix modification, I did this by defining some to be drill downs from a more general to a more specific level of a hierarchy. This could be done here too in some cases; other cases would have to be defined with buckets of values. Both the real world and LOD2 are largely concerned with data integration. The SNIB workload can have aspects of this, for example, in resolving duplicate identities. These operations are more complex than typical database queries, as the attributes used for joining might not even match in the initial data. One characteristic of these is the production of sometimes large intermediate results that need to be materialized. Doing these operations in practice requires procedural control. Further, running algorithms like network analytics (e.g., PageRank, centrality, etc.) involves aggregation of intermediate results that is not very well expressible in a query language. Some basic graph operations like shortest path are expressible, but not in unextended SPARQL 1.1, as these would, for example, involve returning paths, which are explicitly excluded from the spec. 
These are however the areas where we need to go for a benchmark that is more than a repackaging of a relational BI workload. We find that such a workload will have procedural sections either in application code or stored procedures. Map-reduce is sometimes used for scaling these. As one would expect, many cluster databases have their own version of these control structures. Therefore some of the SNIB workload could even be implemented as map-reduce jobs alongside parallel database implementations. We might here touch base with the LarKC map-reduce work to see if it could be applied to SNIB workloads. We see a three-level structure emerging. There is an Online mix which is a bit like the BSBM Explore mix, and an Analytics mix which is on the same order of complexity as TPC-H. These may have a more-or-less fixed query formulation and test driver. Beyond these, yet working on the same data, we have a set of Predefined Tasks which the test sponsor may implement in a manner of their choice. We would finally get to the &quot;raging conflict&quot; between the &quot;declarativists&quot; and the &quot;map reductionists.&quot; Last year&#39;s VLDB had a lot of map-reduce papers. I know of comparisons between Vertica and map reduce for doing a fairly simple SQL query on a lot of data, but here we would be talking about much more complex jobs on more interesting (i.e., less uniform) data. We might even interest some of the cluster RDBMS players (Teradata, Vertica, Greenplum, Oracle Exadata, ParAccel, and/or Aster Data, to name a few) in running this workload using their map-reduce analogs. We see that as we get to topics beyond relational BI, we do not find ourselves in an RDF-only world but very much at a crossroads of many technologies, e.g., map-reduce and its database analogs, various custom built databases, graph libraries, data integration and cleaning tools, and so forth. There is not, nor ought there to be, a sheltered, RDF-only enclave. 
RDF will have to justify itself in a world of alternatives. This must be reflected in our benchmark development, so relational BI is not irrelevant; in fact, it is what everybody does. RDF cannot be a total failure at this, even if this were not RDF&#39;s claim to fame. The claim to fame comes after we pass this stage, which is what we intend to explore in SNIB. Benchmarks, Redux Series Benchmarks, Redux (part 1): On RDF Benchmarks Benchmarks, Redux (part 2): A Benchmarking Story Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs Benchmarks, Redux (part 6): BSBM and I/O, continued Benchmarks, Redux (part 7): What Does BSBM Explore Measure? Benchmarks, Redux (part 8): BSBM Explore and Update Benchmarks, Redux (part 9): BSBM With Cluster Benchmarks, Redux (part 10): LOD2 and the Benchmark Process Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks (this post) Benchmarks, Redux (part 12): Our Own BSBM Results Report Benchmarks, Redux (part 13): BSBM BI Modifications Benchmarks, Redux (part 14): BSBM BI Mix Benchmarks, Redux (part 15): BSBM Test Driver Enhancements</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Let us talk about what ought to be benchmarked in the context of <a class="auto-href" href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x2a84d3c0">RDF</a>.</p>

<p>A point that often gets brought up by RDF-ers when talking about benchmarks is that there already exist systems which perform very well at <a class="auto-href" href="http://www.tpc.org/" id="link-id0x2a9758e8">TPC</a>-<a class="auto-href" href="http://dbpedia.org/resource/TPC-H" id="link-id0x2a8fa2a0">H</a> and similar workloads, and therefore there is no need for RDF to go there.  It is, as it were, somebody else&#39;s problem; besides, it is a solved one.</p>

<p>On the other hand, being able to express what is generally expected of a query language might not be a core competence or a competitive edge, but it certainly is a checklist item.</p>

<p>
<a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x29c75a30">BSBM</a> seems to be adopted as a de facto RDF benchmark, as there indeed is almost nothing else.  But we should not lose sight of the fact that this is a relational <a class="auto-href" href="http://dbpedia.org/resource/Database_schema" id="link-id0x2a0565b8">schema</a> and workload that has just been straightforwardly transformed to RDF.  BSBM was made, after all, in part for measuring RDB to RDF mapping.  Thus BSBM is no more RDF-ish than a trivially RDF-ized TPC-H would be.  TPC-H is, however, a more difficult and also a better thought-out benchmark than the BSBM BI Mix proposal.  But I do not expect an RDF audience to have any enthusiasm for TPC-H, as it is by now a very tough race, and besides one in which RDB and <a class="auto-href" href="http://dbpedia.org/resource/SQL" id="link-id0x29c44d50">SQL</a> will keep some advantage.  However, using it as a validation test is meaningful, as there exist a validation dataset and queries that we have already RDF-ized.  We could publish these and call this &quot;RDF-H&quot;.  </p>

<p>In the following I will outline what would constitute an RDF-friendly, scientifically interesting benchmark.  The points are in part based on discussions with <a class="auto-href" href="http://nl.linkedin.com/in/peterboncz" id="link-id0x2ac282f0">Peter Boncz</a> of <a class="auto-href" href="http://dbpedia.org/resource/National_Research_Institute_for_Mathematics_and_Computer_Science" id="link-id0x2a1c9e10">CWI</a>.</p>

<p>The <a class="auto-href" href="http://www.w3.org/wiki/Social_Network_Intelligence_BenchMark" id="link-id0x29e7d3d8">Social Network Intelligence Benchmark</a> (<a class="auto-href" href="http://www.w3.org/wiki/Social_Network_Intelligence_BenchMark" id="link-id0x2a70e3c0">SNIB</a>) takes the social web Facebook-style schema Ivan Mikhailov and I made last year under the name of Botnet BM.  In <a class="auto-href" href="http://lod2.eu/" id="link-id0x2a9a70f0">LOD2</a>, CWI is presently working on this.</p>

<p>The <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x2ad04408">data</a> includes <a class="auto-href" href="http://dbpedia.org/resource/DBpedia" id="link-id0x29d5eeb0">DBpedia</a> as a base component used for providing conversation topics, <a class="auto-href" href="http://dbpedia.org/resource/Information" id="link-id0x2ac97c40">information</a> about geographical locales of simulated users, etc.  DBpedia is not very large, around 200M-300M triples, but it is diverse enough.</p>

<p>The data will have correlations, e.g., people who talk about sports tend to know other people who talk about the same sport, and they are more likely to know people from their geographical area than from elsewhere.  </p>

<p>The bulk of the data consists of a rich history of interactions including messages to individuals and groups, linking to people, dropping links, joining and leaving groups, and so forth.  The messages are tagged using real-world concepts from DBpedia, and there is correlation between tagging and textual content since both are generated from DBpedia articles.  Since there is such correlation, <a class="auto-href" href="http://dbpedia.org/resource/Natural_language_processing" id="link-id0x2ac359c0">NLP</a> techniques like <a class="auto-href" href="http://dbpedia.org/resource/Entity" id="link-id0x2a1c8ed0">entity</a> and relationship extraction can be used with the data even though this is not the primary thrust of SNIB.</p>

<p>There is variation in frequency of online interaction, and this interaction consists of sessions.  For example, one could analyze user behavior per time of day for online ad placement.</p>
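
<p>As a minimal sketch of what "sessions" could mean for such an analysis, the snippet below splits one user's event times into sessions at idle gaps and counts session starts per hour of day.  The 30-minute gap, the function name, and the sample events are assumptions made for illustration; nothing here is prescribed by SNIB.</p>

```python
from datetime import datetime, timedelta

def sessionize(timestamps, gap=timedelta(minutes=30)):
    """Group one user's event times into sessions split at idle gaps.

    The 30-minute gap is an assumed convention, not a SNIB rule.
    """
    sessions = []
    for ts in sorted(timestamps):
        # Extend the current session if the idle time is within the gap.
        if sessions and gap >= ts - sessions[-1][-1]:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions

# Hypothetical events for a single user.
events = [datetime(2011, 3, 6, 9, 0), datetime(2011, 3, 6, 9, 10),
          datetime(2011, 3, 6, 20, 5), datetime(2011, 3, 6, 20, 20)]
sessions = sessionize(events)

# Count how many sessions start in each hour of the day.
starts_by_hour = {}
for session in sessions:
    hour = session[0].hour
    starts_by_hour[hour] = starts_by_hour.get(hour, 0) + 1
print(len(sessions), starts_by_hour)
```

<p>A data generator would work in the opposite direction, emitting events so that per-user session frequency and time-of-day patterns vary in a controlled way.</p>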

<p>The data probably should include propagating memes, fashions, and trends that travel on the social network.  With this, one could query about their origin and speed of propagation.</p>

<p>There should probably be cases of duplicate identities in the data, i.e., one real person using many online accounts to push an agenda. Resolving duplicate identities makes for nice queries.</p>

<p>Ragged data with half-filled profiles and misspelled identifiers like person and place names are a natural part of the social web use case. The data generator should take this into account.</p>

<ul>
<li>
  <p>Distribution of popularity and activity should follow a power-law-like pattern; actual measures of popularity can be sampled from existing social networks even though large quantities of data cannot easily be extracted.</p>
</li>

<li>
  <p>The dataset should be predictably scalable.  For the workload considered, the relative importance of the queries or other measured tasks should not change dramatically with the scale.</p>
</li>
</ul>

<p>For example, some queries are logarithmic in data size (e.g., find connections to a person), some are linear (e.g., find average online time of sports fans on Sundays), and some are quadratic or worse (e.g., find two extremists of the same ideology that are otherwise unrelated).  Making a single metric from such parts may not be meaningful.  Therefore, SNIB might be structured into different workloads.</p>
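
<p>To illustrate why a single metric over such parts may not be meaningful, here is a small sketch that computes each query class's share of total work at a few data sizes.  The unit costs (log n, n, n squared) are idealized and the class names are made up for the example; real constants would differ by system and query.</p>

```python
import math

def cost_shares(n):
    """Return each query class's share of total work at data size n.

    Unit costs are idealized (log n, n, n^2); real constants would differ.
    """
    costs = {
        "lookup (log n)": math.log2(n),
        "scan (n)": float(n),
        "pairwise (n^2)": float(n) ** 2,
    }
    total = sum(costs.values())
    return {name: cost / total for name, cost in costs.items()}

for n in (10, 1000, 100000):
    shares = cost_shares(n)
    print(n, {name: f"{share:.4%}" for name, share in shares.items()})
```

<p>Under these assumptions the quadratic class takes an ever larger share as the scale grows, so a combined score would increasingly measure only that class.</p>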

<p>The first would be an online mix with typically short lookups and updates, around <code>O ( log ( n ) )</code>.  </p>

<p>The Business Intelligence Mix would be composed of queries around <code>O ( n log ( n ) )</code>.  Even so, with real data, choice of parameters will provide dramatic changes in query run-time.  Therefore a run should be specified to have a predictable distribution of &quot;hard&quot; and &quot;easy&quot; parameter choices.  In the BSBM BI mix modification, I did this by defining some queries to be drill-downs from a more general to a more specific level of a hierarchy.  This could be done here too in some cases; other cases would have to be defined with buckets of values. </p>
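
<p>The "buckets of values" idea could look roughly like the following sketch: candidate parameter values are ordered by how many rows they select, split into buckets, and every run draws the same number of values from each bucket.  The function names, the bucket count, and the sample cardinalities are hypothetical.</p>

```python
import random

def bucket_parameters(cardinalities, n_buckets=3):
    """Split parameter values into buckets of roughly equal size.

    cardinalities maps a candidate parameter value to a (hypothetical)
    count of rows it selects; bucket 0 holds the "easiest" values.
    Returns at most n_buckets buckets.
    """
    ordered = sorted(cardinalities, key=cardinalities.get)
    size = -(-len(ordered) // n_buckets)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def draw_run(buckets, per_bucket, seed):
    """Pick the same number of parameters from every bucket for one run."""
    rng = random.Random(seed)
    return [rng.sample(bucket, per_bucket) for bucket in buckets]

# Hypothetical topic tags with the number of messages each one selects.
cards = {"chess": 12, "curling": 40, "golf": 900, "tennis": 1500,
         "football": 80000, "music": 120000}
buckets = bucket_parameters(cards)
print(draw_run(buckets, per_bucket=1, seed=7))
```

<p>Seeding the draw keeps runs repeatable, while the fixed per-bucket quota keeps the proportion of hard and easy parameter choices the same from run to run.</p>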

<p>Both the real world and LOD2 are largely concerned with data integration.  The SNIB workload can have aspects of this, for example, in resolving duplicate identities.  These operations are more complex than typical database queries, as the attributes used for joining might not even match in the initial data.</p>
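
<p>A minimal sketch of such a join on non-matching attributes, using Python's standard <code>difflib</code> for approximate string comparison.  The profile fields, the threshold, and the sample records are assumptions for illustration; a real workload would likely combine many more signals than name and city.</p>

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def candidate_duplicates(profiles, threshold=0.85):
    """Return pairs of account ids whose name and city both nearly match.

    This compares every pair, so it is quadratic in the number of profiles;
    the threshold is an arbitrary illustrative value.
    """
    pairs = []
    for (id_a, a), (id_b, b) in combinations(sorted(profiles.items()), 2):
        if (similarity(a["name"], b["name"]) >= threshold
                and similarity(a["city"], b["city"]) >= threshold):
            pairs.append((id_a, id_b))
    return pairs

# Hypothetical ragged profiles with misspelled person and place names.
profiles = {
    "u1": {"name": "Jonathan Smith", "city": "Amsterdam"},
    "u2": {"name": "Jonathon Smith", "city": "Amsterdm"},
    "u3": {"name": "Maria Garcia", "city": "Madrid"},
}
print(candidate_duplicates(profiles))
```

<p>The all-pairs comparison is exactly the kind of step that produces large intermediate results and does not map cleanly onto an equality join.</p>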

<p>One characteristic of these operations is the production of sometimes large intermediate results that need to be materialized.  Doing these operations in practice requires procedural control.  Further, running algorithms like network analytics (e.g., PageRank, centrality, etc.) involves aggregation of intermediate results that is not very well expressible in a query language.  Some basic graph operations like shortest path are expressible in principle, but not in unextended <a class="auto-href" href="http://dbpedia.org/resource/SPARQL" id="link-id0x29d26588">SPARQL</a> 1.1, as these would, for example, involve returning paths, which are explicitly excluded from the spec.</p>
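
<p>For comparison, this is roughly what returning a path looks like in procedural code: a plain breadth-first search over an in-memory adjacency list.  It is only a sketch; the <code>knows</code> edges are invented, and a real implementation would run against the store rather than a Python dictionary.</p>

```python
from collections import deque

def shortest_path(knows, start, goal):
    """Breadth-first search that returns the path itself, not just a yes/no.

    knows maps a person to the list of people they know (unweighted, directed).
    Returns a list of nodes from start to goal, or None if unreachable.
    """
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbor in knows.get(node, []):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return None

# Hypothetical "knows" edges.
knows = {"ann": ["bob", "cy"], "bob": ["dee"], "cy": ["dee"], "dee": ["eve"]}
print(shortest_path(knows, "ann", "eve"))
```

<p>The <code>parent</code> map is the intermediate state that a purely declarative query has no natural way to hand back to the caller.</p>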

<p>These are however the areas where we need to go for a benchmark that is more than a repackaging of a relational BI workload.</p>

<p>We find that such a workload will have procedural sections either in application code or stored procedures.  Map-reduce is sometimes used for scaling these.  As one would expect, many cluster databases have their own version of these control structures.  Therefore some of the SNIB workload could even be implemented as map-reduce jobs alongside parallel database implementations.  We might here touch base with the <a class="auto-href" href="http://www.larkc.eu/" id="link-id0x29b69640">LarKC</a> map-reduce work to see if it could be applied to SNIB workloads. </p>
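
<p>To make the map-reduce shape concrete, here is a single-process sketch that counts messages per tag with explicit map, shuffle, and reduce steps.  It stands in for what a framework or a cluster database's analog would distribute; the function names and sample messages are assumptions, not part of SNIB or LarKC.</p>

```python
from collections import defaultdict
from itertools import chain

def map_phase(message):
    """Emit one (tag, 1) pair per tag on a message."""
    return [(tag, 1) for tag in message["tags"]]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Sum the counts for one tag."""
    return key, sum(values)

# Hypothetical tagged messages.
messages = [
    {"id": 1, "tags": ["Football", "Berlin"]},
    {"id": 2, "tags": ["Football"]},
    {"id": 3, "tags": ["Chess", "Berlin"]},
]
pairs = chain.from_iterable(map_phase(m) for m in messages)
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(result)
```

<p>The same job is a one-line <code>GROUP BY</code> in a query language; the interesting SNIB tasks are those where the map and reduce steps carry logic that a query cannot express as easily.</p>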

<p>We see a three-level structure emerging.  There is an <i>Online</i> mix which is a bit like the BSBM <i>Explore</i> mix, and an <i>Analytics</i> mix which is on the same order of complexity as TPC-H.  These may have a more-or-less fixed query formulation and test driver.  Beyond these, yet working on the same data, we have a set of <i>Predefined Tasks</i> which the test sponsor may implement in a manner of their choice.</p>

<p>We would finally get to the &quot;raging conflict&quot; between the &quot;declarativists&quot; and the &quot;map reductionists.&quot;  Last year&#39;s VLDB had a lot of map-reduce papers.  I know of comparisons between <a class="auto-href" href="http://www.vertica.com/" id="link-id0x2a8c4510">Vertica</a> and map-reduce for doing a fairly simple SQL query on a lot of data, but here we would be talking about much more complex jobs on more interesting (i.e., less uniform) data.</p>

<p>We might even interest some of the cluster <a class="auto-href" href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x2995aaa8">RDBMS</a> players (<a class="auto-href" href="http://www.teradata.com/" id="link-id0x29c9af10">Teradata</a>, Vertica, <a class="auto-href" href="http://dbpedia.org/resource/Greenplum" id="link-id0x29c9af38">Greenplum</a>, <a class="auto-href" href="http://dbpedia.org/page/Oracle_Exadata" id="link-id0x29d48b78">Oracle Exadata</a>, <a class="auto-href" href="http://www.paraccel.com/" id="link-id0x29d48ba0">ParAccel</a>, and/or <a class="auto-href" href="http://www.asterdata.com/" id="link-id0x29bf8fb0">Aster Data</a>, to name a few) in running this workload using their map-reduce analogs.</p>


<p>We see that as we get to topics beyond relational BI, we do not find ourselves in an RDF-only world but very much at a crossroads of many technologies, e.g., map-reduce and its database analogs, various custom built databases, graph libraries, data integration and cleaning tools, and so forth.</p>

<p>There is not, nor ought there to be, a sheltered, RDF-only enclave.  RDF will have to justify itself in a world of alternatives.</p>

<p>This must be reflected in our benchmark development, so relational BI is not irrelevant; in fact, it is what everybody does.  RDF cannot be a total failure at this, even if this were not RDF&#39;s claim to fame. The claim to fame comes after we pass this stage, which is what we intend to explore in SNIB.</p>



<h3>
<i>Benchmarks, Redux</i> Series</h3>
<ul>
<li>  <a href="http://www.openlinksw.com/weblog/oerling/?id=1658" id="link-id0x1c9f7ab8">Benchmarks, Redux (part 1): On RDF Benchmarks</a>
</li>

<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x1dd17b28">Benchmarks, Redux (part 2): A Benchmarking Story</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1663" id="link-id0x1eb20620">Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1665" id="link-id0x1f8a5ae8">Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1667" id="link-id0x1ac14a08">Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1669" id="link-id0x1d1f8d58">Benchmarks, Redux (part 6): BSBM and I/O, continued</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1671" id="link-id0x1ea83308">Benchmarks, Redux (part 7): What Does BSBM Explore Measure?</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1673" id="link-id0x1b548028">Benchmarks, Redux (part 8): BSBM Explore and Update </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1675" id="link-id0x1c3d9c58">Benchmarks, Redux (part 9): BSBM With Cluster</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1677" id="link-id0x1f5e6978">Benchmarks, Redux (part 10): LOD2 and the Benchmark Process</a>
</li>
<li>
Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks <i>(this post)</i>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1c082a28">Benchmarks, Redux (part 12): Our Own BSBM Results Report</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1ec73578">Benchmarks, Redux (part 13): BSBM BI Modifications </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1eb25d48">Benchmarks, Redux (part 14): BSBM BI Mix </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1b261958">Benchmarks, Redux (part 15): BSBM Test Driver Enhancements </a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-03-10#1679">
  <rss:title>Benchmarks, Redux (part 10): LOD2 and the Benchmark Process</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-03-10T23:29:41Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">I have in the previous posts generally argued for and demonstrated the usefulness of benchmarks. Here I will talk about how this could be organized in a way that is tractable, and takes vendor and end user interests into account. These are my views on the subject and do not represent a LOD2 members consensus, but have been discussed in the consortium. My colleague Ivan Mikhailov once proposed that the only way to get benchmarks run right is to package them as a single script that does everything, like instant noodles -- just add water! But even instant noodles can be abused: Cook too long, add too much water, maybe forget to light the stove, and complain that the result is unsatisfyingly hard and brittle, lacking the suppleness one has grown to expect from this delicacy. No, the answer lies at the other end of the culinary spectrum, in gourmet cooking. Let the best cooks show what they can do, and let them work at it; let those who in fact have capacity and motivation for creating le chef d&#39;oeuvre culinaire (&quot;the culinary masterpiece&quot;) create it. Even so, there are many value points along the dimensions of preparation time, cost, and esthetic layout, not to forget taste and nutritional values. Indeed, an intimate knowledge de la vie secrete du canard (&quot;the secret life of duck&quot;) is required in order to liberate the aroma that it might take flight and soar. In the previous, I have shed some light on how we prepare le canard, and if le canard be such then la dinde (turkey) might in some ways be analogous; who is to say? In other words, as a vendor, we want to have complete control over the benchmarking process, and have it take place in our environment at a time of our choice. In exchange for this, we are ready to document and observe possibly complicated rules, document how the runs are made, and let others monitor and repeat them on the equipment on which the results are obtained. 
This is the TPC (Transaction Processing Performance Council) model. Another culture of doing benchmarks is the periodic challenge model used in TREC, the Billion Triples Challenge, the Semantic Search Challenge and others. In this model, vendors prepare the benchmark submission and agree to joint publication. A third party performing benchmarks by itself is uncommon in databases. Licenses even often explicitly prohibit this, for understandable reasons. The LOD2 project has an outreach activity called Publink where we offer to help owners of data to publish it as Linked Data. Similarly, since FP 7s are supposed to offer a visible service to their communities, I proposed that LOD2 offer to serve a role in disseminating and auditing RDF store benchmarks. One representative of an RDF store vendor I talked to, in relation to setting up a benchmark configuration of their product, told me that we could do this and that they would give some advice but that such an exercise was by its nature fundamentally flawed and could not possibly produce worthwhile results. The reason for this was that OpenLink engineers could not possibly learn enough about the other products nor unlearn enough of their own to make this a meaningful comparison. Isn&#39;t this the very truth? Let the chefs mix their own spices. This does not mean that there would not be comparability of results. If the benchmarks and processes are well defined, documented, and checked by a third party, these can be considered legitimate and not just one-off best-case results without further import. In order to stretch the envelope, which is very much a LOD2 goal, this benchmarking should be done on a variety of equipment -- whatever works best at the scale in question. Increasing the scale remains a stated objective. LOD2 even promised to run things with a trillion triples in another 3 years. Imagine that the unimpeachably impartial Berliners made house calls. Would this debase Justice to be a servant of mere show-off? 
Or would this on the contrary combine strict Justice with edifying Charity? Who indeed is in greater need of the light of objective evaluation than the vendor whose very nature makes a being of bias and prejudice? Even better, CWI, with its stellar database pedigree, agreed in principle to audit RDF benchmarks in LOD2. In this way one could get a stamp of approval for one&#39;s results regardless of when they were produced, and be free of the arbitrary schedule of third party benchmarking runs. On the relational side this is a process of some cost and complexity, but since the RDF side is still young and more on mutually friendly terms, the process can be somewhat lighter here. I did promise to draft some extra descriptions of process and result disclosure so that we could see how this goes. We could even do this unilaterally -- just publish Virtuoso results according to a predefined reporting and verification format. If others wished to publish by the same rules, LOD2 could use some of the benchmarking funds for auditing the proceedings. This could all take place over the net, so we are not talking about any huge cost or prohibitive amount of trouble. It would be in the FP7 spirit that LOD2 provide this service for free, naturally within reason. Then there is the matter of the BSBM Business Intelligence (BI) mix. At present, it seems everybody has chosen to defer the matter to another round of BSBM runs in the summer. This seems to fit the pattern of a public challenge with a few months given for contenders to prepare their submissions. Here we certainly should look at bigger scales and more diverse hardware than in the Berlin runs published this time around. The BI workload is in fact fairly cluster friendly, with big joins and aggregations that parallelize well. There it would definitely make sense to reserve an actual cluster, and have all contenders set up their gear on it. 
If all have access to the run environment and to monitoring tools, we can be reasonably sure that things will be done in a transparent manner. (I will talk about the BI mix in more detail in part 13 and part 14 of this series.) Once the BI mix has settled and there are a few interoperable implementations, likely in the summer, we could pass from the challenge model to a situation where vendors may publish results as they become available, with LOD2 offering its services for audit. Of course, this could be done even before then, but the content of the mix might not be settled. We likely need to check it on a few implementations first. For equipment, people can use their own, or LOD2 partners might on a case-by-case basis make some equipment available for running on the same hardware on which say the Virtuoso results were obtained. For example, FU Berlin could give people a login to get their recently published results fixed. Now this might or might not happen, so I will not hold my breath waiting for this but instead close with a proposal. As a unilateral diplomatic overture I put forth the following: If other vendors are interested in 1:1 comparison of their results with our publications, we can offer them a login to the same equipment. They can set up and tune their systems, and perform the runs. We will just watch. As an extra quid pro quo, they can try Virtuoso as configured for the results we have published, with the same data. Like this, both parties get to see the others&#39; technology with proper tuning and installation. What, if anything, is reported about this activity is up to the owner of the technology being tested. We will publish a set of benchmark rules that can serve as a guideline for mutually comparable reporting, but we cannot force anybody to use these. This all will function as a catalyst for technological advance, all to the ultimate benefit of the end user. 
If you wish to take advantage of this offer, you may contact Hugh Williams at OpenLink Software, and we will see how this can be arranged in practice. The next post will talk about the actual content of benchmarks. The milestone after this will be when we publish the measurement and reporting protocols. Benchmarks, Redux Series Benchmarks, Redux (part 1): On RDF Benchmarks Benchmarks, Redux (part 2): A Benchmarking Story Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs Benchmarks, Redux (part 6): BSBM and I/O, continued Benchmarks, Redux (part 7): What Does BSBM Explore Measure? Benchmarks, Redux (part 8): BSBM Explore and Update Benchmarks, Redux (part 9): BSBM With Cluster Benchmarks, Redux (part 10): LOD2 and the Benchmark Process (this post) Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks Benchmarks, Redux (part 12): Our Own BSBM Results Report Benchmarks, Redux (part 13): BSBM BI Modifications Benchmarks, Redux (part 14): BSBM BI Mix Benchmarks, Redux (part 15): BSBM Test Driver Enhancements</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>I have in the previous posts generally argued for and demonstrated the usefulness of benchmarks.</p>

<p>Here I will talk about how this could be organized in a way that is tractable, and takes vendor and end user interests into account. These are my views on the subject and do not represent a consensus of <a class="auto-href" href="http://lod2.eu/" id="link-id0x2acb0760">LOD2</a> members, but have been discussed in the consortium. </p>

<p>My colleague Ivan Mikhailov once proposed that the only way to get benchmarks run right is to package them as a single script that does everything, like instant noodles -- just add water!  But even instant noodles can be abused: Cook too long, add too much water, maybe forget to light the stove, and complain that the result is unsatisfyingly hard and brittle, lacking the suppleness one has grown to expect from this delicacy. No, the answer lies at the other end of the culinary spectrum, in gourmet cooking.  Let the best cooks show what they can do, and let them work at it; let those who in fact have capacity and motivation for creating <i>le chef d&#39;oeuvre culinaire</i> (&quot;the culinary masterpiece&quot;) create it.  Even so, there are many value points along the dimensions of preparation time, cost, and esthetic layout, not to forget taste and nutritional values.  Indeed, an intimate <a class="auto-href" href="http://dbpedia.org/resource/Knowledge" id="link-id0x2aca6a30">knowledge</a> <i>de la vie secrete du canard</i> (&quot;the secret life of duck&quot;) is required in order to liberate the aroma that it might take flight and soar.  In the previous posts, I have shed some light on how we prepare <i>le canard</i>, and if <i>le canard</i> be such then <i>la dinde</i> (turkey) might in some ways be analogous; who is to say?</p>

<p>In other words, as a vendor, we want to have complete control over the benchmarking process, and have it take place in our environment at a time of our choice.  In exchange for this, we are ready to document and observe possibly complicated rules, document how the runs are made, and let others monitor and repeat them on the equipment on which the results are obtained.  This is the <a class="auto-href" href="http://www.tpc.org/" id="link-id0x2b847818">TPC</a> (Transaction Processing Performance Council) model.</p>

<p>Another culture of doing benchmarks is the periodic challenge model used in TREC, the <a class="auto-href" href="http://challenge.semanticweb.org/" id="link-id0x2ac3a6f8">Billion Triples Challenge</a>, the Semantic Search
Challenge and others. In this model, vendors prepare the benchmark submission and agree to joint publication.</p>

<p>A third party performing benchmarks by itself is uncommon in databases.  Licenses even often explicitly prohibit this, for understandable reasons.</p>

<p>The LOD2 project has an outreach activity called Publink where we offer to help owners of <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x2aea5930">data</a> to publish it as <a class="auto-href" href="http://dbpedia.org/resource/Linked_Data" id="link-id0x2a790128">Linked Data</a>. Similarly, since FP7 projects are supposed to offer a visible service to their communities, I proposed that LOD2 offer to serve a role in disseminating and auditing <a class="auto-href" href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x29babb00">RDF</a> store benchmarks.</p>

<p>One representative of an RDF store vendor I talked to, in relation to setting up a benchmark configuration of their product, told me that we could do this and that they would give some advice but that such an exercise was by its nature fundamentally flawed and could not possibly produce worthwhile results.  The reason for this was that OpenLink engineers could not possibly learn enough about the other products nor unlearn enough of their own to make this a meaningful comparison.</p>

<p>Isn&#39;t this the very truth?  Let the chefs mix their own spices.</p>

<p>This does not mean that there would not be comparability of results. If the benchmarks and processes are well defined, documented, and checked by a third party, these can be considered legitimate and not just one-off best-case results without further import.</p>

<p>In order to stretch the envelope, which is very much a LOD2 goal, this benchmarking should be done on a variety of equipment -- whatever works best at the scale in question.  Increasing the scale remains a stated objective.  LOD2 even promised to run things with a trillion triples in another 3 years.  </p>

<p>Imagine that the unimpeachably impartial Berliners made house calls. Would this debase Justice to be a servant of mere show-off?  Or would this on the contrary combine strict Justice with edifying Charity?  Who indeed is in greater need of the light of objective evaluation than the vendor whose very nature makes it a being of bias and prejudice?</p>

<p>Even better, <a class="auto-href" href="http://dbpedia.org/resource/National_Research_Institute_for_Mathematics_and_Computer_Science" id="link-id0x2a21d108">CWI</a>, with its <a href="http://monetdb.cwi.nl/Development/Research/Articles/" id="link-id0x1d6479d0">stellar database pedigree</a>, agreed in principle to audit RDF benchmarks in LOD2. </p>

<p>In this way one could get a stamp of approval for one&#39;s results regardless of when they were produced, and be free of the arbitrary schedule of third party benchmarking runs.  On the relational side this is a process of some cost and complexity, but since the RDF side is still young and more on mutually friendly terms, the process can be somewhat lighter here.  I did promise to draft some extra descriptions of process and result disclosure so that we could see how this goes.</p>

<p>We could even do this unilaterally -- just publish <a class="auto-href" href="http://virtuoso.openlinksw.com" id="link-id0x2a0d73d8">Virtuoso</a> results according to a predefined reporting and verification format.  If others wished to publish by the same rules, LOD2 could use some of the benchmarking funds for auditing the proceedings.  This could all take place over the net, so we are not talking about any huge cost or prohibitive amount of trouble.  It would be in the FP7 spirit that LOD2 provide this service for free, naturally within reason.</p>

<p>Then there is the matter of the <a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x2a1722a8">BSBM</a> Business Intelligence (BI) mix.  At present, it seems everybody has chosen to defer the matter to another round of BSBM runs in the summer.  This seems to fit the pattern of a public challenge with a few months given for contenders to prepare their submissions.  Here we certainly should look at bigger scales and more diverse hardware than in the Berlin runs published this time around.  The BI workload is in fact fairly cluster friendly, with big joins and aggregations that parallelize well.  There it would definitely make sense to reserve an actual cluster, and have all contenders set up their gear on it.  If all have access to the run environment and to monitoring tools, we can be reasonably sure that things will be done in a transparent manner.  </p>

<p>(I will talk about the BI mix in more detail in <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1dfcc038">part 13</a> and <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1edaa388">part 14</a> of this series.)</p>

<p>Once the BI mix has settled and there are a few interoperable implementations, likely in the summer, we could pass from the challenge model to a situation where vendors may publish results as they become available, with LOD2 offering its services for audit. </p>

<p>Of course, this could be done even before then, but the content of the mix might not be settled.  We likely need to check it on a few implementations first.</p>

<p>For equipment, people can use their own, or LOD2 partners might on a case-by-case basis make some equipment available for running on the same hardware on which, say, the Virtuoso results were obtained.  For example, FU Berlin could give people a login to get their recently published results fixed.  Now this might or might not happen, so I will not hold my breath waiting for this but instead close with a proposal.</p>

<p>As a unilateral diplomatic overture I put forth the following: If other vendors are interested in 1:1 comparison of their results with our publications, we can offer them a login to the same equipment.  They can set up and tune their systems, and perform the runs.  We will just watch.  As an extra quid pro quo, they can try Virtuoso as configured for the results we have published, with the same data.  This way, both parties get to see the other&#39;s technology with proper tuning and installation.  What, if anything, is reported about this activity is up to the owner of the technology being tested.  We will publish a set of benchmark rules that can serve as a guideline for mutually comparable reporting, but we cannot force anybody to use these.  This all will function as a catalyst for technological advance, all to the ultimate benefit of the end user.  If you wish to take advantage of this offer, you may contact <a href="mailto:hwilliams@openlinksw.com?subject=Collaborative RDF Benchmark" id="link-id0x1c071100">Hugh Williams</a> at OpenLink Software, and we will see how this can be arranged in practice.
</p>

<p>The next post will talk about the <a href="http://www.openlinksw.com/weblog/oerling/?id=1678" id="link-id0x19933fd8">actual content of benchmarks</a>.  The milestone after this will be when we publish the measurement and reporting protocols.</p>


<h3>
<i>Benchmarks, Redux</i> Series</h3>
<ul>
<li>  <a href="http://www.openlinksw.com/weblog/oerling/?id=1658" id="link-id0x1c554800">Benchmarks, Redux (part 1): On RDF Benchmarks</a>
</li>

<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x1ec159e8">Benchmarks, Redux (part 2): A Benchmarking Story</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1663" id="link-id0x1dd5eb10">Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1665" id="link-id0x18f05940">Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1667" id="link-id0x1ed5ef10">Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1669" id="link-id0x1e9cb130">Benchmarks, Redux (part 6): BSBM and I/O, continued</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1671" id="link-id0x1dfa79d8">Benchmarks, Redux (part 7): What Does BSBM Explore Measure?</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1673" id="link-id0x1eb6f478">Benchmarks, Redux (part 8): BSBM Explore and Update </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1675" id="link-id0x1de5a918">Benchmarks, Redux (part 9): BSBM With Cluster</a>
</li>
<li>
Benchmarks, Redux (part 10): LOD2 and the Benchmark Process <i>(this post)</i>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1678" id="link-id0x1dae9060">Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1f45fa10">Benchmarks, Redux (part 12): Our Own BSBM Results Report</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1f49d2b8">Benchmarks, Redux (part 13): BSBM BI Modifications </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1e68e4c8">Benchmarks, Redux (part 14): BSBM BI Mix </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1e353858">Benchmarks, Redux (part 15): BSBM Test Driver Enhancements </a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-03-09#1676">
  <rss:title>Benchmarks, Redux (part 9): BSBM With Cluster</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-03-09T22:54:50Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">This post is dedicated to our brothers in horizontal partitioning (or sharding), Garlik and Bigdata. At first sight, the BSBM Explore mix appears very cluster-unfriendly, as it contains short queries that access data at random. There is every opportunity for latency and few opportunities for parallelism. For this reason we had not even run the BSBM mix with Virtuoso Cluster. We were not surprised to learn that Garlik hadn&#39;t run BSBM either. We have understood from Systap that their Bigdata BSBM experiments were on a single-process configuration. But the 4Store results in the recent Berlin report were with a distributed setup, as 4Store always runs a multiprocess configuration, even on a single server, so it seemed interesting to us to compare how Virtuoso Cluster compares with Virtuoso Single with this workload. These tests were run on a different box than the recent BSBM tests, so those 4Store figures are not directly comparable. The setup here consists of 8 partitions, each managed by its own process, all running on the same box. Any of these processes can have its HTTP and SQL listener and can provide the same service. Most access to data goes over the interconnect, except when the data is co-resident in the process which is coordinating the query. The interconnect is Unix domain sockets since all 8 processes are on the same box. 6 Cluster - Load Rates and Times Scale Rate (quads per second) Load time (seconds) Checkpoint time (seconds) 100 Mt 119,204 749 89 200 Mt 121,607 1486 157 1000 Mt 102,694 8737 979 6 Single - Load Rates and Times Scale Rate (quads per second) Load time (seconds) Checkpoint time (seconds) 100 Mt 74,713 1192 145 The load times are systematically better than for 6 Single. This is also not bad compared to the 7 Single vectored load rates of 220 Kt/s or so. 
We note that loading is a cluster friendly operation, going at a steady 1400+% CPU utilization with an aggregate message throughput of 40MB/s. 7 Single is faster because of vectoring at the index level, not because the clusters were hitting communication overheads. 6 Cluster is faster than 6 Single because scale-out in this case diminishes contention, even on a single box. Throughput is as follows: 6 Cluster - Throughput (QMpH, query mixes per hour) Scale Single User 16 User 100 Mt 7318 43120 200 Mt 6222 29981 1000 Mt 2526 11156 6 Single - Throughput (QMpH, query mixes per hour) Scale Single User 16 User 100 Mt 7641 29433 200 Mt 6017 13335 1000 Mt 1770 2487 Below is a snapshot of status during the 6 Cluster 100 Mt run. Cluster 8 nodes, 15 s. 25784 m/s 25682 KB/s 1160% cpu 0% read 740% clw threads 18r 0w 10i buffers 1133459 12 d 4 w 0 pfs cl 1: 10851 m/s 3911 KB/s 597% cpu 0% read 668% clw threads 17r 0w 10i buffers 143992 4 d 0 w 0 pfs cl 2: 2194 m/s 7959 KB/s 107% cpu 0% read 9% clw threads 1r 0w 0i buffers 143616 3 d 2 w 0 pfs cl 3: 2186 m/s 7818 KB/s 107% cpu 0% read 9% clw threads 0r 0w 0i buffers 140787 0 d 0 w 0 pfs cl 4: 2174 m/s 2804 KB/s 77% cpu 0% read 10% clw threads 0r 0w 0i buffers 140654 0 d 2 w 0 pfs cl 5: 2127 m/s 1612 KB/s 71% cpu 0% read 9% clw threads 0r 0w 0i buffers 140949 1 d 0 w 0 pfs cl 6: 2060 m/s 544 KB/s 66% cpu 0% read 10% clw threads 0r 0w 0i buffers 141295 2 d 0 w 0 pfs cl 7: 2072 m/s 517 KB/s 65% cpu 0% read 11% clw threads 0r 0w 0i buffers 141111 1 d 0 w 0 pfs cl 8: 2105 m/s 522 KB/s 66% cpu 0% read 10% clw threads 0r 0w 0i buffers 141055 1 d 0 w 0 pfs The main meters for cluster execution are the messages-per-second (m/s), the message volume (KB/s), and the total CPU% of the processes. We note that CPU utilization is highly uneven and messages are short, about 1K on the average, compared to about 100K during the load. CPU would be evenly divided between the nodes if each got a share of the HTTP requests. 
We changed the test driver to round-robin requests between multiple end points. The work does then get evenly divided, but the speed is not affected. Also, this does not improve the message sizes since the workload consists mostly of short lookups. However, with the processes spread over multiple servers, the round-robin would be essential for CPU and especially for interconnect throughput. Then we try 6 Cluster at 1000 Mt. For Single User, we get 1180 m/s, 6955 KB/s, and 173% cpu. For 16 User, this is 6573 m/s, 44366 KB/s, 1470% cpu. This is a lot better than the figures with 6 Single, due to lower contention on the index tree, as discussed in A Benchmarking Story. Also Single User throughput on 6 Cluster outperforms 6 Single, due to the natural parallelism of doing the Q5 joins in parallel in each partition. The larger the scale, the more weight this has in the metric. We see this also in the average message size, i.e., the KB/s throughput is almost double while the messages/s is a bit under a third. The small-scale 6 Cluster run is about even with the 6 Single figure. Looking at the details, we see that the qps for Q1 in 6 Cluster is half of that on 6 Single, whereas the qps for Q5 on 6 Cluster is about double that of the 6 Single. This is as one might expect; longer queries are favored, and single row lookups are penalized. Looking further at the 6 Cluster status we see the cluster wait (clw) to be 740%. For 16 Users, this means that about half of the execution real time is spent waiting for responses from other partitions. A high figure means uneven distribution between partitions; a low figure means even. This is as expected, since many queries are concerned with just one S and its related objects. We will update this section once 7 Cluster is ready. This will implement vectored execution and column store inside the cluster nodes. 
Benchmarks, Redux Series Benchmarks, Redux (part 1): On RDF Benchmarks Benchmarks, Redux (part 2): A Benchmarking Story Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs Benchmarks, Redux (part 6): BSBM and I/O, continued Benchmarks, Redux (part 7): What Does BSBM Explore Measure? Benchmarks, Redux (part 8): BSBM Explore and Update Benchmarks, Redux (part 9): BSBM With Cluster (this post) Benchmarks, Redux (part 10): LOD2 and the Benchmark Process Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks Benchmarks, Redux (part 12): Our Own BSBM Results Report Benchmarks, Redux (part 13): BSBM BI Modifications Benchmarks, Redux (part 14): BSBM BI Mix Benchmarks, Redux (part 15): BSBM Test Driver Enhancements</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>This post is dedicated to our brothers in horizontal partitioning (or sharding), <a class="auto-href" href="http://freebase.com/guid/9202a8c04000641f8000000005c908d6" id="link-id0x2a1e9010">Garlik</a> and <a class="auto-href" href="http://www.systap.com/bigdata.htm" id="link-id0x2acd5218">Bigdata</a>.</p>

<p>At first sight, the <a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x2bb33648">BSBM</a> <i>Explore</i> mix appears very cluster-unfriendly, as it contains short queries that access <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x2b8fffb8">data</a> at random. There is every opportunity for latency and few opportunities for parallelism.</p>

<p>For this reason we had not even run the BSBM mix with <a class="auto-href" href="http://virtuoso.openlinksw.com" id="link-id0x2a84b780">Virtuoso</a> Cluster. We were not surprised to learn that <a href="http://steveharris.tumblr.com/post/3453040647/bsbm-v3-post-mortem" id="link-id0x1c4ef8d8">Garlik hadn&#39;t run BSBM either</a>. We have understood from <a class="auto-href" href="http://www.systap.com/" id="link-id0x2ad3d050">Systap</a> that their Bigdata BSBM experiments were on a single-process configuration.</p>

<p>But the 4Store results in the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/V6/index.html" id="link-id0x1f8090f8">recent Berlin report</a> were obtained with a distributed setup, as 4Store always runs a multiprocess configuration, even on a single server. It therefore seemed interesting to see how Virtuoso Cluster compares with Virtuoso Single on this workload. These tests were run on a different box than the recent BSBM tests, so those 4Store figures are not directly comparable.</p>

<p>The setup here consists of 8 partitions, each managed by its own process, all running on the same box. Any of these processes can have its <a class="auto-href" href="http://dbpedia.org/resource/Hypertext_Transfer_Protocol" id="link-id0x2ac28380">HTTP</a> and <a class="auto-href" href="http://dbpedia.org/resource/SQL" id="link-id0x2bba8720">SQL</a> listener and can provide the same service. Most access to data goes over the interconnect, except when the data is co-resident in the process which is coordinating the query. The interconnect is Unix domain sockets since all 8 processes are on the same box.</p>

<table border="1" cellspacing="2" cellpadding="2" align="center" width="90%">
	<tr>
		<th colspan="4" align="center">6 Cluster - Load Rates and Times</th>
	</tr>
	<tr>
		<th align="center">Scale</th>
		<th align="center">Rate <br /> (quads per second)</th>
		<th align="center">Load time <br /> (seconds)</th>
		<th align="center">Checkpoint time <br /> (seconds)</th>
	</tr>
	<tr>
		<th align="center">100 Mt</th>
		<td align="center"> 119,204 </td>
		<td align="center"> 749 </td>
		<td align="center"> 89 </td>
	</tr>
	<tr>
		<th align="center">200 Mt</th>
		<td align="center"> 121,607 </td>
		<td align="center"> 1486 </td>
		<td align="center"> 157 </td>
	</tr>
	<tr>
		<th align="center">1000 Mt</th>
		<td align="center"> 102,694 </td>
		<td align="center"> 8737 </td>
		<td align="center"> 979 </td>
	</tr>
</table>
<br />
<table border="1" cellspacing="2" cellpadding="2" align="center" width="90%">
	<tr>
		<th colspan="4" align="center">6 Single - Load Rates and Times</th>
	</tr>
	<tr>
		<th align="center">Scale</th>
		<th align="center">Rate <br /> (quads per second)</th>
		<th align="center">Load time <br /> (seconds)</th>
		<th align="center">Checkpoint time <br /> (seconds)</th>
	</tr>
	<tr>
		<th align="center">100 Mt</th>
		<td align="center"> 74,713 </td>
		<td align="center"> 1192 </td>
		<td align="center"> 145 </td>
	</tr>
</table>



<p>The load times are systematically better than for 6 Single. This is also not bad compared to the 7 Single vectored load rates of 220 Kt/s or so. We note that loading is a cluster-friendly operation, going at a steady 1400+% <a class="auto-href" href="http://dbpedia.org/resource/Central_processing_unit" id="link-id0x296b03b8">CPU</a> utilization with an aggregate message throughput of 40 MB/s. 7 Single is faster because of vectoring at the index level, not because the cluster was hitting communication overheads. 6 Cluster is faster than 6 Single because scale-out in this case diminishes contention, even on a single box.</p>
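<p>As a rough cross-check of the load tables above, the following sketch recomputes the rates from the published times. It assumes that "Mt" means millions of triples and that each dataset is exactly its nominal size; neither is stated here, and the names in the code are hypothetical. Under those assumptions, dividing the nominal size by load time plus checkpoint time lands within a fraction of a percent of each published rate, and the published 100 Mt rates put 6 Cluster at about 1.6 times 6 Single.</p>

```python
# Published figures from the load tables above: (quads/s, load s, checkpoint s).
# Assumption: "Mt" is millions of triples and each scale is exactly nominal.
LOADS = {
    ("6 Cluster", 100): (119_204, 749, 89),
    ("6 Cluster", 200): (121_607, 1486, 157),
    ("6 Cluster", 1000): (102_694, 8737, 979),
    ("6 Single", 100): (74_713, 1192, 145),
}


def implied_rate(scale_mt, load_s, checkpoint_s):
    """Nominal triple count divided by load time plus checkpoint time."""
    return scale_mt * 1_000_000 / (load_s + checkpoint_s)


def relative_gap(published, implied):
    """Relative difference between the published and the implied rate."""
    return abs(implied - published) / published


for (system, scale), (rate, load_s, ckpt_s) in LOADS.items():
    implied = implied_rate(scale, load_s, ckpt_s)
    print(f"{system} {scale} Mt: published {rate}, implied {implied:.0f}, "
          f"gap {relative_gap(rate, implied):.2%}")

# Ratio of the published 100 Mt load rates, cluster over single.
cluster_vs_single = LOADS[("6 Cluster", 100)][0] / LOADS[("6 Single", 100)][0]
print(f"6 Cluster / 6 Single load rate at 100 Mt: {cluster_vs_single:.2f}")
```

<p>The small remaining gaps would be explained if the actual datasets are slightly smaller than their nominal sizes, but that is a guess and not something the tables confirm.</p>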

<p>Throughput is as follows:</p>

<table border="1" cellspacing="2" cellpadding="2" align="center" width="90%">
	<tr>
		<th colspan="3" align="center"> 6 Cluster - Throughput <br /> (QMpH, query mixes per hour) </th>
	</tr>
	<tr>
		<th align="center">Scale</th>
		<th align="center"> Single User </th>
		<th align="center"> 16 User </th>
	</tr>
	<tr>
		<th align="center">100 Mt</th>
		<td align="center"> 7318 </td>
		<td align="center"> 43120 </td>
	</tr>
	<tr>
		<th align="center">200 Mt</th>
		<td align="center"> 6222 </td>
		<td align="center"> 29981 </td>
	</tr>
	<tr>
		<th align="center">1000 Mt</th>
		<td align="center"> 2526 </td>
		<td align="center"> 11156 </td>
	</tr>
</table>
<br />
<table border="1" cellspacing="2" cellpadding="2" align="center" width="90%">
	<tr>
		<th colspan="3" align="center"> 6 Single - Throughput <br /> (QMpH, query mixes per hour) </th>
	</tr>
	<tr>
		<th align="center">Scale</th>
		<th align="center"> Single User </th>
		<th align="center"> 16 User </th>
	</tr>
	<tr>
		<th align="center">100 Mt</th>
		<td align="center"> 7641 </td>
		<td align="center"> 29433 </td>
	</tr>
	<tr>
		<th align="center">200 Mt</th>
		<td align="center"> 6017 </td>
		<td align="center"> 13335 </td>
	</tr>
	<tr>
		<th align="center">1000 Mt</th>
		<td align="center"> 1770 </td>
		<td align="center"> 2487 </td>
	</tr>
</table>


<p>Below is a snapshot of status during the 6 Cluster 100 Mt run.</p>

<blockquote>
 <code><pre>
Cluster 8 nodes, 15 s.
       25784 m/s  25682 KB/s  1160% cpu  0% read  740% clw  threads 18r 0w 10i  buffers 1133459  12 d  4 w  0 pfs
cl 1:  10851 m/s   3911 KB/s   597% cpu  0% read  668% clw  threads 17r 0w 10i  buffers  143992   4 d  0 w  0 pfs
cl 2:   2194 m/s   7959 KB/s   107% cpu  0% read    9% clw  threads  1r 0w  0i  buffers  143616   3 d  2 w  0 pfs
cl 3:   2186 m/s   7818 KB/s   107% cpu  0% read    9% clw  threads  0r 0w  0i  buffers  140787   0 d  0 w  0 pfs
cl 4:   2174 m/s   2804 KB/s    77% cpu  0% read   10% clw  threads  0r 0w  0i  buffers  140654   0 d  2 w  0 pfs
cl 5:   2127 m/s   1612 KB/s    71% cpu  0% read    9% clw  threads  0r 0w  0i  buffers  140949   1 d  0 w  0 pfs
cl 6:   2060 m/s    544 KB/s    66% cpu  0% read   10% clw  threads  0r 0w  0i  buffers  141295   2 d  0 w  0 pfs
cl 7:   2072 m/s    517 KB/s    65% cpu  0% read   11% clw  threads  0r 0w  0i  buffers  141111   1 d  0 w  0 pfs
cl 8:   2105 m/s    522 KB/s    66% cpu  0% read   10% clw  threads  0r 0w  0i  buffers  141055   1 d  0 w  0 pfs
</pre>
 </code>
</blockquote>


<p>The main meters for cluster execution are the messages-per-second (m/s), the message volume (KB/s), and the total CPU% of the processes. </p>

<p>We note that CPU utilization is highly uneven and messages are short, about 1 KB on average, compared to about 100 KB during the load. CPU would be evenly divided between the nodes if each got a share of the HTTP requests. We changed the test driver to round-robin requests between multiple endpoints. The work then does get evenly divided, but the speed is not affected. Nor does this improve the message sizes, since the workload consists mostly of short lookups. However, with the processes spread over multiple servers, round-robin would be essential for CPU and especially for interconnect throughput.</p>
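<p>The two observations above can be read straight off the status snapshot. The sketch below, with hypothetical names, divides the total message volume by the total message rate to get the average message size, and sums the per-node CPU figures to see how much of the work lands on cl 1, which appears to be the node receiving the requests in this run.</p>

```python
# Totals and per-node CPU% copied from the status snapshot above.
TOTAL_MSGS_PER_S = 25_784
TOTAL_KB_PER_S = 25_682
NODE_CPU_PCT = [597, 107, 107, 77, 71, 66, 65, 66]  # cl 1 .. cl 8

# Average message size: volume divided by message count over the same interval.
avg_msg_kb = TOTAL_KB_PER_S / TOTAL_MSGS_PER_S

# Share of the summed per-node CPU that is spent on cl 1.
total_cpu_pct = sum(NODE_CPU_PCT)
cl1_cpu_share = NODE_CPU_PCT[0] / total_cpu_pct

print(f"average message size: {avg_msg_kb:.2f} KB")
print(f"sum of per-node CPU: {total_cpu_pct}%")
print(f"cl 1 share of CPU: {cl1_cpu_share:.0%}")
```

<p>The per-node figures sum to 1156%, slightly under the reported 1160% total, presumably because of rounding in the per-node lines.</p>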


<p>Then we try 6 Cluster at 1000 Mt. For Single User, we get 1180 m/s, 6955 KB/s, and 173% cpu. For 16 User, this is 6573 m/s, 44366 KB/s, 1470% cpu.</p>

<p>This is a lot better than the figures with 6 Single, due to lower contention on the index tree, as discussed in <i><a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x1e9a0b58">A Benchmarking Story</a></i>. Single User throughput on 6 Cluster also outperforms 6 Single, because the Q5 joins run in parallel in each partition. The larger the scale, the more weight this has in the metric. We see this also in the average message size: compared with the 100 Mt snapshot above, the KB/s throughput is almost double while the messages/s is a bit under a third.</p>
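<p>To make the message-size comparison concrete, this sketch sets the 16 User interconnect figures at 1000 Mt against the totals in the 100 Mt status snapshot. It assumes the snapshot is the intended baseline for "almost double" and "a bit under a third"; the names are hypothetical.</p>

```python
# 16 User interconnect totals: 100 Mt from the snapshot, 1000 Mt from the text.
RUN_100MT = {"msgs_per_s": 25_784, "kb_per_s": 25_682}
RUN_1000MT = {"msgs_per_s": 6_573, "kb_per_s": 44_366}


def avg_message_kb(run):
    """Average message size in KB for one run."""
    return run["kb_per_s"] / run["msgs_per_s"]


volume_ratio = RUN_1000MT["kb_per_s"] / RUN_100MT["kb_per_s"]
message_ratio = RUN_1000MT["msgs_per_s"] / RUN_100MT["msgs_per_s"]

print(f"KB/s, 1000 Mt over 100 Mt: {volume_ratio:.2f}")
print(f"messages/s, 1000 Mt over 100 Mt: {message_ratio:.2f}")
print(f"average message at 100 Mt: {avg_message_kb(RUN_100MT):.2f} KB")
print(f"average message at 1000 Mt: {avg_message_kb(RUN_1000MT):.2f} KB")
```

<p>Under that assumption the volume grows by roughly 1.7 times while the message count falls to about a quarter, so the average message goes from about 1 KB to nearly 7 KB.</p>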


<p>The small-scale 6 Cluster run is about even with the 6 Single figure. Looking at the details, we see that the qps for Q1 in 6 Cluster is half of that on 6 Single, whereas the qps for Q5 on 6 Cluster is about double that of the 6 Single. This is as one might expect; longer queries are favored, and single row lookups are penalized.</p>
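<p>The scale effect is easier to see as ratios. The sketch below divides each 6 Cluster throughput figure by the matching 6 Single figure from the two tables above; the variable names are hypothetical and nothing beyond the published QMpH values goes in.</p>

```python
# QMpH from the throughput tables above: scale in Mt -> (Single User, 16 User).
QMPH = {
    "6 Cluster": {100: (7318, 43120), 200: (6222, 29981), 1000: (2526, 11156)},
    "6 Single": {100: (7641, 29433), 200: (6017, 13335), 1000: (1770, 2487)},
}

# Cluster over single, per scale and per user count.
ratios = {}
for scale, (cluster_1, cluster_16) in QMPH["6 Cluster"].items():
    single_1, single_16 = QMPH["6 Single"][scale]
    ratios[scale] = (cluster_1 / single_1, cluster_16 / single_16)

for scale, (one_user, sixteen_users) in ratios.items():
    print(f"{scale} Mt: Single User x{one_user:.2f}, 16 User x{sixteen_users:.2f}")
```

<p>At 100 Mt the Single User ratio is just under 1, which is the "about even" case, while at 1000 Mt the cluster is roughly 1.4 times faster for Single User and roughly 4.5 times faster for 16 User.</p>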

<p>Looking further at the 6 Cluster status, we see the cluster wait (<code>clw</code>) to be 740%. For 16 Users, this means that about half of the execution real time is spent waiting for responses from other partitions. A high figure means uneven distribution of work between partitions; a low figure means even distribution. This is as expected, since many queries are concerned with just one subject (S) and its related objects.</p>
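<p>The "about half" figure follows from simple arithmetic. The sketch below assumes that each of the 16 concurrent users corresponds to one executing thread whose real time counts as 100%, so that 1600% is the ceiling for <code>clw</code>. That reading is an assumption, and the names are hypothetical.</p>

```python
# Cluster wait figures from the 100 Mt status snapshot.
TOTAL_CLW_PCT = 740
CL1_CLW_PCT = 668
CONCURRENT_USERS = 16

# Assumption: one executing thread per user, each contributing up to 100%.
max_clw_pct = CONCURRENT_USERS * 100
wait_fraction = TOTAL_CLW_PCT / max_clw_pct

# Share of the total wait that is accounted to cl 1.
cl1_wait_share = CL1_CLW_PCT / TOTAL_CLW_PCT

print(f"fraction of real time spent in cluster wait: {wait_fraction:.0%}")
print(f"share of that wait on cl 1: {cl1_wait_share:.0%}")
```

<p>This gives about 46% of real time in cluster wait, with about 90% of the wait accounted to cl 1.</p>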


<p>We will update this section once 7 Cluster is ready. This will implement vectored execution and column store inside the cluster nodes.</p>



<h3>
<i>Benchmarks, Redux</i> Series</h3>
<ul>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1658" id="link-id0x1d7894d0">Benchmarks, Redux (part 1): On RDF Benchmarks</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x1e434888">Benchmarks, Redux (part 2): A Benchmarking Story</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1663" id="link-id0x1f6b5260">Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1665" id="link-id0x1dd29460">Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1667" id="link-id0x1f0d78b8">Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs </a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1669" id="link-id0x1f9a9670">Benchmarks, Redux (part 6): BSBM and I/O, continued</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1671" id="link-id0x1c055370">Benchmarks, Redux (part 7): What Does BSBM Explore Measure?</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1673" id="link-id0x1dc06cd0">Benchmarks, Redux (part 8): BSBM Explore and Update </a>
</li>
<li>
Benchmarks, Redux (part 9): BSBM With Cluster <i>(this post)</i>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1677" id="link-id0x18f04db0">Benchmarks, Redux (part 10): LOD2 and the Benchmark Process</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1678" id="link-id0x1ee729b8">Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1e2e76b8">Benchmarks, Redux (part 12): Our Own BSBM Results Report</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1d75ef48">Benchmarks, Redux (part 13): BSBM BI Modifications </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1ee518c0">Benchmarks, Redux (part 14): BSBM BI Mix </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1d9244b0">Benchmarks, Redux (part 15): BSBM Test Driver Enhancements </a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-03-09#1674">
  <rss:title>Benchmarks, Redux (part 8): BSBM Explore and Update </rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-03-09T17:32:47Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We will here look at the Explore and Update scenario of BSBM. This presents us with a novel problem as the specification does not address any aspect of ACID. A transaction benchmark ought to have something to say about this. The SPARUL (also known as SPARQL/Update) language does not say anything about transactionality, but I suppose it is in the spirit of the SPARUL protocol to promise atomicity and durability. We begin by running Virtuoso 7 Single, with Single User and 16 User, each at scales of 100 Mt, 200 Mt, and 1000 Mt. The transactionality is default, meaning SERIALIZABLE isolation between INSERTs and DELETEs, and READ COMMITTED isolation between READ and any UPDATE transaction. (Figures for Virtuoso 6 will also be presented here in the near future, as they are the currently shipping production versions.) Virtuoso 7 Single, Full ACID (QMpH, query mixes per hour) Scale Single User 16 User 100 Mt 9,969 65,537 200 Mt 8,646 40,527 1000 Mt 5,512 17,293 Virtuoso 6 Cluster, Full ACID (QMpH, query mixes per hour) Scale Single User 16 User 100 Mt 5604.520 34079.019 1000 Mt 2866.616 10028.325 Virtuoso 6 Single, Full ACID (QMpH, query mixes per hour) Scale Single User 16 User 100 Mt 7,152 21,065 200 Mt 5,862 16,895 1000 Mt 1,542 4,548 Each run is preceded by a warm-up of 500 or 300 mixes (the exact number is not material), resulting in a warm cache; see previous post on read-ahead for details. All runs do 1000 Explore and Update mixes. The initial database is in the state following the Explore only runs. The results are in line with the Explore results. There is a fair amount of variability between consecutive runs; the 16 User run at 1000 Mt varies between 14K and 19K QMpH depending on the measurement. The smaller runs exhibit less variability. In the following we will look at transactions and at how the definition of the workload and reporting could be made complete. 
Full ACID means serializable semantic of concurrent insert and delete of the same quad. Non-transactional means that on concurrent insert and delete of overlapping sets of quads the result is undefined. Further if one logged such &quot;transactions,&quot; the replay would give serialization although the initial execution did not, hence further confusing the issue. Considering the hypothetical use case of an e-commerce information portal, there is little chance of deletes and inserts actually needing serialization. An insert-only workload does not need serializability because an insert cannot fail. If the data already exists the insert does nothing, if the quad does not previously exist it is created. The same applies to deletes alone. If a delete and insert overlap, serialization would be needed but the semantics implicit in the use case make this improbable. Read-only transactions (i.e., the Explore mix in the Explore and Update scenario) will be run as READ COMMITTED. These do not see uncommitted data and never block for lock wait. The reads may not be repeatable. Our first point of call is to determine the cost of ACID. We run 1000 mixes of Explore and Update at 1000 Mt. The throughput is 19214 after a warm-up of 500 mixes. This is pretty good in comparison with the diverse read-only results at this scale. We look at the pertinent statistics: SELECT TOP 5 * FROM sys_l_stat ORDER BY waits DESC; KEY_TABLE INDEX_NAME LOCKS WAITS WAIT_PCT DEADLOCKS LOCK_ESC WAIT_MSECS =============== ============= ====== ===== ======== ========= ======== ========== DB.DBA.RDF_QUAD RDF_QUAD_POGS 179205 934 0 0 0 35164 DB.DBA.RDF_IRI RDF_IRI 20752 217 1 0 0 16445 DB.DBA.RDF_QUAD RDF_QUAD_SP 9244 3 0 0 0 235 We see 934 waits with a total duration of 35 seconds on the index with the most contention. The run was 187 seconds, real time. The lock wait time is not real time since this is the total elapsed wait time summed over all threads. 
The lock wait frequency is a little over one per query mix, meaning a little over one per five locking transactions. We note that we do not get deadlocks since all inserts and deletes are in ascending key order due to vectoring. This guarantees the absence of deadlocks for single insert transactions, as long as the transaction stays within the vector size. This is always the case since the inserts are a few hundred triples at the maximum. The waits concentrate on POGS, because this is a bitmap index where the locking resolution is less than a row, and the values do not correlate with insert order. The locking behavior could be better with the column store, where we would have row level locking also for this index. This is to be seen. The column store would otherwise tend to have higher cost per random insert. Considering these results it does not seem crucial to &quot;drop ACID,&quot; though doing so would save some time. We will now run measurements for all scales with 16 Users and ACID. 
Let us now see what the benchmark writes: SELECT TOP 10 * FROM sys_d_stat ORDER BY n_dirty DESC; KEY_TABLE INDEX_NAME TOUCHES READS READ_PCT N_DIRTY N_BUFFERS =========================== ============================ ========= ======= ======== ======= ========= DB.DBA.RDF_QUAD RDF_QUAD_POGS 763846891 237436 0 58040 228606 DB.DBA.RDF_QUAD RDF_QUAD 213282706 1991836 0 30226 1940280 DB.DBA.RDF_OBJ RO_VAL 15474 17837 115 13438 17431 DB.DBA.RO_START RO_START 10573 11195 105 10228 11227 DB.DBA.RDF_IRI RDF_IRI 61902 125711 203 7705 121300 DB.DBA.RDF_OBJ RDF_OBJ 23809053 3205963 13 636 3072517 DB.DBA.RDF_IRI DB_DBA_RDF_IRI_UNQC_RI_ID 3237687 504486 15 340 488797 DB.DBA.RDF_QUAD RDF_QUAD_SP 89995 70446 78 99 68340 DB.DBA.RDF_QUAD RDF_QUAD_OP 19440 47541 244 66 45583 DB.DBA.VTLOG_DB_DBA_RDF_OBJ VTLOG_DB_DBA_RDF_OBJ 3014 1 0 11 11 DB.DBA.RDF_QUAD RDF_QUAD_GS 1261 801 63 10 751 DB.DBA.RDF_PREFIX RDF_PREFIX 14 168 1120 1 153 DB.DBA.RDF_PREFIX DB_DBA_RDF_PREFIX_UNQC_RP_ID 1807 200 11 1 200 The most dirty pages are on the POGS index, which is reasonable; values are spread out at random. After this we have the PSOG index, likely because of random deletes. New IRIs tend to get consecutive numbers and do not make many dirty pages. Literals come next, with the index from leading string or hash of the literal to id leading, as one would expect, again because of values being distributed at random. After this come IRIs. The distribution of updates is generally as one would expect. * * * Going back to BSBM, at least the following aspects of the benchmark have to be further specified: Disclosure of ACID properties. If the benchmark required full ACID many would not run this at all. Besides full ACID is not necessarily an absolute requirement based on the hypothetical usage scenario of the benchmark. However, when publishing numbers the guarantees that go with the numbers must be made explicit. This includes logging, checkpoint frequency or equivalent etc. Steady state. 
The working set of the Update mix is different from that of the Explore mixes. This touches more indices than Explore. The Explore warm-up is in part good but does not represent steady state. Checkpoint and sustained throughput. Benchmarks involving update generally have rules for checkpointing the state and for sustained throughput. In specific, the throughput of an update benchmark cannot rely on never flushing to persistent storage. Even bulk load must be timed with a checkpoint guaranteeing durability at the end. A steady update stream should be timed with a test interval of sufficient length involving a few checkpoints; for example, a minimum duration of 30 minutes with no less than 3 completed checkpoints in the interval with at least 9 minutes between the end of one and the start of the next. Not all DBMSs work with logs and checkpoints, but if an alternate scheme is used then this needs to be described. Memory and warm-up issues.We have seen the test data generator run out of memory when trying to generate update streams of meaningful length. Also the test driver should allow running updates in timed and non-timed mode (warm-up). With an update benchmark, many more things need to be defined, and the set-up becomes more system specific, than with a read-only workload. We will address these shortcomings in the measurement rules proposal to come. Especially with update workloads, the vendors need to provide tuning expertise; however, this will not happen if the benchmark does not properly set the expectations. If benchmarks serve as a catalyst for clearly defining how things are to be set up, then they will have served the end user. 
Benchmarks, Redux Series Benchmarks, Redux (part 1): On RDF Benchmarks Benchmarks, Redux (part 2): A Benchmarking Story Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs Benchmarks, Redux (part 6): BSBM and I/O, continued Benchmarks, Redux (part 7): What Does BSBM Explore Measure? Benchmarks, Redux (part 8): BSBM Explore and Update (this post) Benchmarks, Redux (part 9): BSBM With Cluster Benchmarks, Redux (part 10): LOD2 and the Benchmark Process Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks Benchmarks, Redux (part 12): Our Own BSBM Results Report Benchmarks, Redux (part 13): BSBM BI Modifications Benchmarks, Redux (part 14): BSBM BI Mix Benchmarks, Redux (part 15): BSBM Test Driver Enhancements</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We will here look at the <i>Explore and Update</i> scenario of <a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x1c064218">BSBM</a>. This presents us with a novel problem as the specification does not address any aspect of <a class="auto-href" href="http://dbpedia.org/resource/ACID" id="link-id0x1c1852b0">ACID</a>.</p>

<p>A transaction benchmark ought to have something to say about this. The <a class="auto-href" href="http://dbpedia.org/page/SPARUL" id="link-id0x1dbca228">SPARUL</a> (also known as <a class="auto-href" href="http://dbpedia.org/resource/SPARQL" id="link-id0x1eaa4fd0">SPARQL</a>/<a class="auto-href" href="http://dbpedia.org/page/SPARUL" id="link-id0x1dd12bb0">Update</a>) language does not say anything about transactionality, but I suppose it is in the spirit of the SPARUL protocol to promise atomicity and durability.</p>

<p>We begin by running <a class="auto-href" href="http://virtuoso.openlinksw.com" id="link-id0x1c5f4830">Virtuoso</a> 7 Single, with Single User and 16 User, each at scales of 100 Mt, 200 Mt, and 1000 Mt. The transactionality is default, meaning <code>SERIALIZABLE</code> isolation between <code>INSERTs</code> and <code>DELETEs</code>, and <code>READ COMMITTED</code> isolation between <code>READ</code> and any <code>UPDATE</code> transaction. (Figures for Virtuoso 6 will also be presented here in the near future, as it is the currently shipping production version.)</p>


<table border="1" cellspacing="2" cellpadding="2" align="center" width="90%">
	<tr>
		<th colspan="3" align="center"> Virtuoso 7 Single, Full ACID <br /> (QMpH, query mixes per hour) </th>
	</tr>
	<tr>
		<th align="center">Scale</th>
		<th align="center"> Single User </th>
		<th align="center"> 16 User </th>
	</tr>
	<tr>
		<th align="center">100 Mt</th>
		<td align="center"> 9,969 </td>
		<td align="center"> 65,537 </td>
	</tr>
	<tr>
		<th align="center">200 Mt</th>
		<td align="center"> 8,646 </td>
		<td align="center"> 40,527 </td>
	</tr>
	<tr>
		<th align="center">1000 Mt</th>
		<td align="center"> 5,512 </td>
		<td align="center"> 17,293 </td>
	</tr>
</table>
<br />
<table border="1" cellspacing="2" cellpadding="2" align="center" width="90%">
	<tr>
		<th colspan="3" align="center"> Virtuoso 6 Cluster, Full ACID <br /> (QMpH, query mixes per hour) </th>
	</tr>
	<tr>
		<th align="center"> Scale </th>
		<th align="center"> Single User </th>
		<th align="center"> 16 User </th>
	</tr>
	<tr>
		<th align="center"> 100 Mt </th>
		<td align="center"> 5,604.520 </td>
		<td align="center"> 34,079.019 </td>
	</tr>
	<tr>
		<th align="center"> 1000 Mt </th>
		<td align="center"> 2,866.616 </td>
		<td align="center"> 10,028.325 </td>
	</tr>
</table>
<br />
<table border="1" cellspacing="2" cellpadding="2" align="center" width="90%">
	<tr>
		<th colspan="3" align="center"> Virtuoso 6 Single, Full ACID <br /> (QMpH, query mixes per hour) </th>
	</tr>
	<tr>
		<th align="center">Scale</th>
		<th align="center"> Single User </th>
		<th align="center"> 16 User </th>
	</tr>
	<tr>
		<th align="center">100 Mt</th>
		<td align="center"> 7,152 </td>
		<td align="center"> 21,065 </td>
	</tr>
	<tr>
		<th align="center">200 Mt</th>
		<td align="center"> 5,862 </td>
		<td align="center"> 16,895 </td>
	</tr>
	<tr>
		<th align="center">1000 Mt</th>
		<td align="center"> 1,542 </td>
		<td align="center"> 4,548 </td>
	</tr>
</table>



<p>Each run is preceded by a warm-up of 500 or 300 mixes (the exact number is not material), resulting in a warm <a class="auto-href" href="http://dbpedia.org/resource/Cache" id="link-id0x1d4f13d8">cache</a>; see the <a href="http://www.openlinksw.com/weblog/oerling/?id=1669" id="link-id0x1f8ac510">previous post on read-ahead</a> for details. All runs do 1000 <i>Explore and Update</i> mixes. The initial database is in the state left by the <i>Explore</i>-only runs.</p>

<p>The results are in line with the <i>Explore</i> results. There is a fair amount of variability between consecutive runs; the 16 User run at 1000 Mt varies between 14K and 19K QMpH depending on the measurement. The smaller runs exhibit less variability.</p>

<p>In the following we will look at transactions and at how the definition of the workload and reporting could be made complete.</p>


<p>Full ACID means serializable semantics for concurrent inserts and deletes of the same quad. Non-transactional means that the result of concurrent inserts and deletes of overlapping sets of quads is undefined. Further, if one logged such &quot;transactions,&quot; the replay would yield a serialized outcome although the initial execution did not, confusing the issue even more. In the hypothetical use case of an e-commerce information portal, there is little chance of deletes and inserts actually needing serialization. An insert-only workload does not need serializability because an insert cannot fail: if the <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x1ec05c10">data</a> already exists, the insert does nothing; if the quad does not yet exist, it is created. The same applies to deletes alone. If a delete and an insert overlap, serialization would be needed, but the semantics implicit in the use case make this improbable.</p>


<p>Read-only transactions (i.e., the <i>Explore</i> mix in the <i>Explore and Update</i> scenario) will be run as <code>READ COMMITTED</code>. These do not see uncommitted data and never block for lock wait. The reads may not be repeatable.</p>

<p>Our first port of call is to determine the cost of ACID. We run 1000 mixes of <i>Explore and Update</i> at 1000 Mt. The throughput is 19214 QMpH after a warm-up of 500 mixes. This is pretty good in comparison with the diverse read-only results at this scale.</p>

<p>We look at the pertinent statistics:</p>

<pre>
SELECT TOP 5 * FROM sys_l_stat ORDER BY waits DESC;
</pre>

<blockquote>
 <code><pre>
KEY_TABLE         INDEX_NAME       LOCKS   WAITS   WAIT_PCT   DEADLOCKS   LOCK_ESC   WAIT_MSECS
===============   =============   ======   =====   ========   =========   ========   ==========
DB.DBA.RDF_QUAD   RDF_QUAD_POGS   179205     934          0           0          0        35164
DB.DBA.RDF_IRI    RDF_IRI          20752     217          1           0          0        16445
DB.DBA.RDF_QUAD   RDF_QUAD_SP       9244       3          0           0          0          235
</pre>
 </code>
</blockquote>

<p>We see 934 waits with a total duration of 35 seconds on the index with the most contention. The run took 187 seconds of real time. The lock wait time is not directly comparable to real time, since it is the elapsed wait time summed over all threads. The lock wait frequency is a little over one per query mix, meaning a little over one per five locking transactions.</p>
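
<p>As a rough cross-check of these figures, the arithmetic can be sketched in Python. The numbers are copied from the <code>sys_l_stat</code> output and the text above; the client count of 16 is an assumption, since this particular run is not explicitly labeled, and the variable names are only illustrative.</p>

<pre>
# Figures copied from the sys_l_stat output and the surrounding text.
waits = {"RDF_QUAD_POGS": 934, "RDF_IRI": 217, "RDF_QUAD_SP": 3}
wait_msecs = {"RDF_QUAD_POGS": 35164, "RDF_IRI": 16445, "RDF_QUAD_SP": 235}
mixes = 1000          # query mixes in the timed run
run_seconds = 187     # real time of the run, as stated in the text
clients = 16          # ASSUMPTION: the run used 16 concurrent users

total_waits = sum(waits.values())
total_wait_seconds = sum(wait_msecs.values()) / 1000.0

waits_per_mix = total_waits / mixes
avg_wait_ms = wait_msecs["RDF_QUAD_POGS"] / waits["RDF_QUAD_POGS"]
# Wait time is summed over threads, so compare it with thread-seconds.
wait_share = total_wait_seconds / (clients * run_seconds)
# Throughput in query mixes per hour, from the rounded run time.
qmph = mixes * 3600 / run_seconds

print(f"waits per mix: {waits_per_mix:.2f}")
print(f"average POGS wait: {avg_wait_ms:.1f} ms")
print(f"share of thread time spent waiting: {wait_share:.1%}")
print(f"throughput from rounded run time: {qmph:.0f} QMpH")
</pre>

<p>Under these assumptions the sketch gives about 1.15 waits per mix, about 37.6 ms per wait on <code>POGS</code>, and under 2% of total thread time spent in lock waits. The throughput computed from the rounded 187-second run time comes out near 19,250 QMpH, in line with the reported 19214.</p>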

<p>We note that we do not get deadlocks, since all inserts and deletes are in ascending key order due to vectoring. This guarantees the absence of deadlocks for single insert transactions, as long as the transaction stays within the vector size. This is always the case, since the inserts are a few hundred triples at most. The waits concentrate on <code>POGS</code>, because this is a bitmap index where the locking resolution is less than a row, and the values do not correlate with insert order. The locking behavior could be better with the column store, where we would have row-level locking for this index as well. This remains to be seen. The column store would otherwise tend to have a higher cost per random insert.</p>

<p>Considering these results it does not seem crucial to &quot;drop ACID,&quot; though doing so would save <i>some</i> time. We will now run measurements for all scales with 16 Users and ACID. </p>

<p>Let us now see what the benchmark writes:</p>

<pre>
SELECT TOP 10 * FROM sys_d_stat ORDER BY n_dirty DESC;
</pre>

<blockquote>
 <code><pre>
KEY_TABLE                     INDEX_NAME                       TOUCHES     READS   READ_PCT   N_DIRTY   N_BUFFERS
===========================   ============================   =========   =======   ========   =======   =========
DB.DBA.RDF_QUAD               RDF_QUAD_POGS                  763846891    237436          0     58040      228606
DB.DBA.RDF_QUAD               RDF_QUAD                       213282706   1991836          0     30226     1940280
DB.DBA.RDF_OBJ                RO_VAL                             15474     17837        115     13438       17431
DB.DBA.RO_START               RO_START                           10573     11195        105     10228       11227
DB.DBA.RDF_IRI                RDF_IRI                            61902    125711        203      7705      121300
DB.DBA.RDF_OBJ                RDF_OBJ                         23809053   3205963         13       636     3072517
DB.DBA.RDF_IRI                DB_DBA_RDF_IRI_UNQC_RI_ID        3237687    504486         15       340      488797
DB.DBA.RDF_QUAD               RDF_QUAD_SP                        89995     70446         78        99       68340
DB.DBA.RDF_QUAD               RDF_QUAD_OP                        19440     47541        244        66       45583
DB.DBA.VTLOG_DB_DBA_RDF_OBJ   VTLOG_DB_DBA_RDF_OBJ                3014         1          0        11          11
DB.DBA.RDF_QUAD               RDF_QUAD_GS                         1261       801         63        10         751
DB.DBA.RDF_PREFIX             RDF_PREFIX                            14       168       1120         1         153
DB.DBA.RDF_PREFIX             DB_DBA_RDF_PREFIX_UNQC_RP_ID        1807       200         11         1         200
</pre>
 </code>
</blockquote>


<p>Most of the dirty pages are on the <code>POGS</code> index, which is reasonable; values are spread out at random. After this we have the <code>PSOG</code> index (<code>RDF_QUAD</code> in the table), likely because of random deletes. New IRIs tend to get consecutive numbers and do not make many dirty pages. Literals come next; among these, the index from the leading string or hash of the literal to its id (<code>RO_VAL</code> in the table) has the most dirty pages, as one would expect, again because the values are distributed at random. After this come IRIs. The distribution of updates is generally as one would expect.</p>

<p align="center">* * *</p>

<p>Going back to BSBM, at least the following aspects of the benchmark have to be further specified:</p>

<ul>
<li>
  <p>
    <b>Disclosure of ACID properties.</b> If the benchmark required full ACID, many would not run it at all. Besides, full ACID is not necessarily an absolute requirement given the hypothetical usage scenario of the benchmark. However, when publishing numbers, the guarantees that go with the numbers must be made explicit. This includes logging, checkpoint frequency or its equivalent, etc.</p>
</li>

<li>
  <p>
    <b>Steady state.</b> The working set of the <i>Update</i> mix is different from that of the <i>Explore</i> mixes. The <i>Update</i> mix touches more indices than <i>Explore</i>. The <i>Explore</i> warm-up helps in part but does not represent steady state.</p>
</li>

<li>
  <p>
    <b>Checkpoint and sustained throughput.</b> Benchmarks involving updates generally have rules for checkpointing the state and for sustained throughput. Specifically, the throughput of an update benchmark cannot rely on never flushing to persistent storage. Even bulk load must be timed with a checkpoint guaranteeing durability at the end. A steady update stream should be timed with a test interval of sufficient length involving a few checkpoints; for example, a minimum duration of 30 minutes with no fewer than 3 completed checkpoints in the interval and at least 9 minutes between the end of one and the start of the next. Not all DBMSs work with logs and checkpoints, but if an alternate scheme is used then this needs to be described.</p>
</li>

<li>
  <p>
    <b>Memory and warm-up issues.</b> We have seen the test data generator run out of memory when trying to generate update streams of meaningful length. Also, the test driver should allow running updates in both timed and non-timed (warm-up) mode.</p>
</li>
</ul>
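
<p>To make the checkpoint example in the list above concrete, the following is a minimal Python sketch of how a test driver might validate a timed run against it. The function name, the argument layout, and the decision to count only checkpoints that both start and finish inside the interval are our assumptions for illustration; they are not part of BSBM or of any existing driver.</p>

<pre>
def run_is_valid(run_start, run_end, checkpoints,
                 min_minutes=30, min_checkpoints=3, min_gap_minutes=9):
    """Return True if a timed run satisfies the example rule in the text.

    All times are minutes from an arbitrary origin; `checkpoints` is a list
    of (start, end) pairs. Names and data layout are hypothetical.
    """
    if min_minutes > run_end - run_start:
        return False
    # ASSUMPTION: only checkpoints that start and complete inside the
    # timed interval count toward the minimum.
    inside = sorted(c for c in checkpoints
                    if c[0] >= run_start and run_end >= c[1])
    if min_checkpoints > len(inside):
        return False
    # The gap runs from the end of one checkpoint to the start of the next.
    gaps = [nxt[0] - prev[1] for prev, nxt in zip(inside, inside[1:])]
    return all(gap >= min_gap_minutes for gap in gaps)


good = [(2, 3), (12, 13), (22, 23)]
too_close = [(2, 3), (8, 9), (22, 23)]
print(run_is_valid(0, 30, good))
print(run_is_valid(0, 30, too_close))
print(run_is_valid(0, 20, good))
</pre>

<p>The first call satisfies the example rule. The second fails because two of its checkpoints are only 5 minutes apart, and the third fails because the run is shorter than 30 minutes.</p>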


<p>With an update benchmark, many more things need to be defined, and the set-up becomes more system-specific than with a read-only workload. We will address these shortcomings in the measurement rules proposal to come. Especially with update workloads, the vendors need to provide tuning expertise; however, this will not happen if the benchmark does not properly set the expectations. If benchmarks serve as a catalyst for clearly defining how things are to be set up, then they will have served the end user.</p>


<h3>
<i>Benchmarks, Redux</i> Series</h3>
<ul>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1658" id="link-id0x1de61db8">Benchmarks, Redux (part 1): On RDF Benchmarks</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x1f9f96f8">Benchmarks, Redux (part 2): A Benchmarking Story</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1663" id="link-id0x1f89eeb0">Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1665" id="link-id0x1ad83f30">Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1667" id="link-id0x1de62178">Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1669" id="link-id0x1b2ec018">Benchmarks, Redux (part 6): BSBM and I/O, continued</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1671" id="link-id0x1ae6f028">Benchmarks, Redux (part 7): What Does BSBM Explore Measure?</a>
</li>
<li>
Benchmarks, Redux (part 8): BSBM Explore and Update <i>(this post)</i>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1675" id="link-id0x132605c0">Benchmarks, Redux (part 9): BSBM With Cluster</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1677" id="link-id0x1a9871b0">Benchmarks, Redux (part 10): LOD2 and the Benchmark Process</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1678" id="link-id0x1baa20f8">Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1e25a840">Benchmarks, Redux (part 12): Our Own BSBM Results Report</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1b53db20">Benchmarks, Redux (part 13): BSBM BI Modifications </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1e7ce520">Benchmarks, Redux (part 14): BSBM BI Mix </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1b18f400">Benchmarks, Redux (part 15): BSBM Test Driver Enhancements </a>
</li>

</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-03-07#1672">
  <rss:title>Benchmarks, Redux (part 7): What Does BSBM Explore Measure?</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-03-07T23:39:22Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We will here analyze what the BSBM Explore workload does. This is necessary in order to compare benchmark results at different scales. Historically, BSBM had a Query 6 whose share of the metric approached 100% as scale increased. The present mix does not have this query, but different queries still have different relative importance at different scales. We will here look at database-running statistics for BSBM at different scales. Finally, we look at CPU profiles. But first, let us see what BSBM reads in general. The system is in steady state after around 1500 query mixes; after this the working set does not shift much. After several thousand query mixes, we have: SELECT TOP 10 * FROM sys_d_stat ORDER BY reads DESC; KEY_TABLE INDEX_NAME TOUCHES READS READ_PCT N_DIRTY N_BUFFERS ================= ============================ ========== ======= ======== ======= ========= DB.DBA.RDF_OBJ RDF_OBJ 114105938 3302150 2 0 3171275 DB.DBA.RDF_QUAD RDF_QUAD 977426773 2041156 0 0 1970712 DB.DBA.RDF_IRI DB_DBA_RDF_IRI_UNQC_RI_ID 8250414 509239 6 15 491631 DB.DBA.RDF_QUAD RDF_QUAD_POGS 3677233812 183860 0 0 175386 DB.DBA.RDF_IRI RDF_IRI 32 99710 302151 5 95353 DB.DBA.RDF_QUAD RDF_QUAD_OP 30597 51593 168 0 48941 DB.DBA.RDF_QUAD RDF_QUAD_SP 265474 47210 17 0 46078 DB.DBA.RDF_PREFIX DB_DBA_RDF_PREFIX_UNQC_RP_ID 6020 212 3 0 212 DB.DBA.RDF_PREFIX RDF_PREFIX 0 167 16700 0 157 The first column is the table, then the index, then the number of times a row was found. The fourth number is the count of disk pages read. The last number is the count of 8K buffer pool pages in use for caching pages of the index in question. Note that the index is clustered, i.e., there is no table data structure separate from the index. Most of the reads are for strings or RDF literals. After this comes the PSOG index for getting a property value given the subject. After this, but much lower, we have lookups of IRI strings given the ID. 
The index from object value to subject is used the most but the number of pages is small; only a few properties seem to be concerned. The rest is minimal in comparison. Now let us reset the counts and see what the steady state I/O profile is. SELECT key_stat (key_table, name_part (key_name, 2), &#39;reset&#39;) FROM sys_keys WHERE key_migrate_to IS NULL; SELECT TOP 10 * FROM sys_d_stat ORDER BY reads DESC; KEY_TABLE INDEX_NAME TOUCHES READS READ_PCT N_DIRTY N_BUFFERS ================= ============================ ========== ======= ======== ======= ========= DB.DBA.RDF_OBJ RDF_OBJ 30155789 79659 0 0 3191391 DB.DBA.RDF_QUAD RDF_QUAD 259008064 8904 0 0 1948707 DB.DBA.RDF_QUAD RDF_QUAD_SP 68002 7730 11 0 53360 DB.DBA.RDF_IRI RDF_IRI 12 5415 41653 6 98804 DB.DBA.RDF_QUAD RDF_QUAD_POGS 975147136 1597 0 0 173459 DB.DBA.RDF_IRI DB_DBA_RDF_IRI_UNQC_RI_ID 2213525 1286 0 17 485093 DB.DBA.RDF_QUAD RDF_QUAD_OP 7999 904 11 0 48568 DB.DBA.RDF_PREFIX DB_DBA_RDF_PREFIX_UNQC_RP_ID 1494 1 0 0 213 Literal strings dominate. The SP index is used only for situations where the P is not specified, i.e., the DESCRIBE query. Based on this, I/O seems to be attributable mostly to this. The first RDF_IRI represents translations from string to IRI id; the second represents translations from IRI id to string. The touch count for the first RDF_IRI is not properly recorded, hence the miss % is out of line. We see SP missing the cache the most since its use is infrequent in the mix. We will next look at query processing statistics. For this we introduce a new meter. The db_activity SQL function provides a session-by-session cumulative statistic of activity. The fields are: rnd - Count of random index lookups. Each first row of a select or insert counts as one, regardless of whether something was found. seq - Count of sequential rows. Every move to next row on a cursor counts as 1, regardless of whether conditions match. 
same seg - For column store only; counts how many times the next row in a vectored join using an index falls in the same segment as the previous random access. A segment is the stretch of rows between entries in the sparse top level index on the column projection. same pg - Counts how many times a vectored index join finds the next match on the same page as the previous one. same par - Counts how many times the next lookup in a vectored index join falls on a different page than the previous but still under the same parent. disk - Counts how many disk reads were made, including any speculative reads initiated. spec disk - Counts speculative disk reads. messages - Counts cluster interconnect messages B (KB, MB, GB) - is the total length of the cluster interconnect messages. fork - Counts how many times a thread was forked (started) for query parallelization. The numbers are given with 4 significant digits and a scale suffix. G is 10^9 (1,000,000,000); M is 10^6 (1,000,000), K is 10^3 (1,000). We run 2000 query mixes with 16 Users. The special http account keeps a cumulative account of all activity on web server threads. SELECT db_activity (2, &#39;http&#39;); 1.674G rnd  3.223G seq      0 same seg  1.286G same pg  314.8M same par  6.186M disk  6.461M spec disk      0B /     0 messages  298.6K fork We see that random access dominates. The seq number is about twice the rnd number, meaning that the average random lookup gets two rows. Getting a row at random obviously takes more time than getting the next row. Since the index used is row-wise, the same seg is 0; the same pg indicates that 77% of the random accesses fall on the same page as the previous random access; most of the remaining random accesses fall under the same parent as the previous one. There are more speculative reads than disk reads which is an artifact of counting some concurrently speculated reads twice. This does indicate that speculative reads dominate. 
This is because a large part of the run was in the warm-up state with aggressive speculative reading. We reset the counts and run another 2000 mixes. Now let us look at the same reading after 2000 mixes, 16 user at 100Mt. 234.3M rnd  420.5M seq      0 same seg   188.8M same pg  29.09M same par  808.9K disk  919.9K spec disk      0B /      0 messages  76K fork We note that the ratios between the random and sequential and same page/parent counts are about the same. The sequential number looks to be even a bit smaller in proportion. The count of random accesses for the 100Mt run is 14% of the count for the 1000Mt run. The count of query parallelization threads is also much lower since it is worthwhile to schedule a new thread only if there are at least a few thousand operations to perform on it. The precise criterion for making a thread is that according to the cost model guess, the thread must have at least 5ms worth of work. We note that the 100 Mt throughput is a little over three-times that of the 1000 Mt throughput, as reported before. We might justifiably ask why the 100 Mt run is not seven-times faster instead, for this much less work. We note that for one-off random access, it makes no real difference whether the tree has 100 M or 1000 M rows; this translates to roughly 27 vs 30 comparisons, so the depth of the tree is not a factor per se. Besides, vectoring makes the tree often look only one or two levels deep, so the total row count matters even less there. To elucidate this last question, we look at the CPU profiles. We take an oprofile of 100 Single User mixes at both scales. 
For 100 Mt: 61161 10.1723 cmpf_iri64n_iri64n_anyn_gt_lt 31321 5.2093 box_equal 19027 3.1646 sqlo_parse_tree_has_node 15905 2.6453 dk_alloc 15647 2.6024 itc_next_set_neq 12702 2.1126 itc_vec_split_search 12487 2.0768 itc_dive_transit 11450 1.9044 itc_bm_vec_row_check 10646 1.7706 itc_page_rcf_search 9223 1.5340 id_hash_get 9215 1.5326 gen_qsort 8867 1.4748 sqlo_key_part_best 8807 1.4648 itc_param_cmp 8062 1.3409 cmpf_iri64n_iri64n 6820 1.1343 sqlo_in_list 6005 0.9987 dc_iri_id_cmp 5905 0.9821 dk_free_tree 5801 0.9648 box_hash 5509 0.9163 dks_esc_write 5444 0.9054 sql_tree_hash_1 For 1000 Mt 754331 31.4149 cmpf_iri64n_iri64n_anyn_gt_lt 146165 6.0872 itc_vec_split_search 144795 6.0301 itc_next_set_neq 131671 5.4836 itc_dive_transit 110870 4.6173 itc_page_rcf_search 66780 2.7811 gen_qsort 66434 2.7667 itc_param_cmp 58450 2.4342 itc_bm_vec_row_check 55213 2.2994 dk_alloc 47793 1.9904 cmpf_iri64n_iri64n 44277 1.8440 dc_iri_id_cmp 39489 1.6446 cmpf_int64n 36880 1.5359 dc_append_bytes 36601 1.5243 dv_compare 31286 1.3029 dc_any_value_prefetch 25457 1.0602 itc_next_set 20852 0.8684 box_equal 19895 0.8285 dk_free_tree 19698 0.8203 itc_page_insert_search 19367 0.8066 dc_copy The top function in both is the compare for an equality of two leading IRIs and a range for the trailing any. This corresponds to the range check in Q5. At the larger scale this is three times more important. At the smaller scale, the share of query optimization is about 6.5 times greater. The top function in this category is box_equal with 5.2% vs 0.87%. The remaining SQL compiler functions are all in proportion to this, totaling 14.3% of the 100 Mt top-20 profile. From this sample it appears ten times more scale is seven times more database operations. This is not taken into account in the metric. Query compilation is significant at the small end, and no longer significant at 1000 Mt. 
From these numbers, we could say that Virtuoso is about two times more efficient in terms of database operation throughput at 1000 Mt than at 100 Mt. We may conclude that different BSBM scales measure different things. The TPC workloads are relatively better in that they have a balance between metric components that stay relatively constant across a large range of scales. This is not necessarily something that should be fixed in the BSBM Explore mix. We must however take these factors better into account in developing the BI mix. Let us also remember that BSBM Explore is a relational workload. Future posts in this series will outline how we propose to make RDF-friendlier benchmarks. Benchmarks, Redux Series Benchmarks, Redux (part 1): On RDF Benchmarks Benchmarks, Redux (part 2): A Benchmarking Story Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs Benchmarks, Redux (part 6): BSBM and I/O, continued Benchmarks, Redux (part 7): What Does BSBM Explore Measure? (this post) Benchmarks, Redux (part 8): BSBM Explore and Update Benchmarks, Redux (part 9): BSBM With Cluster Benchmarks, Redux (part 10): LOD2 and the Benchmark Process Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks Benchmarks, Redux (part 12): Our Own BSBM Results Report Benchmarks, Redux (part 13): BSBM BI Modifications Benchmarks, Redux (part 14): BSBM BI Mix Benchmarks, Redux (part 15): BSBM Test Driver Enhancements</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We will here analyze what the <a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x1db49f28">BSBM</a> Explore workload does. This is necessary in order to compare benchmark results at different scales. Historically, BSBM had a Query 6 whose share of the metric approached 100% as scale increased. The present mix does not have this query, but different queries still have different relative importance at different scales.</p>

<p>We will here look at database-running statistics for BSBM at different scales. Finally, we look at <a class="auto-href" href="http://dbpedia.org/resource/Central_processing_unit" id="link-id0x1f150460">CPU</a> profiles.</p>


<p>But first, let us see what BSBM reads in general. The system is in steady state after around 1500 query mixes; after this the working set does not shift much. After several thousand query mixes, we have:</p>

<p>
<code>SELECT TOP 10 * FROM sys_d_stat ORDER BY reads DESC;</code>
</p>

<blockquote>
 <code><pre>
KEY_TABLE          INDEX_NAME                       TOUCHES    READS  READ_PCT  N_DIRTY  N_BUFFERS
=================  ============================  ==========  =======  ========  =======  =========
DB.DBA.RDF_OBJ     RDF_OBJ                        114105938  3302150         2        0    3171275
DB.DBA.RDF_QUAD    RDF_QUAD                       977426773  2041156         0        0    1970712
DB.DBA.RDF_IRI     DB_DBA_RDF_IRI_UNQC_RI_ID        8250414   509239         6       15     491631
DB.DBA.RDF_QUAD    RDF_QUAD_POGS                 3677233812   183860         0        0     175386
DB.DBA.RDF_IRI     RDF_IRI                               32    99710    302151        5      95353
DB.DBA.RDF_QUAD    RDF_QUAD_OP                        30597    51593       168        0      48941
DB.DBA.RDF_QUAD    RDF_QUAD_SP                       265474    47210        17        0      46078
DB.DBA.RDF_PREFIX  DB_DBA_RDF_PREFIX_UNQC_RP_ID        6020      212         3        0        212
DB.DBA.RDF_PREFIX  RDF_PREFIX                             0      167     16700        0        157
</pre>
 </code>
</blockquote>


<p>The first column is the table, then the index, then the number of times a row was found. The fourth number is the count of disk pages read. The last number is the count of 8K buffer pool pages in use for caching pages of the index in question. Note that the index is clustered, i.e., there is no table <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x1d4f9808">data</a> structure separate from the index. Most of the reads are for strings or RDF literals. After this comes the <code>PSOG</code> index for getting a property value given the subject. After this, but much lower, we have lookups of IRI strings given the ID. The index from object value to subject is used the most but the number of pages is small; only a few properties seem to be concerned. The rest is minimal in comparison.</p>

<p>Now let us reset the counts and see what the steady state I/O profile is.</p>

<p>
<code>SELECT key_stat (key_table, name_part (key_name, 2), &#39;reset&#39;) FROM sys_keys WHERE key_migrate_to IS NULL;</code>
</p>
<p>
<code>SELECT TOP 10 * FROM sys_d_stat ORDER BY reads DESC;</code>
</p>

<blockquote>
 <code><pre>
KEY_TABLE          INDEX_NAME                       TOUCHES    READS  READ_PCT  N_DIRTY  N_BUFFERS
=================  ============================  ==========  =======  ========  =======  =========
DB.DBA.RDF_OBJ     RDF_OBJ                         30155789    79659         0        0    3191391
DB.DBA.RDF_QUAD    RDF_QUAD                       259008064     8904         0        0    1948707
DB.DBA.RDF_QUAD    RDF_QUAD_SP                        68002     7730        11        0      53360
DB.DBA.RDF_IRI     RDF_IRI                               12     5415     41653        6      98804
DB.DBA.RDF_QUAD    RDF_QUAD_POGS                  975147136     1597         0        0     173459
DB.DBA.RDF_IRI     DB_DBA_RDF_IRI_UNQC_RI_ID        2213525     1286         0       17     485093
DB.DBA.RDF_QUAD    RDF_QUAD_OP                         7999      904        11        0      48568
DB.DBA.RDF_PREFIX  DB_DBA_RDF_PREFIX_UNQC_RP_ID        1494        1         0        0        213
</pre>
 </code>
</blockquote>



<p>Literal strings dominate. The <code>SP</code> index is used only where the <code>P</code> is not specified, i.e., by the <code>DESCRIBE</code> query. Based on this, the I/O seems to be attributable mostly to the <code>DESCRIBE</code> query. The first <code>RDF_IRI</code> represents translations from string to IRI id; the second represents translations from IRI id to string. The touch count for the first <code>RDF_IRI</code> is not properly recorded, hence the miss % is out of line. We see <code>SP</code> missing the <a class="auto-href" href="http://dbpedia.org/resource/Cache" id="link-id0x17d2e670">cache</a> the most, since its use is infrequent in the mix.</p>


<p>We will next look at query processing statistics. For this we introduce a new meter.</p>

<p>The <code>db_activity</code> <a class="auto-href" href="http://dbpedia.org/resource/SQL" id="link-id0x1d4915b8">SQL</a> function provides a session-by-session cumulative statistic of activity. The fields are: </p>

<ul>
<li>
  <b><code>rnd</code>
  </b> - Count of <i>random index lookups</i>. Each first row of a select or insert counts as one, regardless of whether something was found.</li>
<li>
  <b><code>seq</code>
  </b> - Count of <i>sequential rows</i>. Every move to next row on a cursor counts as 1, regardless of whether conditions match.</li>
<li>
  <b><code>same seg</code>
  </b> - For column store only; counts how many times the next row in a vectored join using an index falls in the <i>same segment</i> as the previous random access. A segment is the stretch of rows between entries in the sparse top level index on the column projection.</li>
<li>
  <b><code>same pg</code>
  </b> - Counts how many times a vectored index join finds the next match on the <i>same page</i> as the previous one.</li>
<li>
  <b><code>same par</code>
  </b> - Counts how many times the next lookup in a vectored index join falls on a different page than the previous but still under the <i>same parent</i>.</li>
<li>
  <b><code>disk</code>
  </b> - Counts how many <i>disk reads</i> were made, including any speculative reads initiated.</li>
<li>
  <b><code>spec disk</code>
  </b> - Counts <i>speculative disk reads</i>.</li>
<li>
  <b><code>messages</code>
  </b> - Counts <i>cluster interconnect messages</i>.</li>
<li>
  <b><code>B (KB, MB, GB)</code>
  </b> - The <i>total length</i> of the cluster interconnect messages, in bytes.</li>
<li>
  <b><code>fork</code>
  </b> - Counts how many times a <i>thread was forked (started)</i> for query parallelization.</li>
</ul>

<p>The numbers are given with 4 significant digits and a scale suffix. G is 10^9 (1,000,000,000); M is 10^6 (1,000,000), K is 10^3 (1,000).</p>

<p>We run 2000 query mixes with 16 Users. The special <code><a class="auto-href" href="http://dbpedia.org/resource/Hypertext_Transfer_Protocol" id="link-id0x1bf7f318">http</a></code> account keeps a cumulative account of all activity on web server threads.</p>

<blockquote>
<p>
  <code>SELECT db_activity (2, &#39;http&#39;);</code>
</p>
<p>
  <code>1.674G rnd  3.223G seq      0 same seg   1.286G same pg  314.8M same par  6.186M disk  6.461M spec disk      0B /     0 messages  298.6K fork</code>
</p>
</blockquote>

<p>We see that random access dominates. The <code>seq</code> number is about twice the <code>rnd</code> number, meaning that the average random lookup gets two rows. Getting a row at random obviously takes more time than getting the next row. Since the index used is row-wise, the <code>same seg</code> is 0; the <code>same pg</code> indicates that 77% of the random accesses fall on the same page as the previous random access; most of the remaining random accesses fall under the same parent as the previous one.</p>
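<p>The ratios quoted above can be rechecked directly from the <code>db_activity</code> output. The following is a minimal Python sketch, not part of Virtuoso: the helper name <code>parse_activity</code> and the regular expression are assumptions about the report layout, based only on the line shown above.</p>

```python
import re

# Decimal scale suffixes, as described for the db_activity report.
SCALE = {"": 1, "K": 1e3, "M": 1e6, "G": 1e9}

# The report line from the run above, copied verbatim.
SAMPLE = ("1.674G rnd  3.223G seq      0 same seg   1.286G same pg  "
          "314.8M same par  6.186M disk  6.461M spec disk      "
          "0B /     0 messages  298.6K fork")

def parse_activity(line):
    """Return a dict mapping counter name to value for one report line.

    Hypothetical helper: only the counters named in the pattern are read;
    the interconnect byte and message fields are skipped.
    """
    pattern = (r"([\d.]+)([KMG]?)\s+"
               r"(rnd|seq|same seg|same pg|same par|spec disk|disk|fork)")
    counters = {}
    for number, suffix, name in re.findall(pattern, line):
        counters[name] = float(number) * SCALE[suffix]
    return counters

counters = parse_activity(SAMPLE)
rnd = counters["rnd"]
print(f"sequential rows per random lookup: {counters['seq'] / rnd:.2f}")
print(f"same-page share of random lookups: {counters['same pg'] / rnd:.1%}")
print(f"same-parent share of random lookups: {counters['same par'] / rnd:.1%}")
```

<p>For the reading above, this gives about 1.9 sequential rows per random lookup, with roughly 77% of the lookups landing on the same page and 19% under the same parent.</p>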

<p>There are more speculative reads than disk reads, which is an artifact of counting some concurrently speculated reads twice. Still, this indicates that speculative reads dominate, because a large part of the run was in the warm-up state with aggressive speculative reading. We reset the counts and run another 2000 mixes.</p>

<p>Now let us look at the same reading after 2000 mixes with 16 users at 100 Mt.</p>

<blockquote>
<p>
  <code>234.3M rnd  420.5M seq      0 same seg   188.8M same pg  29.09M same par  808.9K disk  919.9K spec disk      0B /      0 messages     76K fork</code>
</p>
</blockquote>


<p>We note that the ratios between the random, sequential, and same page/parent counts are about the same. The sequential number looks to be even a bit smaller in proportion. The count of random accesses for the 100 Mt run is 14% of the count for the 1000 Mt run. The count of query parallelization threads is also much lower, since it is worthwhile to schedule a new thread only if there are at least a few thousand operations to perform on it. The precise criterion for making a thread is that, according to the cost model's guess, the thread must have at least 5 ms worth of work.</p>

<p>We note that the 100 Mt throughput is a little over three times that of the 1000 Mt throughput, as reported before. We might justifiably ask why the 100 Mt run is not seven times faster instead, given that it does this much less work. </p>

<p>We note that for one-off random access, it makes no real difference whether the tree has 100 M or 1000 M rows; this translates to roughly 27 vs 30 comparisons, so the depth of the tree is not a factor <i>per se</i>. Besides, vectoring makes the tree often look only one or two levels deep, so the total row count matters even less there.</p>
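<p>The 27 vs 30 figure follows from the base-2 logarithm of the row count. A quick check in Python, assuming an idealized binary search over all rows (the function name is purely illustrative):</p>

```python
import math

def comparisons(rows):
    """Approximate key comparisons for one binary-search lookup among `rows` sorted keys."""
    return math.ceil(math.log2(rows))

for rows in (100_000_000, 1_000_000_000):
    print(rows, comparisons(rows))
```

<p>Ten times more rows adds only about three comparisons per lookup, which is why the tree depth alone cannot explain the difference in throughput.</p>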

<p>To elucidate this last question, we look at the CPU profiles. We take an <a href="http://oprofile.sourceforge.net/about/" id="link-id0x1efb3360">oprofile</a> of 100 single-user mixes at both scales.</p>

For 100 Mt:

<blockquote>
 <code><pre>
61161    10.1723  cmpf_iri64n_iri64n_anyn_gt_lt
31321     5.2093  box_equal
19027     3.1646  sqlo_parse_tree_has_node
15905     2.6453  dk_alloc
15647     2.6024  itc_next_set_neq
12702     2.1126  itc_vec_split_search
12487     2.0768  itc_dive_transit
11450     1.9044  itc_bm_vec_row_check
10646     1.7706  itc_page_rcf_search
 9223     1.5340  id_hash_get
 9215     1.5326  gen_qsort
 8867     1.4748  sqlo_key_part_best
 8807     1.4648  itc_param_cmp
 8062     1.3409  cmpf_iri64n_iri64n
 6820     1.1343  sqlo_in_list
 6005     0.9987  dc_iri_id_cmp
 5905     0.9821  dk_free_tree
 5801     0.9648  box_hash
 5509     0.9163  dks_esc_write
 5444     0.9054  sql_tree_hash_1
</pre>
 </code>
</blockquote>


For 1000 Mt:

<blockquote>
 <code><pre>
754331   31.4149  cmpf_iri64n_iri64n_anyn_gt_lt
146165    6.0872  itc_vec_split_search
144795    6.0301  itc_next_set_neq
131671    5.4836  itc_dive_transit
110870    4.6173  itc_page_rcf_search
 66780    2.7811  gen_qsort
 66434    2.7667  itc_param_cmp
 58450    2.4342  itc_bm_vec_row_check
 55213    2.2994  dk_alloc
 47793    1.9904  cmpf_iri64n_iri64n
 44277    1.8440  dc_iri_id_cmp
 39489    1.6446  cmpf_int64n
 36880    1.5359  dc_append_bytes
 36601    1.5243  dv_compare
 31286    1.3029  dc_any_value_prefetch
 25457    1.0602  itc_next_set
 20852    0.8684  box_equal
 19895    0.8285  dk_free_tree
 19698    0.8203  itc_page_insert_search
 19367    0.8066  dc_copy
</pre>
 </code>
</blockquote>


<p>The top function in both is the compare for an equality of two leading IRIs and a range for the trailing any. This corresponds to the range check in Q5. At the larger scale this is three times more important. At the smaller scale, the share of query <a class="auto-href" href="http://dbpedia.org/resource/Program_optimization" id="link-id0x1bf8ca38">optimization</a> is about 6.5 times greater. The top function in this category is <code>box_equal</code> with 5.2% vs 0.87%. The remaining SQL compiler functions are all in proportion to this, totaling 14.3% of the 100 Mt top-20 profile.</p>
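<p>The per-function shares can be compared mechanically. The sketch below is an illustration only: it assumes each profile line has the form "samples, percent, symbol" as in the listings above, uses just two lines from each listing, and the helper name <code>parse_profile</code> is hypothetical.</p>

```python
def parse_profile(text):
    """Map symbol name to its percentage of samples, one profile line per row."""
    shares = {}
    for line in text.strip().splitlines():
        samples, percent, symbol = line.split()
        shares[symbol] = float(percent)
    return shares

# Two lines from each listing above, copied verbatim.
PROFILE_100MT = """
61161    10.1723  cmpf_iri64n_iri64n_anyn_gt_lt
31321     5.2093  box_equal
"""
PROFILE_1000MT = """
754331   31.4149  cmpf_iri64n_iri64n_anyn_gt_lt
20852     0.8684  box_equal
"""

small = parse_profile(PROFILE_100MT)
large = parse_profile(PROFILE_1000MT)

# Share of the range compare grows with scale; share of box_equal shrinks.
compare_growth = (large["cmpf_iri64n_iri64n_anyn_gt_lt"]
                  / small["cmpf_iri64n_iri64n_anyn_gt_lt"])
box_equal_shrink = small["box_equal"] / large["box_equal"]
print(f"range compare share, 1000 Mt vs 100 Mt: {compare_growth:.2f}x")
print(f"box_equal share, 100 Mt vs 1000 Mt: {box_equal_shrink:.2f}x")
```

<p>The range compare comes out at about three times the share at the larger scale, and <code>box_equal</code> alone at about six times the share at the smaller scale.</p>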

<p>From this sample, it appears that ten times the scale means seven times the database operations. This is not taken into account in the metric. Query compilation is significant at the small end, and no longer significant at 1000 Mt. From these numbers, we could say that <a class="auto-href" href="http://virtuoso.openlinksw.com" id="link-id0x1be12350">Virtuoso</a> is about two times more efficient in terms of database operation throughput at 1000 Mt than at 100 Mt.</p>
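<p>The arithmetic behind these two figures can be spelled out. This is a rough sketch: the random lookup counts come from the two <code>db_activity</code> readings, while the throughput ratio of exactly 3.0 is an assumption standing in for "a little over three times".</p>

```python
# Random lookup counts from the two db_activity readings.
rnd_1000mt = 1.674e9
rnd_100mt = 234.3e6

# Assumption: "a little over three times" is taken as exactly 3.0.
throughput_ratio = 3.0

# How many more database operations the larger scale performs per 2000 mixes.
ops_ratio = rnd_1000mt / rnd_100mt

# Operations per unit of time at 1000 Mt relative to 100 Mt.
relative_efficiency = ops_ratio / throughput_ratio

print(f"operations ratio: {ops_ratio:.2f}")
print(f"relative operation throughput: {relative_efficiency:.2f}")
```

<p>With a throughput ratio slightly above three, the relative efficiency lands a little over two, in line with the estimate above.</p>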



<p>We may conclude that different BSBM scales measure different things. The <a class="auto-href" href="http://www.tpc.org/" id="link-id0x17eb98a0">TPC</a> workloads are relatively better in that they have a balance between metric components that stay relatively constant across a large range of scales.</p>


<p>This is not necessarily something that should be fixed in the BSBM Explore mix. We must however take these factors better into account in developing the BI mix.</p>

<p>Let us also remember that BSBM Explore is a relational workload. Future posts in this series will outline how we propose to make RDF-friendlier benchmarks. </p>


<h3>
<i>Benchmarks, Redux</i> Series</h3>
<ul>
<li> <a href="http://www.openlinksw.com/weblog/oerling/?id=1658" id="link-id0x1a9bcff8">Benchmarks, Redux (part 1): On RDF Benchmarks</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x1d3e5470">Benchmarks, Redux (part 2): A Benchmarking Story</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1663" id="link-id0x1de94770">Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1665" id="link-id0x1ea66470">Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1667" id="link-id0x1f1118d8">Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs </a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1669" id="link-id0x1d1c0cd8">Benchmarks, Redux (part 6): BSBM and I/O, continued</a>
</li>
<li>
 Benchmarks, Redux (part 7): What Does BSBM Explore Measure? <i>(this post)</i>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1673" id="link-id0x1aaf4180">Benchmarks, Redux (part 8): BSBM Explore and Update </a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1675" id="link-id0x1a957610">Benchmarks, Redux (part 9): BSBM With Cluster</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1677" id="link-id0x127e75c8">Benchmarks, Redux (part 10): LOD2 and the Benchmark Process</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1678" id="link-id0x1c9400f0">Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1d2c1d68">Benchmarks, Redux (part 12): Our Own BSBM Results Report</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1ea1fb40">Benchmarks, Redux (part 13): BSBM BI Modifications </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1c073a10">Benchmarks, Redux (part 14): BSBM BI Mix </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1c5541e8">Benchmarks, Redux (part 15): BSBM Test Driver Enhancements </a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-03-07#1670">
  <rss:title>Benchmarks, Redux (part 6): BSBM and I/O, continued</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-03-07T22:36:24Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">In the words of Jim Gray, disks have become tapes. By this he means that a disk is really only good for sequential access. For this reason, the SSD extent read ahead was incomparably better. We note that in the experiment, every page in the general area of the database the experiment touched would in time be touched, and that the whole working set would end up in memory. Therefore no speculative read would be wasted. Therefore it stands to reason to read whole extents. So I changed the default behavior to use a very long window for triggering read-ahead as long as the buffer pool was not full. After the initial filling of the buffer pool, the read ahead would require more temporal locality before kicking in. Still, the scheme was not really good since the rest of the extent would go for background-read and the triggering read would be done right then, leading to extra seeks. Well, this is good for latency but bad for throughput. So I changed this too, going to an &quot;elevator only&quot; scheme where reads that triggered read-ahead would go with the read-ahead batch. Reads that did not trigger read-ahead would still be done right in place, thus favoring latency but breaking any sequentiality with its attendant 10+ ms penalty. We keep in mind that the test we target is BSBM warm-up time, which is purely a throughput business. One could have timeouts and could penalize queries that sacrificed too much latency to throughput. We note that even for this very simple metric, just reading the allocated database pages from start to end is not good since a large number of pages in fact never get read during a run. We further note that the vectored read-ahead without any speculation will be useful as-is for cases with few threads and striping, since at least one thread&#39;s random I/Os get to go to multiple threads. The benefit is less in multiuser situations where disks are randomly busy anyhow. 
In the previous I/O experiments, we saw that with vectored read ahead and no speculation, there were around 50 pages waiting for I/O at all times. With an easily-triggered extent read-ahead, there were around 4000 pages waiting. The more pages are waiting for I/O, the greater the benefit from the elevator algorithm of servicing I/O in order of file offset. In Virtuoso 5 we had a trick that would, if the buffer pool was not full, speculatively read every uncached sibling of every index tree node it visited. This filled the cache quite fast, but was useless after the cache was full. The extent read ahead first implemented in 6 was less aggressive, but would continue working with full cache and did in fact help with shifts in the working set. The next logical step is to combine the vector and extent read-ahead modes. We see what pages we will be getting, then take the distinct extents; if we have been to this extent within the time window, we just add all the uncached allocated pages of the extent to the batch. With this setting, especially at the start of the run, we get large read-ahead batches and maintain I/O queues of 5000 to 20000 pages. The SSD starting time drops to about 120 seconds from cold start to reach 1200% CPU. We see transfer rates of up to 150 MB/s per SSD. With HDDs, we see transfer rates around 14 MB/s per drive, mostly reading chunks of an average of seventy-one (71) 8K pages. The BSBM workload does not offer better possibilities for optimization, short of pre-reading the whole database, which is not practical at large scales. Some Details First we start from cold disk, with and without mandatory read of the whole extent on the touch. 
Without any speculation but with vectored read-ahead, here are the times for the first 11 query mixes: 0: 151560.82 ms, total: 151718 ms 1: 179589.08 ms, total: 179648 ms 2: 71974.49 ms, total: 72017 ms 3: 102701.73 ms, total: 102729 ms 4: 58834.41 ms, total: 58856 ms 5: 65926.34 ms, total: 65944 ms 6: 68244.69 ms, total: 68274 ms 7: 39197.15 ms, total: 39215 ms 8: 45654.93 ms, total: 45674 ms 9: 34850.30 ms, total: 34878 ms 10: 100061.30 ms, total: 100079 ms The average CPU during this time was 5%. The best read throughput was 2.5 MB/s; the average was 1.35 MB/s. The average disk read was 16 ms. With vectored read-ahead and full extents only, i.e., max speculation: 0: 178854.23 ms, total: 179034 ms 1: 110826.68 ms, total: 110887 ms 2: 19896.11 ms, total: 19941 ms 3: 36724.43 ms, total: 36753 ms 4: 21253.70 ms, total: 21285 ms 5: 18417.73 ms, total: 18439 ms 6: 21668.92 ms, total: 21690 ms 7: 12236.49 ms, total: 12267 ms 8: 14922.74 ms, total: 14945 ms 9: 11502.96 ms, total: 11523 ms 10: 15762.34 ms, total: 15792 ms ... 90: 1747.62 ms, total: 1761 ms 91: 1701.01 ms, total: 1714 ms 92: 1300.62 ms, total: 1318 ms 93: 1873.15 ms, total: 1886 ms 94: 1508.24 ms, total: 1524 ms 95: 1748.15 ms, total: 1761 ms 96: 2076.92 ms, total: 2090 ms 97: 2199.38 ms, total: 2212 ms 98: 2305.75 ms, total: 2319 ms 99: 1771.91 ms, total: 1784 ms Scale factor: 2848260 Number of warmup runs: 0 Seed: 808080 Number of query mix runs (without warmups): 100 times min/max Querymix runtime: 1.3006s / 178.8542s Elapsed runtime: 872.993 seconds QMpH: 412.374 query mixes per hour The peak throughput is 91 MB/s, with average around 50 MB/s; CPU average around 50%. We note that the latency of the first query mix is hardly greater than in the non-speculative run, but starting from mix 3 the speed is clearly better. Then the same with cold SSDs. 
First with no speculation: 0: 5177.68 ms, total: 5302 ms 1: 2570.16 ms, total: 2614 ms 2: 1353.06 ms, total: 1391 ms 3: 1957.63 ms, total: 1978 ms 4: 1371.13 ms, total: 1386 ms 5: 1765.55 ms, total: 1781 ms 6: 1658.23 ms, total: 1673 ms 7: 1273.87 ms, total: 1289 ms 8: 1355.19 ms, total: 1380 ms 9: 1152.78 ms, total: 1167 ms 10: 1787.91 ms, total: 1802 ms ... 90: 1116.25 ms, total: 1128 ms 91: 989.50 ms, total: 1001 ms 92: 833.24 ms, total: 844 ms 93: 1137.83 ms, total: 1150 ms 94: 969.47 ms, total: 982 ms 95: 1138.04 ms, total: 1149 ms 96: 1155.98 ms, total: 1168 ms 97: 1178.15 ms, total: 1193 ms 98: 1120.18 ms, total: 1132 ms 99: 1013.16 ms, total: 1025 ms Scale factor: 2848260 Number of warmup runs: 0 Seed: 808080 Number of query mix runs (without warmups): 100 times min/max Querymix runtime: 0.8201s / 5.1777s Elapsed runtime: 127.555 seconds QMpH: 2822.321 query mixes per hour The peak I/O is 45 MB/s, with average 28.3 MB/s; CPU average is 168%. Now, SSDs with max speculation. 0: 44670.34 ms, total: 44809 ms 1: 18490.44 ms, total: 18548 ms 2: 7306.12 ms, total: 7353 ms 3: 9452.66 ms, total: 9485 ms 4: 5648.56 ms, total: 5668 ms 5: 5493.21 ms, total: 5511 ms 6: 5951.48 ms, total: 5970 ms 7: 3815.59 ms, total: 3834 ms 8: 4560.71 ms, total: 4579 ms 9: 3523.74 ms, total: 3543 ms 10: 4724.04 ms, total: 4741 ms ... 90: 673.53 ms, total: 685 ms 91: 534.62 ms, total: 545 ms 92: 730.81 ms, total: 742 ms 93: 1358.14 ms, total: 1370 ms 94: 1098.64 ms, total: 1110 ms 95: 1232.20 ms, total: 1243 ms 96: 1259.57 ms, total: 1273 ms 97: 1298.95 ms, total: 1310 ms 98: 1156.01 ms, total: 1166 ms 99: 1025.45 ms, total: 1034 ms Scale factor: 2848260 Number of warmup runs: 0 Seed: 808080 Number of query mix runs (without warmups): 100 times min/max Querymix runtime: 0.4725s / 44.6703s Elapsed runtime: 269.323 seconds QMpH: 1336.683 query mixes per hour The peak I/O is 339 MB/s, with average 192 MB/s; average CPU is 121%. 
The above was measured with the read-ahead thread doing single-page reads. We repeated the test with merging reads with small differences. The max IO was 353 MB/s, and average 173 MB/s; average CPU 113%. We see that the start latency is quite a bit longer than without speculation and the CPU % is lower due to higher latency of individual I/O. The I/O rate is fair. We would expect more throughput however. We find that a supposedly better use of the API, doing single requests of up to 100 pages instead of consecutive requests of 1 page, does not make a lot of difference. The peak I/O is a bit higher; overall throughput is a bit lower. We will have to retry these experiments with a better controller. We have at no point seen anything like the 50K 4KB random I/Os promised for the SSDs by the manufacturer. We know for a fact that the controller gives about 700 MB/s sequential read with cat file /dev/null and two drives busy. With 4 drives busy, this does not get better. The best 30 second stretch we saw in a multiuser BSBM warm-up was 590 MB/s, which is consistent with the cat to /dev/null figure. We will later test with 8 SSDs with better controllers. Note that the average I/O and CPU are averages over 30 second measurement windows; thus for short running tests, there is some error from the window during which the activity ended. Let us now see if we can make a BSBM instance warm up from disk in a reasonable time. We run 16 users with max speculation. We note that after reading 7,500,000 buffers we are not entirely free of disk. The max speculation read-ahead filled the cache in 17 minutes, with an average of 58 MB/s. After the cache is filled, the system shifts to a more conservative policy on extent read-ahead; one which in fact never gets triggered with the BSBM Explore in steady state. The vectored read-ahead is kept on since this by itself does not read pages that are not needed. 
However, the vectored read-ahead does not run either, because the data that is accessed in larger batches is already in memory. Thus there remains a trickle of an average 0.49 MB/s from disk. This keeps CPU around 350%. With SSDs, the trickle is about 1.5 MB/s and CPU is around 1300% in steady state. Thus SSDs give approximately triple the throughput in a situation where there is a tiny amount of continuous random disk access. The disk access in question is 80% for retrieving RDF literal strings, presumably on behalf of the DESCRIBE query in the mix. This query touches things no other query touches and does so one subject at a time, in a way that can neither be anticipated nor optimized. The Virtuoso 7 column store will deal with this better because it is more space efficient overall. If we apply stream-compression to literals, these will go in under half the space, while quads will go in maybe one-quarter the space. Thus 3000 Mt all from memory should be possible with 72 GB RAM. 1000 Mt row-wise did fit in 72 GB RAM except for the random literals accessed by the DESCRIBE. This alone drops throughput to under a third of the memory-only throughput if using HDDs. SSDs, on the other hand, can largely neutralize this effect. Conclusions We have looked at basics of I/O. SSDs have been found to be a readily available solution to I/O bottlenecks without need for reconfiguration or complex I/O policies. We have been able to get a decent read rate under conditions of server warm-up or shift of working set even with HDDs. More advanced I/O matters will be covered with the column store. We note that the techniques discussed here apply identically to rows and columns. As concerns BSBM, it seems appropriate to include a warm-up time. In practice, this means that the store just must eagerly pre-read. This is not hard to do and can be quite useful. 
Benchmarks, Redux Series Benchmarks, Redux (part 1): On RDF Benchmarks Benchmarks, Redux (part 2): A Benchmarking Story Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs Benchmarks, Redux (part 6): BSBM and I/O, continued (this post) Benchmarks, Redux (part 7): What Does BSBM Explore Measure? Benchmarks, Redux (part 8): BSBM Explore and Update Benchmarks, Redux (part 9): BSBM With Cluster Benchmarks, Redux (part 10): LOD2 and the Benchmark Process Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks Benchmarks, Redux (part 12): Our Own BSBM Results Report Benchmarks, Redux (part 13): BSBM BI Modifications Benchmarks, Redux (part 14): BSBM BI Mix Benchmarks, Redux (part 15): BSBM Test Driver Enhancements</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>In the words of Jim Gray, disks have become tapes. By this he means that a disk is really only good for sequential access. For this reason, the SSD extent read-ahead was incomparably better. We note that in the experiment, every page in the general area of the database the experiment touched would in time be touched, and that the whole working set would end up in memory. Therefore no speculative read would be wasted, and it stands to reason to read whole extents.</p>

<p>So I changed the default behavior to use a very long window for triggering read-ahead as long as the buffer pool was not full. After the initial filling of the buffer pool, the read-ahead would require more temporal locality before kicking in.</p>

<p>Still, the scheme was not really good since the rest of the extent would go for background-read and the triggering read would be done right then, leading to extra seeks. Well, this is good for latency but bad for throughput. So I changed this too, going to an &quot;elevator only&quot; scheme where reads that triggered read-ahead would go with the read-ahead batch. Reads that did not trigger read-ahead would still be done right in place, thus favoring latency but breaking any sequentiality with its attendant 10+ ms penalty.</p>


<p>We keep in mind that the test we target is <a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x17c88010">BSBM</a> warm-up time, which is purely a throughput business. One could have timeouts and could penalize queries that sacrificed too much latency to throughput.</p>

<p>We note that even for this very simple metric, just reading the allocated database pages from start to end is not good since a large number of pages in fact never get read during a run.</p>

<p>We further note that the vectored read-ahead without any speculation will be useful as-is for cases with few threads and striping, since at least one thread&#39;s random I/Os get to go to multiple disks. The benefit is less in multiuser situations where disks are randomly busy anyhow. </p>

<p>In the previous I/O experiments, we saw that with vectored read-ahead and no speculation, there were around 50 pages waiting for I/O at all times. With an easily-triggered extent read-ahead, there were around 4000 pages waiting. The more pages are waiting for I/O, the greater the benefit from the elevator algorithm of servicing I/O in order of file offset. </p>

<p>In <a class="auto-href" href="http://virtuoso.openlinksw.com" id="link-id0x1c51fae0">Virtuoso</a> 5 we had a trick that would, if the buffer pool was not full, speculatively read every uncached sibling of every index tree node it visited. This filled the <a class="auto-href" href="http://dbpedia.org/resource/Cache" id="link-id0x1d6a0cf0">cache</a> quite fast, but was useless after the cache was full. The extent read-ahead first implemented in Virtuoso 6 was less aggressive, but would continue working with a full cache and did in fact help with shifts in the working set.</p>

<p>The next logical step is to combine the vector and extent read-ahead modes. We see what pages we will be getting, then take the distinct extents; if we have been to this extent within the time window, we just add all the uncached allocated pages of the extent to the batch.</p>
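<p>The combined policy described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Virtuoso implementation: the class name, the callbacks <code>is_cached</code> and <code>is_allocated</code>, and the extent size of 256 pages are all hypothetical choices made for the example.</p>

```python
import time

# Assumption for illustration only: number of pages in one extent.
PAGES_PER_EXTENT = 256

class ReadAheadPlanner:
    """Plans one read batch from the pages a vectored lookup is about to touch."""

    def __init__(self, window_seconds, is_cached, is_allocated, clock=time.monotonic):
        self.window = window_seconds      # how recent a previous visit must be
        self.is_cached = is_cached        # callback: page number -> bool
        self.is_allocated = is_allocated  # callback: page number -> bool
        self.clock = clock
        self.last_visit = {}              # extent number -> time of last visit

    def plan(self, wanted_pages):
        """Return the page numbers to read, sorted by position in the file."""
        now = self.clock()
        # Vectored part: the uncached pages we know we will need.
        batch = {p for p in wanted_pages if not self.is_cached(p)}
        # Extent part: widen to the whole extent if it was visited recently.
        for extent in {p // PAGES_PER_EXTENT for p in wanted_pages}:
            last = self.last_visit.get(extent)
            if last is not None and self.window >= now - last:
                first = extent * PAGES_PER_EXTENT
                for page in range(first, first + PAGES_PER_EXTENT):
                    if self.is_allocated(page) and not self.is_cached(page):
                        batch.add(page)
            self.last_visit[extent] = now
        # Ascending order lets the I/O be serviced elevator-style.
        return sorted(batch)
```

<p>A long window makes almost every second visit to an extent pull in the rest of it, which corresponds to the aggressive warm-up behavior; a short window requires more temporal locality before the extent is read.</p>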

<p>With this setting, especially at the start of the run, we get large read-ahead batches and maintain I/O queues of 5000 to 20000 pages. The SSD starting time drops to about 120 seconds from cold start to reach 1200% <a class="auto-href" href="http://dbpedia.org/resource/Central_processing_unit" id="link-id0x1d295448">CPU</a>. We see transfer rates of up to 150 MB/s per SSD. With HDDs, we see transfer rates around 14 MB/s per drive, mostly reading chunks of an average of seventy-one (71) 8K pages.</p>

<p>The BSBM workload does not offer better possibilities for <a class="auto-href" href="http://dbpedia.org/resource/Program_optimization" id="link-id0x1aca8b40">optimization</a>, short of pre-reading the whole database, which is not practical at large scales. </p>

<h2>Some Details</h2>

<p>First we start from cold disk, with and without mandatory read of the whole extent on the touch.</p>

<p>Without any speculation but with vectored read-ahead, here are the times for the first 11 query mixes:</p>

<blockquote>
 <code><pre>
 0: 151560.82 ms, total: 151718 ms
 1: 179589.08 ms, total: 179648 ms
 2:  71974.49 ms, total:  72017 ms
 3: 102701.73 ms, total: 102729 ms
 4:  58834.41 ms, total:  58856 ms
 5:  65926.34 ms, total:  65944 ms
 6:  68244.69 ms, total:  68274 ms
 7:  39197.15 ms, total:  39215 ms
 8:  45654.93 ms, total:  45674 ms
 9:  34850.30 ms, total:  34878 ms
10: 100061.30 ms, total: 100079 ms
</pre>
 </code>
</blockquote>

<p>The average CPU during this time was 5%. The best read throughput was 2.5 MB/s; the average was 1.35 MB/s. The average disk read took 16 ms. </p>

<p>With vectored read-ahead and full extents only, i.e., max speculation:</p>

<blockquote>
 <code><pre>
 0: 178854.23 ms, total: 179034 ms
 1: 110826.68 ms, total: 110887 ms
 2:  19896.11 ms, total:  19941 ms
 3:  36724.43 ms, total:  36753 ms
 4:  21253.70 ms, total:  21285 ms
 5:  18417.73 ms, total:  18439 ms
 6:  21668.92 ms, total:  21690 ms
 7:  12236.49 ms, total:  12267 ms
 8:  14922.74 ms, total:  14945 ms
 9:  11502.96 ms, total:  11523 ms
10:  15762.34 ms, total:  15792 ms
...

90:   1747.62 ms, total:   1761 ms
91:   1701.01 ms, total:   1714 ms
92:   1300.62 ms, total:   1318 ms
93:   1873.15 ms, total:   1886 ms
94:   1508.24 ms, total:   1524 ms
95:   1748.15 ms, total:   1761 ms
96:   2076.92 ms, total:   2090 ms
97:   2199.38 ms, total:   2212 ms
98:   2305.75 ms, total:   2319 ms
99:   1771.91 ms, total:   1784 ms

Scale factor:              2848260
Number of warmup runs:     0
Seed:                      808080
Number of query mix runs 
  (without warmups):       100 times
min/max Querymix runtime:  1.3006s / 178.8542s
Elapsed runtime:           872.993 seconds
QMpH:                      412.374 query mixes per hour
</pre>
 </code>
</blockquote>


<p>The peak throughput is 91 MB/s, with average around 50 MB/s; CPU average around 50%.</p>

<p>We note that the latency of the first query mix is hardly greater than in the non-speculative run, but starting from mix 3 the speed is clearly better. </p>
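<p>The QMpH figure in the summary appears to be simply the number of query mixes scaled to an hour of elapsed time. A minimal Python check, assuming that definition (the function name is illustrative):</p>

```python
def qmph(query_mixes, elapsed_seconds):
    """Query mixes per hour, given the number of mixes and the elapsed wall time."""
    return query_mixes * 3600.0 / elapsed_seconds

# Figures from the max speculation HDD run above: 100 mixes in 872.993 seconds.
hdd_speculative = qmph(100, 872.993)
print(f"{hdd_speculative:.3f} query mixes per hour")
```

<p>Because the metric depends only on total elapsed time, the two very slow initial mixes weigh heavily on the result of a 100-mix run.</p>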



<p>Then the same with cold SSDs. First with no speculation:</p>

<blockquote>
 <code><pre>
 0:   5177.68 ms, total:   5302 ms
 1:   2570.16 ms, total:   2614 ms
 2:   1353.06 ms, total:   1391 ms
 3:   1957.63 ms, total:   1978 ms
 4:   1371.13 ms, total:   1386 ms
 5:   1765.55 ms, total:   1781 ms
 6:   1658.23 ms, total:   1673 ms
 7:   1273.87 ms, total:   1289 ms
 8:   1355.19 ms, total:   1380 ms
 9:   1152.78 ms, total:   1167 ms
10:   1787.91 ms, total:   1802 ms
...

90:   1116.25 ms, total:   1128 ms
91:    989.50 ms, total:   1001 ms
92:    833.24 ms, total:    844 ms
93:   1137.83 ms, total:   1150 ms
94:    969.47 ms, total:    982 ms
95:   1138.04 ms, total:   1149 ms
96:   1155.98 ms, total:   1168 ms
97:   1178.15 ms, total:   1193 ms
98:   1120.18 ms, total:   1132 ms
99:   1013.16 ms, total:   1025 ms

Scale factor:              2848260
Number of warmup runs:     0
Seed:                      808080
Number of query mix runs 
  (without warmups):       100 times
min/max Querymix runtime:  0.8201s / 5.1777s
Elapsed runtime:           127.555 seconds
QMpH:                      2822.321 query mixes per hour
</pre>
 </code>
</blockquote>


<p>The peak I/O is 45 MB/s, with average 28.3 MB/s; CPU average is 168%.</p>

<p>Now, SSDs with max speculation.</p>

<blockquote>
 <code><pre>
 0:  44670.34 ms, total:  44809 ms
 1:  18490.44 ms, total:  18548 ms
 2:   7306.12 ms, total:   7353 ms
 3:   9452.66 ms, total:   9485 ms
 4:   5648.56 ms, total:   5668 ms
 5:   5493.21 ms, total:   5511 ms
 6:   5951.48 ms, total:   5970 ms
 7:   3815.59 ms, total:   3834 ms
 8:   4560.71 ms, total:   4579 ms
 9:   3523.74 ms, total:   3543 ms
10:   4724.04 ms, total:   4741 ms
...

90:    673.53 ms, total:    685 ms
91:    534.62 ms, total:    545 ms
92:    730.81 ms, total:    742 ms
93:   1358.14 ms, total:   1370 ms
94:   1098.64 ms, total:   1110 ms
95:   1232.20 ms, total:   1243 ms
96:   1259.57 ms, total:   1273 ms
97:   1298.95 ms, total:   1310 ms
98:   1156.01 ms, total:   1166 ms
99:   1025.45 ms, total:   1034 ms

Scale factor:              2848260
Number of warmup runs:     0
Seed:                      808080
Number of query mix runs 
  (without warmups):       100 times
min/max Querymix runtime:  0.4725s / 44.6703s
Elapsed runtime:           269.323 seconds
QMpH:                      1336.683 query mixes per hour
</pre>
 </code>
</blockquote>


<p>The peak I/O is 339 MB/s, with average 192 MB/s; average CPU is 121%.</p>

<p>The above was measured with the read-ahead thread doing single-page reads. We repeated the test with merged reads and saw only small differences. The peak I/O was 353 MB/s, and the average 173 MB/s; average CPU was 113%.</p>

<p>We see that the start latency is quite a bit longer than without speculation and the CPU % is lower due to higher latency of individual I/O. The I/O rate is fair. We would expect more throughput, however. </p>

<p>We find that a supposedly better use of the API, doing single requests of up to 100 pages instead of consecutive requests of 1 page, does not make a lot of difference. The peak I/O is a bit higher; overall throughput is a bit lower.</p>
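<p>Merging reads here means turning runs of adjacent pages into single requests. A minimal sketch of that grouping in Python, with the cap of 100 pages taken from the text and everything else (function name, tuple layout) being illustrative assumptions:</p>

```python
def merge_reads(pages, max_run=100):
    """Group page numbers into (first_page, page_count) requests.

    Adjacent pages are merged into one request, up to max_run pages each.
    """
    requests = []
    for page in sorted(set(pages)):
        if requests:
            first, count = requests[-1]
            # Extend the current request if this page directly follows it.
            if page == first + count and max_run > count:
                requests[-1] = (first, count + 1)
                continue
        requests.append((page, 1))
    return requests

print(merge_reads([7, 3, 4, 5, 9, 10]))
```

<p>Each resulting tuple would become one read call instead of <code>page_count</code> separate ones; as noted above, in this setup the difference turned out to be small.</p>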



<p>We will have to retry these experiments with a better controller. We have at no point seen anything like the 50K 4KB random I/Os per second promised for the SSDs by the manufacturer. We know for a fact that the controller gives about 700 MB/s sequential read with <code>cat file &gt; /dev/null</code> and two drives busy. With 4 drives busy, this does not get better. The best 30-second stretch we saw in a multiuser BSBM warm-up was 590 MB/s, which is consistent with the <code>cat</code> to <code>/dev/null</code> figure. We will later test with 8 SSDs with better controllers. </p>

<p>Note that the average I/O and CPU are averages over 30-second measurement windows; thus for short-running tests, there is some error from the window during which the activity ended. </p>


<p>Let us now see if we can make a BSBM instance warm up from disk in a reasonable time. We run 16 users with max speculation. We note that after reading 7,500,000 buffers we are not entirely free of disk. The max speculation read-ahead filled the cache in 17 minutes, with an average of 58 MB/s. After the cache is filled, the system shifts to a more conservative policy on extent read-ahead; one which in fact never gets triggered with the BSBM <i>Explore</i> in steady state. The vectored read-ahead is kept on since this by itself does not read pages that are not needed. However, the vectored read-ahead does not run either, because the <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x1c9bca60">data</a> that is accessed in larger batches is already in memory. Thus there remains a trickle of an average 0.49 MB/s from disk. This keeps CPU around 350%. With SSDs, the trickle is about 1.5 MB/s and CPU is around 1300% in steady state. Thus SSDs give approximately triple the throughput in a situation where there is a tiny amount of continuous random disk access. The disk access in question is 80% for retrieving <a class="auto-href" href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1c05e280">RDF</a> literal strings, presumably on behalf of the <code>DESCRIBE</code> query in the mix. This query touches things no other query touches and does so one subject at a time, in a way that can neither be anticipated nor optimized.</p>
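<p>As a rough consistency check on the warm-up figures, the volume implied by the buffer count can be compared with the volume implied by the average read rate. The sketch assumes 8K pages as mentioned earlier, that MB means 10^6 bytes, and that the 7,500,000 buffers read roughly correspond to the cache that was filled; these are a reading of the numbers, not something the measurement reports directly.</p>

```python
PAGE_BYTES = 8 * 1024      # 8K pages
buffers_read = 7_500_000   # buffers read, from the text above
fill_seconds = 17 * 60     # time to fill the cache
avg_mb_per_s = 58          # assumption: MB means 10**6 bytes

# Volume implied by the page count versus volume implied by the read rate.
bytes_from_buffers = buffers_read * PAGE_BYTES
bytes_from_rate = avg_mb_per_s * 10**6 * fill_seconds

print(f"from buffer count: {bytes_from_buffers / 1e9:.1f} GB")
print(f"from read rate:    {bytes_from_rate / 1e9:.1f} GB")
print(f"ratio:             {bytes_from_rate / bytes_from_buffers:.2f}")
```

<p>The two estimates agree to within a few percent, so the 17 minute fill time and the 58 MB/s average are consistent with each other.</p>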

<p>The Virtuoso 7 column store will deal with this better because it is more space efficient overall. If we apply stream-compression to literals, these will go in under half the space, while quads will go in maybe one-quarter the space. Thus 3000 Mt all from memory should be possible with 72 GB RAM. 1000 Mt row-wise did fit in 72 GB RAM except for the random literals accessed by the <code>DESCRIBE</code>. This alone drops throughput to under a third of the memory-only throughput if using HDDs. SSDs, on the other hand, can largely neutralize this effect.</p>

 
<h2>Conclusions</h2>


<p>We have looked at the basics of I/O. SSDs have been found to be a readily available solution to I/O bottlenecks without need for reconfiguration or complex I/O policies. We have been able to get a decent read rate under conditions of server warm-up or shift of working set even with HDDs.</p>

<p>More advanced I/O matters will be covered with the column store. We note that the techniques discussed here apply identically to rows and columns.</p>

<p>As concerns BSBM, it seems appropriate to include a warm-up time. In practice, this means that the store simply must pre-read eagerly. This is not hard to do and can be quite useful.</p>


<h3>
<i>Benchmarks, Redux</i> Series</h3>
<ul>
<li> <a href="http://www.openlinksw.com/weblog/oerling/?id=1658" id="link-id0x1b4342b0">Benchmarks, Redux (part 1): On RDF Benchmarks</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x1d3e7388">Benchmarks, Redux (part 2): A Benchmarking Story</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1663" id="link-id0x153c7ba8">Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1665" id="link-id0x1da11d98">Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1667" id="link-id0x1d25d630">Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs</a>
</li>
<li>
 Benchmarks, Redux (part 6): BSBM and I/O, continued <i>(this post)</i>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1671" id="link-id0x1f1f5ee8">Benchmarks, Redux (part 7): What Does BSBM Explore Measure?</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1673" id="link-id0x1cd44938">Benchmarks, Redux (part 8): BSBM Explore and Update </a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1675" id="link-id0x1d51f848">Benchmarks, Redux (part 9): BSBM With Cluster</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1677" id="link-id0x13d333c0">Benchmarks, Redux (part 10): LOD2 and the Benchmark Process</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1678" id="link-id0x1e77a5e8">Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1ea1fb40">Benchmarks, Redux (part 12): Our Own BSBM Results Report</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1e7786c8">Benchmarks, Redux (part 13): BSBM BI Modifications </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1f8a37f8">Benchmarks, Redux (part 14): BSBM BI Mix </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1c69e018">Benchmarks, Redux (part 15): BSBM Test Driver Enhancements </a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-03-07#1668">
  <rss:title>Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs </rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-03-07T19:17:36Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">In the context of database benchmarks we cannot ignore I/O, as pretty much has been done so far by BSBM. There are two approaches: run twice or otherwise make sure one runs from memory and forget about I/O, or make rules and metrics for warm-up. We will see if the second is possible with BSBM. From this starting point, we look at various ways of scheduling I/O in Virtuoso using a 1000 Mt BSBM database on sets of each of HDDs (hard disk devices) and SSDs (solid-state storage devices). We will see that SSDs in this specific application can make a significant difference. In this test we have the same 4 stripes of a 1000 Mt BSBM database on each of two storage arrays. Storage Arrays Type Quantity Maker Size Speed Interface speed Controller Drive Cache RAID SSD 4 Crucial 128 GB N/A 6Gbit SATA RocketRaid 640 128 MB None HDD 4 Samsung 1000 GB 7200 RPM 3Gbit SATA Intel ICH on Supermicro motherboard 16 MB None We make sure that the files are not in OS cache by filling it with other big files, reading a total of 120 GB off SSDs with `cat file &gt; /dev/null`. The configuration files are as in the report on the 1000 Mt run. We note as significant that we have a few file descriptors for each stripe, and that read-ahead for each is handled by its own thread. Two different read-ahead schemes are used: With 6 Single, if a 2MB extent gets a second read within a given time after the first, the whole extent is scheduled for background read. With 7 Single, as an index search is vectored, we know a large number of values to fetch at one time and these values are sorted into an ascending sequence. Therefore, by looking at a node in an index tree, we can determine which sub-trees will be accessed and schedule these for read-ahead, skipping any that will not be accessed. 
In either model, a sequential scan touching more than a couple of consecutive index leaf pages triggers a read-ahead, to the end of the scanned range or to the next 3000 index leaves, whichever comes first. However, there are no sequential scans of significant size in BSBM. There are a few different possibilities for the physical I/O: Using a separate read system call for each page. There may be several open file descriptors on a file so that many such calls can proceed concurrently on different threads; the OS will order the operations. A thread finds it needs a page and reads it. Using Unix asynchronous I/O, aio.h, with the aio_* and lio_listio functions. Using single-read system calls for adjacent pages. In this way, the drive sees longer requests and should give better throughput. If there are short gaps in the sequence, the gaps are also read, wasting bandwidth but saving on latency. The two latter apply only to bulk I/O that are scheduled on background threads, one per independently-addressable device (HDD, SSD, or RAID-set). These bulk-reads operate on an elevator model, keeping a sorted queue of things to read or write and moving through this queue from start to end. At any time, the queue may get more work from other threads. There is a further choice when seeing single-page random requests. They can either go to the elevator or they can be done in place. Taking the elevator is presumably good for throughput but bad for latency. In general, the elevator should have a notion of fairness; these matters are discussed in the CWI collaborative scan paper. Here we do not have long queries, so we do not have to talk about elevator policies or scan sharing; there are no scans. We may touch on these questions later with the column store, the BSBM BI mix, and TPC-H. While we may know principles, I/O has always given us surprises; the only way to optimize this is to measure. 
The metric we try to optimize here is the time it takes for a multiuser BSBM run starting from cold cache to get to 1200% CPU. When running from memory, the CPU is around 1350% for the system in question. This depends on getting I/O throughput, which in turn depends on having a lot of speculative reading since the workload itself does not give any long stretches to read. The test driver is set at 16 clients, and the run continues for 2000 query mixes or until target throughput is reached. Target throughput is deemed reached after the first 20 second stretch with CPU at 1200% or higher. The meter is a stored procedure that records the CPU time, count of reads, cumulative elapsed time spent waiting for I/O, and other metrics. The code for this procedure (for 7 Single; this file will not work on Virtuoso 6 or earlier) is available here. The database space allocation gives each index a number of 2MB segments, each with 256 8K pages. When a page splits, the new page is allocated from the same extent if possible, or from a specific second extent which is designated as the overflow extent of this extent. This scheme provides for a sort of pseudo-locality within extents over random insert order. Thus there is a chance that pre-reading an extent will get key values in the same range a the ones on the page being requested in the first place. At least the pre-read pages will be from the same index tree. There are insertion orders that do not create good locality with this allocation scheme, though. In order to generally improve locality, one could shuffle pages of an all-dirty subtree before writing this out so as to have physical order match key order. We will look at some tricks in this vein with the column store. For the sake of simplicity we only run 7 Single with the 1000 Mt scale. The first experiment was with SSDs and the vectored read-ahead. The target throughput was reached after 280 seconds. The next test was with HDDs and extent read-ahead. 
One hour into the experiment, the CPU was about 70% after processing around 1000 query mixes. It might have been hours before HDD reads became rare enough for hitting 1200% CPU. The test was not worth continuing. The result with HDDs and vectored read-ahead would be worse since vectored read-ahead leads to smaller read-ahead batches and to less contiguous read patterns. The individual read times here, are over twice the individual read times with per-extent read-ahead. The fact that vectored read-ahead does not read potentially unneeded pages makes no difference. Hence this test is also not worth running to completion. There are other possibilities for improving HDD I/O. If only 2MB read requests are made, a transfer will be about 20 ms at a sequential transfer speed of 50 MB/s. Then seeking to the next 2MB extent will be a few ms, most often less than 20, so the HDD should give at least half the nominal throughput. We note that, when reading sequential 8K pages inside a single 2MB (256 page) extent, the seek latency is not 0 as one would expect but an extreme 5 ms. One would think that the drive would buffer a whole track, and a track would hold a large number of 2MB sections, but apparently this is not so. Therefore, now if we have a sequential read pattern that is more dense than 1 page out of 10, we read all the pages and just keep the ones we want. So now we set the read-ahead to merge reads that fall within 10 pages. This wastes bandwidth, but supposedly saves on latency. We will see. So we try, and we find that read-ahead does not account for most pages since it does not get triggered. Thus, we change the triggering condition to be the 2nd read to fall in the extent within 20 seconds of the first. The HDDs were in all cases 700% busy for 4 HDDs. But with the new setting we get longer requests, most often full extents, which gets a per-HDD transfer rate of about 5 MB/s. 
With the looser condition for starting read-ahead, 89% of all pages were read in a read-ahead batch. We see the I/O throughput decrease during the run because there are more single-page reads that do not trigger extent read-ahead. So HDDs have 1.7 concurrent operations pending, but the batch size drops, dropping the throughput. Thus with the best settings, the test with 2000 query mixes finishes in 46 minutes, and the CPU utilization is steadily increasing, hitting 392% for the last minute. In comparison, with SSDs and our worst read-ahead setting we got 1200% CPU in under 5 minutes from cold start. The I/O system can be further tuned; for example, by only reading full extents as long as the buffer pool is not full. In the next post we will measure some more. BSBM Note We look at query times with semi-warm cache, with CPU around 400%. We note that Q8-Q12 are especially bad. Q5 runs at about half speed. Q12 runs at under 1/10th speed. The relatively slowest queries appear to be single-instance lookups. Nothing short of the most aggressive speculative reading can help there. Neither query nor workload has any exploitable pattern. Therefore if an I/O component is to be included in a BSBM metric, the only way to score in this is to use speculative read to the maximum. Some of the queries take consecutive property values of a single instance. One could parallelize this pipeline, but this would be a one-off and would make sense only when reading from storage (whether HDD, SSD, or otherwise). Multithreading for single rows is not worth the overhead. A metric for BSBM warm-up is not interesting for database science, but may still be of practical interest in the specific case of RDF stores. Specially reading large chunks at startup time is good, so putting a section in BSBM that would force one to implement this would be a service to most end users. Measuring and reporting such I/O performance would favor space efficiency in general. 
Space efficiency is generally a good thing, especially at larger scales, so we can put an optional section in the report for warm-up. This is also good for comparing HDDs and SSDs, and for testing read-ahead, which is still something a database is expected to do. Implementors have it easy; just speculatively read everything. Looking at the BSBM fictional use case, anybody running such a portal would do this from RAM only, so it makes sense to define the primary metric as running from warm cache, in practice 100% from memory. Benchmarks, Redux Series Benchmarks, Redux (part 1): On RDF Benchmarks Benchmarks, Redux (part 2): A Benchmarking Story Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs (this post) Benchmarks, Redux (part 6): BSBM and I/O, continued Benchmarks, Redux (part 7): What Does BSBM Explore Measure? Benchmarks, Redux (part 8): BSBM Explore and Update Benchmarks, Redux (part 9): BSBM With Cluster Benchmarks, Redux (part 10): LOD2 and the Benchmark Process Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks Benchmarks, Redux (part 12): Our Own BSBM Results Report Benchmarks, Redux (part 13): BSBM BI Modifications Benchmarks, Redux (part 14): BSBM BI Mix Benchmarks, Redux (part 15): BSBM Test Driver Enhancements</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>In the context of database benchmarks we cannot ignore I/O, as pretty much has been done so far by <a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x1ea17348">BSBM</a>. </p>

<p>There are two approaches:</p> 

<ol>
<li>
  <p>run twice or otherwise make sure one runs from memory and forget about I/O, or</p>
</li>
<li>
  <p>make rules and metrics for warm-up.</p>
</li>
</ol>
<p>We will see if the second is possible with BSBM.</p>

<p>From this starting point, we look at various ways of scheduling I/O in <a class="auto-href" href="http://virtuoso.openlinksw.com" id="link-id0x125c4f90">Virtuoso</a> using a 1000 Mt BSBM database on a set of HDDs (hard disk drives) and a set of SSDs (solid-state drives). We will see that SSDs in this specific application can make a significant difference. </p>


<p>In this test we have the same 4 stripes of a 1000 Mt BSBM database on each of two storage arrays.</p>

<table border="1" cellspacing="2" cellpadding="2" align="center" width="90%">
	<tr>
		<th colspan="9" align="center">Storage Arrays</th>
	</tr>
	<tr>
		<th align="center"> Type </th>
		<th align="center"> Quantity </th>
		<th align="center"> Maker </th>
		<th align="center"> Size </th>
		<th align="center"> Speed </th>
		<th align="center"> Interface speed </th>
		<th align="center"> Controller </th>
		<th align="center"> Drive <a class="auto-href" href="http://dbpedia.org/resource/Cache" id="link-id0x1cab1358">Cache</a> </th>
		<th align="center"> RAID </th>
	</tr>
	<tr>
		<td align="center"> SSD </td>
		<td align="center"> 4 </td>
		<td align="center"> Crucial </td>
		<td align="center"> 128 GB </td>
		<td align="center"> N/A </td>
		<td align="center"> 6Gbit SATA </td>
		<td align="center"> RocketRaid 640 </td>
		<td align="center"> 128 MB </td>
		<td align="center"> None </td>
	</tr>
	<tr>
		<td align="center"> HDD </td>
		<td align="center"> 4 </td>
		<td align="center"> Samsung </td>
		<td align="center"> 1000 GB </td>
		<td align="center"> 7200 RPM </td>
		<td align="center"> 3Gbit SATA </td>
		<td align="center"> <a class="auto-href" href="http://dbpedia.org/resource/Intel_Corporation" id="link-id0x1ab6edd8">Intel</a> ICH on Supermicro motherboard </td>
		<td align="center"> 16 MB </td>
		<td align="center"> None </td>
	</tr>
</table>


<p>We make sure that the files are not in OS cache by filling it with other big files, reading a total of 120 GB off SSDs with <code>cat file &gt; /dev/null</code>. </p>

<p>The configuration files are as in the report on the 1000 Mt run. We note as significant that we have a few file descriptors for each stripe, and that read-ahead for each is handled by its own thread.</p>

<p>Two different read-ahead schemes are used: </p>
<ul>
 <li>
  <p>With 6 Single, if a 2MB extent gets a second read within a given time after the first, the whole extent is scheduled for background read.</p>
 </li>
<li>
  <p>With 7 Single, as an index search is vectored, we know a large number of values to fetch at one time and these values are sorted into an ascending sequence. Therefore, by looking at a node in an index tree, we can determine which sub-trees will be accessed and schedule these for read-ahead, skipping any that will not be accessed.</p>
</li>
</ul>
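
<p>The vectored scheme can be illustrated with a small sketch. This is not Virtuoso's code; the node layout (a plain list of separator keys) and the function name <code>subtrees_to_prefetch</code> are assumptions made for illustration. Given the sorted batch of search keys, a binary search over one node's separators tells which child subtrees will be visited, and only those would be scheduled for read-ahead.</p>

```python
from bisect import bisect_right


def subtrees_to_prefetch(separators, sorted_keys):
    """Return indexes of the child subtrees that a batch of lookups will touch.

    separators: ascending separator keys of one index node; child i holds
                keys from separators[i-1] up to but not including separators[i].
    sorted_keys: the vectored batch of search keys, in ascending order.
    """
    touched = []
    for key in sorted_keys:
        child = bisect_right(separators, key)
        # Keys are ascending, so child indexes never decrease; skip repeats.
        if not touched or touched[-1] != child:
            touched.append(child)
    return touched


node_separators = [100, 200, 300, 400]   # hypothetical node with 5 children
batch = [5, 17, 120, 130, 410, 455]      # hypothetical sorted lookup keys
print(subtrees_to_prefetch(node_separators, batch))
```

<p>In this example, children 2 and 3 are never visited, so a read-ahead driven by this list would skip them.</p>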

<p>In either model, a sequential scan touching more than a couple of consecutive index leaf pages triggers a read-ahead, to the end of the scanned range or to the next 3000 index leaves, whichever comes first. However, there are no sequential scans of significant size in BSBM.</p>

<p>There are a few different possibilities for the physical I/O: </p>

<ol>
<li>
  <p>Using a separate read system call for each page. A thread finds it needs a page and reads it. There may be several open file descriptors on a file so that many such calls can proceed concurrently on different threads; the OS will order the operations.</p>
</li>
<li>
  <p>Using Unix asynchronous I/O, <code>aio.h</code>, with the <code>aio_*</code> and <code>lio_listio</code> functions.</p>
</li>
<li>
  <p>Using single-read system calls for adjacent pages. In this way, the drive sees longer requests and should give better throughput. If there are short gaps in the sequence, the gaps are also read, wasting bandwidth but saving on latency.</p>
</li>
</ol>

<p>The two latter apply only to bulk I/O that are scheduled on background threads, one per independently-addressable device (HDD, SSD, or RAID-set).  These bulk-reads operate on an elevator model, keeping a sorted queue of things to read or write and moving through this queue from start to end. At any time, the queue may get more work from other threads.</p>
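
<p>As an illustration of the elevator model, here is a minimal sketch of a sorted request queue that is served in ascending sweeps while other threads keep adding work. The class and method names are hypothetical and locking is left out; it is meant to show the ordering only, not how Virtuoso implements it.</p>

```python
from bisect import insort


class ElevatorQueue:
    """Sorted queue of page numbers served in one ascending sweep at a time."""

    def __init__(self):
        self.pending = []      # sorted page numbers waiting for I/O
        self.position = -1     # last page number served in the current sweep

    def add(self, page_no):
        """Queue a page; it is kept in ascending order."""
        insort(self.pending, page_no)

    def next_page(self):
        """Return the next page at or after the current position, wrapping to
        the start of the queue when the sweep reaches the end."""
        if not self.pending:
            return None
        for i, page_no in enumerate(self.pending):
            if page_no >= self.position:
                self.position = page_no
                return self.pending.pop(i)
        # Nothing left ahead of the current position: start a new sweep.
        self.position = self.pending[0]
        return self.pending.pop(0)


queue = ElevatorQueue()
for page in (40, 10, 30):
    queue.add(page)
served = [queue.next_page()]
queue.add(20)                  # arrives ahead of the sweep: served this pass
queue.add(5)                   # arrives behind the sweep: waits for the next pass
while queue.pending:
    served.append(queue.next_page())
print(served)
```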

<p>There is a further choice when seeing single-page random requests. They can either go to the elevator or they can be done in place. Taking the elevator is presumably good for throughput but bad for latency. In general, the elevator should have a notion of fairness; these matters are discussed in the <a href="http://www.cwi.nl/" id="link-id0x1f62abb8">CWI collaborative scan paper</a>. Here we do not have long queries, so we do not have to talk about elevator policies or scan sharing; there are no scans. We may touch on these questions later with the column store, the BSBM BI mix, and <a class="auto-href" href="http://www.tpc.org/" id="link-id0x1bfb17c0">TPC</a>-<a class="auto-href" href="http://dbpedia.org/resource/TPC-H" id="link-id0x1e76bfc8">H</a>.</p>

<p>While we may know principles, I/O has always given us surprises; the only way to optimize this is to measure.</p>

<p>The metric we try to optimize here is the time it takes for a multiuser BSBM run starting from cold cache to get to 1200% <a class="auto-href" href="http://dbpedia.org/resource/Central_processing_unit" id="link-id0x1d7b1d10">CPU</a>. When running from memory, the CPU is around 1350% for the system in question. </p>

<p>This depends on getting I/O throughput, which in turn depends on having a lot of speculative reading since the workload itself does not give any long stretches to read. </p>

<p>The test driver is set at 16 clients, and the run continues for 2000 query mixes or until target throughput is reached. Target throughput is deemed reached after the first 20 second stretch with CPU at 1200% or higher.</p>
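
<p>The stopping rule can be sketched as follows. The sampling interval and the sample values are made up for illustration, and the function name is hypothetical; the actual meter is the stored procedure mentioned below. The function returns the first moment at which CPU has stayed at or above the threshold for the required stretch.</p>

```python
def time_target_reached(samples, threshold=1200.0, stretch_seconds=20.0):
    """Return the time at which CPU has first stayed at or above threshold
    for stretch_seconds, or None if that never happens.

    samples: (elapsed_seconds, cpu_percent) pairs in ascending time order.
    """
    stretch_start = None
    for elapsed, cpu in samples:
        if cpu >= threshold:
            if stretch_start is None:
                stretch_start = elapsed
            if elapsed - stretch_start >= stretch_seconds:
                return elapsed
        else:
            stretch_start = None   # the stretch is broken; start over
    return None


# Hypothetical samples taken every 5 seconds.
cpu_samples = [(0, 300), (5, 900), (10, 1250), (15, 1100), (20, 1210),
               (25, 1300), (30, 1280), (35, 1290), (40, 1305)]
print(time_target_reached(cpu_samples))
```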

<p>The meter is a stored procedure that records the CPU time, count of reads, cumulative elapsed time spent waiting for I/O, and other metrics. The code for this procedure (for 7 Single; this file will not work on Virtuoso 6 or earlier) is <a href="http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/BenchmarksReduxSupportingFiles/ldmeter.sql" id="link-id0x1b5adb08">available here</a>. </p>


<p>The database space allocation gives each index a number of 2MB extents, each with 256 8K pages. When a page splits, the new page is allocated from the same extent if possible, or from a specific second extent which is designated as the overflow extent of this extent. This scheme provides for a sort of pseudo-locality within extents over random insert order. Thus there is a chance that pre-reading an extent will get key values in the same range as the ones on the page being requested in the first place. At least the pre-read pages will be from the same index tree. There are insertion orders that do not create good locality with this allocation scheme, though. In order to generally improve locality, one could shuffle pages of an all-dirty subtree before writing this out so as to have physical order match key order. We will look at some tricks in this vein with the column store.</p>
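
<p>The page and extent sizes give the 256 figure directly. The sketch below also shows how a page number could be mapped to its extent, under the assumption (made here for illustration only) that pages are numbered consecutively and extents are aligned on multiples of 256 pages.</p>

```python
PAGE_BYTES = 8 * 1024
EXTENT_BYTES = 2 * 1024 * 1024
PAGES_PER_EXTENT = EXTENT_BYTES // PAGE_BYTES


def extent_of(page_no):
    """Extent number that holds a given page number (assumed aligned layout)."""
    return page_no // PAGES_PER_EXTENT


def extent_page_range(page_no):
    """First and last page numbers of the extent around page_no."""
    first = extent_of(page_no) * PAGES_PER_EXTENT
    return first, first + PAGES_PER_EXTENT - 1


print(PAGES_PER_EXTENT)
print(extent_of(1000), extent_page_range(1000))
```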

<p>For the sake of simplicity we only run 7 Single with the 1000 Mt scale.</p>


<p>The first experiment was with SSDs and the vectored read-ahead.  The target throughput was reached after 280 seconds. </p>

<p>The next test was with HDDs and extent read-ahead. One hour into the experiment, the CPU was about 70% after processing around 1000 query mixes. It might have been hours before HDD reads became rare enough for hitting 1200% CPU. The test was not worth continuing.</p>

<p>The result with HDDs and vectored read-ahead would be worse since vectored read-ahead leads to smaller read-ahead batches and to less contiguous read patterns. The individual read times here are over twice the individual read times with per-extent read-ahead. The fact that vectored read-ahead does not read potentially unneeded pages makes no difference. Hence this test is also not worth running to completion.</p>

<p>There are other possibilities for improving HDD I/O. If only 2MB read requests are made, a transfer will be about 40 ms at a sequential transfer speed of 50 MB/s. Then seeking to the next 2MB extent will be a few ms, most often less than 20, so the HDD should give at least half the nominal throughput.</p>
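
<p>The estimate is simple enough to compute. In this sketch, which assumes one seek per request and nothing else in the way, a 2 MB request takes 40 ms at 50 MB/s and 20 ms at 100 MB/s. With a 20 ms seek, the drive would then keep about two-thirds or one half of its nominal throughput, respectively. The function names are illustrative.</p>

```python
def transfer_ms(request_mb, sequential_mb_per_s):
    """Milliseconds needed to transfer one request at the sequential rate."""
    return 1000.0 * request_mb / sequential_mb_per_s


def throughput_fraction(request_mb, sequential_mb_per_s, seek_ms):
    """Fraction of nominal throughput kept when each request costs one seek."""
    t = transfer_ms(request_mb, sequential_mb_per_s)
    return t / (t + seek_ms)


for rate in (50.0, 100.0):
    t = transfer_ms(2.0, rate)
    frac = throughput_fraction(2.0, rate, seek_ms=20.0)
    print(f"{rate:.0f} MB/s: transfer {t:.0f} ms, fraction {frac:.2f}")
```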

<p>We note that, when reading sequential 8K pages inside a single 2MB (256 page) extent, the seek latency is not 0 as one would expect but an extreme 5 ms. One would think that the drive would buffer a whole track, and a track would hold a large number of 2MB sections, but apparently this is not so. </p>

<p>Therefore, if we now have a sequential read pattern that is denser than 1 page out of 10, we read all the pages and just keep the ones we want.</p>

<p>So now we set the read-ahead to merge reads that fall within 10 pages. This wastes bandwidth, but supposedly saves on latency. We will see. </p>
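
<p>A minimal sketch of such merging, assuming that "within 10 pages" means that two consecutive wanted pages are at most 10 page numbers apart. The function name and the exact gap rule are assumptions for illustration. Pages that fall in the gaps are read and then thrown away.</p>

```python
def merge_reads(sorted_pages, max_gap=10):
    """Coalesce ascending page numbers into (first, last) read ranges.

    Two consecutive wanted pages go into the same range when they are at
    most max_gap pages apart; pages in the gap are read and then discarded.
    """
    ranges = []
    for page in sorted_pages:
        if ranges and max_gap >= page - ranges[-1][1]:
            ranges[-1][1] = page          # extend the current range
        else:
            ranges.append([page, page])   # start a new range
    return [tuple(r) for r in ranges]


wanted = [3, 5, 12, 40, 41, 60]           # hypothetical wanted page numbers
print(merge_reads(wanted))
```

<p>Here six wanted pages become three requests; the first request reads 10 pages in order to get 3 of them.</p>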

<p>So we try, and we find that read-ahead does not account for most pages since it does not get triggered.  Thus, we change the triggering condition to be the 2nd read to fall in the extent within 20 seconds of the first.</p>
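
<p>The triggering condition can be sketched like this. The class is hypothetical and keeps only the time of the first read per extent; it assumes extents of 256 pages aligned on multiples of 256 page numbers. A second read in the same extent within the window returns the extent to be scheduled for background read.</p>

```python
PAGES_PER_EXTENT = 256    # 2 MB extent of 8 KB pages


class ExtentReadAheadTrigger:
    """Schedule a whole extent when it gets a second read soon after the first."""

    def __init__(self, window_seconds=20.0):
        self.window_seconds = window_seconds
        self.first_read_at = {}    # extent number -> time of its first read

    def on_page_read(self, page_no, now):
        """Record a single-page read; return the extent number to pre-read,
        or None if the trigger condition is not met."""
        extent = page_no // PAGES_PER_EXTENT
        first = self.first_read_at.get(extent)
        if first is not None and self.window_seconds >= now - first:
            del self.first_read_at[extent]
            return extent
        # First read, or the earlier read is too old: restart the window.
        self.first_read_at[extent] = now
        return None


trigger = ExtentReadAheadTrigger(window_seconds=20.0)
events = [(10, 0.0), (200, 5.0), (300, 6.0), (310, 40.0), (320, 50.0)]
decisions = [trigger.on_page_read(page, now) for page, now in events]
print(decisions)
```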

<p>The HDDs were in all cases 700% busy for 4 HDDs. But with the new setting we get longer requests, most often full extents, which gets a per-HDD transfer rate of about 5 MB/s. With the looser condition for starting read-ahead, 89% of all pages were read in a read-ahead batch. We see the I/O throughput decrease during the run because there are more single-page reads that do not trigger extent read-ahead. So HDDs have 1.7 concurrent operations pending, but the batch size drops, dropping the throughput.</p>
<p>Thus with the best settings, the test with 2000 query mixes finishes in 46 minutes, and the CPU utilization is steadily increasing, hitting 392% for the last minute. In comparison, with SSDs and our worst read-ahead setting we got 1200% CPU in under 5 minutes from cold start. The I/O system can be further tuned; for example, by only reading full extents as long as the buffer pool is not full. In the next post we will measure some more. </p>
<h3>BSBM Note </h3>

<p>We look at query times with semi-warm cache, with CPU around 400%. We note that Q8-Q12 are especially bad. Q5 runs at about half speed. Q12 runs at under 1/10th speed. The relatively slowest queries appear to be single-instance lookups. Nothing short of the most aggressive speculative reading can help there. Neither query nor workload has any exploitable pattern. Therefore if an I/O component is to be included in a BSBM metric, the only way to score in this is to use speculative read to the maximum.</p>

<p>Some of the queries take consecutive property values of a single instance. One could parallelize this pipeline, but this would be a one-off and would make sense only when reading from storage (whether HDD, SSD, or otherwise). Multithreading for single rows is not worth the overhead.</p>

<p>A metric for BSBM warm-up is not interesting for database science, but may still be of practical interest in the specific case of <a class="auto-href" href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1d6371d8">RDF</a> stores. In particular, reading large chunks at startup time is good, so putting a section in BSBM that would force one to implement this would be a service to most end users. Measuring and reporting such I/O performance would favor space efficiency in general. Space efficiency is generally a good thing, especially at larger scales, so we can put an optional section in the report for warm-up. This is also good for comparing HDDs and SSDs, and for testing read-ahead, which is still something a database is expected to do. Implementors have it easy; just speculatively read everything.</p>

<p>Looking at the BSBM fictional use case, anybody running such a portal would do this from RAM only, so it makes sense to define the primary metric as running from warm cache, in practice 100% from memory.</p>


<h3>
<i>Benchmarks, Redux</i> Series</h3>
<ul>
<li> <a href="http://www.openlinksw.com/weblog/oerling/?id=1658" id="link-id0x1ecb2af0">Benchmarks, Redux (part 1): On RDF Benchmarks</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x19d05678">Benchmarks, Redux (part 2): A Benchmarking Story</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1663" id="link-id0x1d542328">Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1665" id="link-id0x13947e08">Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire</a>
</li>
<li>
Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs <i>(this post)</i>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1669" id="link-id0x1a7f6b30">Benchmarks, Redux (part 6): BSBM and I/O, continued</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1671" id="link-id0x1d67dd40">Benchmarks, Redux (part 7): What Does BSBM Explore Measure?</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1673" id="link-id0x1ebcee68">Benchmarks, Redux (part 8): BSBM Explore and Update </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1675" id="link-id0x1a855ba0">Benchmarks, Redux (part 9): BSBM With Cluster</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1677" id="link-id0x1b081e70">Benchmarks, Redux (part 10): LOD2 and the Benchmark Process</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1678" id="link-id0x1d7a7940">Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1d7e2cd0">Benchmarks, Redux (part 12): Our Own BSBM Results Report</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1e375338">Benchmarks, Redux (part 13): BSBM BI Modifications </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1d199728">Benchmarks, Redux (part 14): BSBM BI Mix </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1e808818">Benchmarks, Redux (part 15): BSBM Test Driver Enhancements </a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-03-04#1666">
  <rss:title>Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-03-04T20:28:28Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Below is a questionnaire I sent to the BSBM participants in order to get tuning instructions for the runs we were planning. I have filled in the answers for Virtuoso, here. This can be a checklist for pretty much any RDF database tuning. Threading - What settings should be used (e.g., for query parallelization, I/O parallelization [e.g., prefetch, flush of dirty], thread pools [e,.g. web server], any other thread related)? We will run with 8 and 32 cores, so if there are settings controlling number of read/write (R/W) locks or mutexes or such for serializing diverse things, these should be set accordingly to minimize contention. The following three settings are all in the [Parameters] section of the virtuoso.ini file. AsyncQueueMaxThreads controls the size of a pool of extra threads that can be used for query parallelization. This should be set to either 1.5 * the number of cores or 1.5 * the number of core threads; see which works better. ThreadsPerQuery is the maximum number of threads a single query will take. This should be set to either the number of cores or the number of core threads; see which works better. IndexTreeMaps is the number of mutexes over which control for buffering an index tree is split. This can generally be left at default (256 in normal operation; valid settings are powers of 2 from 2 to 1024), but setting to 64, 128, or 512 may be beneficial. A low number will lead to frequent contention; upwards of 64 will have little contention. We have sometimes seen a multiuser workload go 10% faster when setting this to 64 (down from 256), which seems counter-intuitive. This may be a cache artifact. In the [HTTPServer] section of the virtuoso.ini file, the ServerThreads setting is the number of web server threads, i.e., the maximum number of concurrent SPARQL protocol requests. 
Having a value larger than the number of concurrent clients is OK; for large numbers of concurrent clients a lower value may be better, which will result in requests waiting for a thread to be available. Note — The [HTTPServer] ServerThreads are taken from the total pool made available by the [Parameters] ServerThreads. Thus, the [Parameters] ServerThreads should always be at least as large as (and is best set greater than) the [HTTPServer] ServerThreads, and if using the closed-source Commercial Version, [Parameters] ServerThreads cannot exceed the licensed thread count. File layout - Are there settings for striping over multiple devices? Settings for other file access parallelism? Settings for SSDs (e.g., SSD based cache of hot set of larger db files on disk)? The target config is for 4 independent disks and 4 independent SSDs. If you depend on RAID, are there settings for this? If you need RAID to be set up, please provide the settings/script for doing this with 4 SSDs on Linux (RH and Debian). This will be software RAID, as we find the hardware RAID to be much worse than an independent disk setup on the system in question. It is best to stripe database files over all available disks, and to not use RAID. If RAID is desired, then stripe database files across many RAID sets. Use the segment declaration in the virtuoso.ini file. It is very important to give each independently seekable device its own I/O queue thread. See the documentation on the TPC-C sample for examples. in the [Parameters] section of the virtuoso.ini file, set FDsPerFile to be (the number of concurrent threads * 1.5) ÷ the number of distinct database files. There are no SSD specific settings. Loading - How many parallel streams work best? We are looking for non-transactional bulk load, with no inference materialization. For partitioned cluster settings, do we divide the load streams over server processes? Use one stream per core (not per core thread). 
In the case of a cluster, divide load streams evenly across all processes. The total number of streams on a cluster can equal the total number of cores; adjust up or down depending on what is observed. Use the built-in bulk load facility, i.e., ld_dir (&#39;&lt;source-filename-or-directory&gt;&#39;, &#39;&lt;file name pattern&gt;&#39;, &#39;&lt;destination graph iri&gt;&#39;); For example, SQL&gt; ld_dir (&#39;/path/to/files&#39;, &#39;*.n3&#39;, &#39;http://dbpedia.org&#39;); Then do a rdf_loader_run () on enough connections. For example, you can use the shell command isql rdf_loader_run () &amp; to start one in a background isql process. When starting background load commands from the shell, you can use the shell wait command to wait for completion. If starting from isql, use the wait_for_children; command (see isql documentation for details). See the BSBM disclosure report for an example load script. What command should be used after non-transactional bulk load, to ensure a consistent persistent state on disk, like a log checkpoint or similar? Load and checkpoint will be timed separately, load being CPU-bound and checkpoint being I/O-bound. No roll-forward log or similar is required; the load does not have to recover if it fails before the checkpoint. Execute CHECKPOINT; through a SQL client, e.g., isql. This is not a SPARQL statement and cannot be executed over the SPARQL protocol. What settings should be used for trickle load of small triple sets into a pre-existing graph? This should be as transactional as supported; at least there should be a roll forward log, unlike the case for the bulk load. No special settings are needed for load testing; defaults will produce transactional behavior with a roll forward log. 
Default transaction isolation is REPEATABLE READ, but this may be altered via SQL session settings or at Virtuoso server start-up through the [Parameters] section of the virtuoso.ini file, with DefaultIsolation = 4 Transaction isolation cannot be set over the SPARQL protocol. NOTE: When testing full CRUD operations, other isolation settings may be preferable, due to ACID considerations. See answer #12, below, and detailed discussion in part 8 of this series, BSBM Explore and Update. What settings control allocation of memory for database caching? We will be running mostly from memory, so we need to make sure that there is enough memory configured. In the [Parameters] section of the virtuoso.ini file, NumberOfBuffers controls the amount of RAM used by Virtuoso to cache database files. One buffer caches an 8KB database page. In practice, count 10KB of memory per page. If &quot;swappiness&quot; on Linux is low (e.g., 2), two-thirds or more of physical memory can be used for database buffers. If swapping occurs, decrease the setting. What command gives status on memory allocation (e.g., number of buffers, number of dirty buffers, etc.) so that we can verify that things are indeed in server memory and not, for example, being served from OS disk cache. If the cached format is different from the disk layout (e.g., decompression after disk read), is there a command for space statistics for database cache? In an isql session, execute STATUS ( ? ? ); The second result paragraph gives counts of total, used, and dirty buffers. If used buffers is steady and less than total, and if the disk read count on the line below does not increase, the system is running from memory. The cached format is the same as the disk based format. What command gives information on disk allocation for different things? We are looking for the total size of allocated database pages for quads (including table, indices, anything else associated with quads) and dictionaries for literals, IRI names, etc. 
If there is a text index on literals, what command gives space stats for this? We count used pages, excluding any preallocated unused pages or other gaps. There is one number for quads and another for the dictionaries or other such structures, optionally a third for text index. Execute on an isql session: CHECKPOINT; SELECT TOP 20 * FROM sys_index_space_stats ORDER BY iss_pages DESC; The iss_pages column is the total pages for each index, including blob pages. Pages are 8KB. Only used pages are reported; gaps and unused pages are not counted. The rows pertaining to RDF_QUAD are for quads; RDF_IRI, RDF_PREFIX, RO_START, RDF_OBJ are for dictionaries; RDF_OBJ_RO_FLAGS_WORDS and VTLOG_DB_DBA_RDF_OBJ are for the text index. If there is a choice between triples and quads, we will run with quads. How do we ascertain that the run is with quads? How do we find out the index scheme? Should we use an alternate index scheme? Most of the data will be in a single big graph. The default scheme uses quads. The default index layout is PSOG, POGS, GS, SP, OP. To see the current index scheme, use an isql session to execute STATISTICS DB.DBA.RDF_QUAD; For partitioned cluster settings, are there partitioning-related settings to control even distribution of data between partitions? For example, is there a way to set partitioning by S or O depending on which is first in key order for each index? The default partitioning settings are good, i.e., partitioning is on O or S, whichever is first in key order. For partitioned clusters, are there settings to control message batching or similar? What are the statistics available for checking interconnect operation, e.g. message counts, latencies, total aggregate throughput of interconnect? In the [Cluster] section of the cluster.ini file, ReqBatchSize is the number of query states dispatched between cluster nodes per message round trip. This may be increased from the default of 10000 to 50000 or so if this is seen to be useful. 
To change this on the fly, the following can be issued through an isql session: cl_exec ( &#39; __dbf_set (&#39;&#39;cl_request_batch_size&#39;&#39;, 50000) &#39; ); The commands below may be executed through an isql session to get a summary of CPU and message traffic for the whole cluster or process-by-process, respectively. The documentation details the fields. STATUS (&#39;cluster&#39;) ;; whole cluster STATUS (&#39;cluster_d&#39;) ;; process-by-process Other settings - Are there settings for limiting query planning, when appropriate? For example, the BSBM Explore mix has a large component of unnecessary query optimizer time, since the queries themselves access almost no data. Any other relevant settings? For BSBM, needless query optimization should be capped at Virtuoso server start-up through the [Parameters] section of the virtuoso.ini, with StopCompilerWhenXOverRun = 1 When testing full CRUD operations (not simply CREATE, i.e., load, as discussed in #5, above), it is essential to make queries run with transaction isolation of READ COMMITTED, to remove most lock contention. Transaction isolation cannot be adjusted via SPARQL. This can be changed through SQL session settings, or at Virtuoso server start-up through the [Parameters] section of the virtuoso.ini file, with DefaultIsolation = 2 Benchmarks, Redux Series Benchmarks, Redux (part 1): On RDF Benchmarks Benchmarks, Redux (part 2): A Benchmarking Story Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire (this post) Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs Benchmarks, Redux (part 6): BSBM and I/O, continued Benchmarks, Redux (part 7): What Does BSBM Explore Measure? 
Benchmarks, Redux (part 8): BSBM Explore and Update Benchmarks, Redux (part 9): BSBM With Cluster Benchmarks, Redux (part 10): LOD2 and the Benchmark Process Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks Benchmarks, Redux (part 12): Our Own BSBM Results Report Benchmarks, Redux (part 13): BSBM BI Modifications Benchmarks, Redux (part 14): BSBM BI Mix Benchmarks, Redux (part 15): BSBM Test Driver Enhancements</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Below is a questionnaire I sent to the <a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x17f62428">BSBM</a> participants in order to get tuning instructions for the runs we were planning. I have filled in the answers for <a class="auto-href" href="http://virtuoso.openlinksw.com" id="link-id0x1d48ed28">Virtuoso</a>, here. This can be a checklist for pretty much any <a class="auto-href" href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1e11b228">RDF</a> database tuning.</p>


<ol>
<li>
<p>
    <b>Threading - What settings should be used (e.g., for query parallelization, I/O parallelization [e.g., prefetch, flush of dirty], thread pools [e.g., web server], any other thread related)? We will run with 8 and 32 cores, so if there are settings controlling number of read/write (R/W) locks or mutexes or such for serializing diverse things, these should be set accordingly to minimize contention.</b>
  </p>

<p>The following three settings are all <a href="http://docs.openlinksw.com/virtuoso/databaseadmsrv.html#ini_Parameters" id="link-id0x1ed4fe10">in the <code>[Parameters]</code> section of the <code>virtuoso.ini</code> file</a>. </p>

<ul>
<li>
      <p>
     <b><code>AsyncQueueMaxThreads</code>
     </b> controls the size of a pool of extra threads that can be used for query parallelization. This should be set to either <b>1.5 * the number of cores</b> or <b>1.5 * the number of core threads</b>; see which works better.</p>
    </li>

<li>
      <p>
     <b><code>ThreadsPerQuery</code>
     </b> is the maximum number of threads a single query will take. This should be set to either <b>the number of cores</b> or <b>the number of core threads</b>; see which works better. </p>
    </li>

<li>
      <p>
     <b><code>IndexTreeMaps</code>
     </b> is the number of mutexes over which control for buffering an index tree is split. This can generally be left at default (<b>256</b> in normal operation; valid settings are powers of 2 from 2 to 1024), but setting to <b>64, 128, or 512</b> may be beneficial.</p>

<p>A low number will lead to frequent contention; upwards of 64 will have little contention. We have sometimes seen a multiuser workload go 10% faster when setting this to 64 (down from 256), which seems counter-intuitive. This may be a <a class="auto-href" href="http://dbpedia.org/resource/Cache" id="link-id0x1e12b618">cache</a> artifact.</p>
    </li>
</ul>

  <p>
    <a href="http://docs.openlinksw.com/virtuoso/databaseadmsrv.html#ini_HTTPServer" id="link-id0x1f8960a0">In the <code>[HTTPServer]</code> section of the <code>virtuoso.ini</code> file</a>, the <b><code>ServerThreads</code></b> setting is the number of web server threads, i.e., the maximum number of concurrent <a class="auto-href" href="http://www.w3.org/TR/rdf-sparql-protocol/" id="link-id0x17e4d690">SPARQL protocol</a> requests. Having a value larger than the number of concurrent clients is OK; for large numbers of concurrent clients a lower value may be better, which will result in requests waiting for a thread to be available.</p>
<p>Note — The <code>[HTTPServer] ServerThreads</code> are taken from the total pool made available by the <code>[Parameters] ServerThreads</code>. Thus, the <code>[Parameters] ServerThreads</code> should always be at least as large as (and is best set greater than) the <code>[HTTPServer] ServerThreads</code>, and if using the closed-source Commercial Version, <code>[Parameters] ServerThreads</code> cannot exceed the licensed thread count. </p>
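<p>As an illustration only, the settings above might be combined as follows for a hypothetical machine with 8 cores and 16 core threads. The specific values, including both <code>ServerThreads</code> figures, are assumptions chosen to satisfy the rules of thumb above; they are not measured recommendations.</p>
<blockquote>
<pre>
[Parameters]
; total thread pool; must be at least as large as [HTTPServer] ServerThreads
ServerThreads        = 100
; 1.5 * 8 cores (the alternative to try is 1.5 * 16 core threads = 24)
AsyncQueueMaxThreads = 12
; number of cores (the alternative to try is 16, the number of core threads)
ThreadsPerQuery      = 8
; default; 64, 128, or 512 may also be worth trying
IndexTreeMaps        = 256

[HTTPServer]
; maximum number of concurrent SPARQL protocol requests
ServerThreads        = 32
</pre>
</blockquote>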
</li>


<li>
<p>
    <b>File layout - Are there settings for striping over multiple devices? Settings for other file access parallelism? Settings for SSDs (e.g., SSD based cache of hot set of larger db files on disk)? The target config is for 4 independent disks and 4 independent SSDs. If you depend on RAID, are there settings for this? If you need RAID to be set up, please provide the settings/script for doing this with 4 SSDs on Linux (RH and Debian). This will be software RAID, as we find the hardware RAID to be much worse than an independent disk setup on the system in question.</b>
  </p>

<p>It is best to stripe database files over all available disks, and to not use RAID. If RAID is desired, then stripe database files across many RAID sets. Use the <code>segment</code> declaration in the <code>virtuoso.ini</code> file. It is very important to give each independently seekable device its own I/O queue thread. See the documentation on the <a class="auto-href" href="http://www.tpc.org/" id="link-id0x1e9a6bc0">TPC</a>-C sample for examples. </p>

<p> <a href="http://docs.openlinksw.com/virtuoso/databaseadmsrv.html#ini_Parameters" id="link-id0x1f893f48">In the <code>[Parameters]</code> section of the <code>virtuoso.ini</code> file</a>, set <code>FDsPerFile</code> to be <code>(the number of concurrent threads * 1.5) ÷ the number of distinct database files</code>.</p>
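<p>A minimal sketch of such a layout for 4 independent disks might look like the following. The mount points, file names, segment size, and queue names are hypothetical, and the <code>[Striping]</code> syntax is given as I understand it from the Virtuoso documentation, so it should be checked against the documentation for the version in use. The <code>FDsPerFile</code> value assumes 16 concurrent threads and 4 database files: (16 * 1.5) ÷ 4 = 6.</p>
<blockquote>
<pre>
[Database]
Striping   = 1

[Striping]
; one stripe file per disk, each assigned to its own I/O queue (q1 to q4)
Segment1   = 1024M, /disk1/bsbm-seg1-1.db = q1, /disk2/bsbm-seg1-2.db = q2, /disk3/bsbm-seg1-3.db = q3, /disk4/bsbm-seg1-4.db = q4

[Parameters]
; (16 concurrent threads * 1.5) / 4 database files
FDsPerFile = 6
</pre>
</blockquote>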

<p>There are no SSD specific settings.</p>
</li>


<li>
<p>
    <b>Loading - How many parallel streams work best? We are looking for non-transactional bulk load, with no inference materialization. For partitioned cluster settings, do we divide the load streams over server processes? </b>
  </p>

<p>Use one stream per core (not per core thread). In the case of a cluster, divide load streams evenly across all processes. The total number of streams on a cluster can equal the total number of cores; adjust up or down depending on what is observed.</p>

<p>Use the built-in bulk load facility, i.e., </p>
<blockquote>
    <code>ld_dir (&#39;&lt;source-filename-or-directory&gt;&#39;, &#39;&lt;file name pattern&gt;&#39;, &#39;&lt;destination graph iri&gt;&#39;);</code>
  </blockquote>
<p>For example,</p>
<blockquote>
    <code><a class="auto-href" href="http://dbpedia.org/resource/SQL" id="link-id0x1dc52c58">SQL</a>&gt; ld_dir (&#39;/path/to/files&#39;, &#39;*.n3&#39;, &#39;<a class="auto-href" href="http://dbpedia.org/resource/Hypertext_Transfer_Protocol" id="link-id0x1e76bfc8">http</a>://<a class="auto-href" href="http://dbpedia.org/resource/DBpedia" id="link-id0x1e9a6ad8">dbpedia</a>.org&#39;);</code>
  </blockquote>
<p>Then do a <code>rdf_loader_run ()</code> on enough connections. For example, you can use the shell command </p>
<blockquote>
    <code>isql rdf_loader_run () &amp;</code> </blockquote>
<p>to start one in a background isql process. When starting background load commands from the shell, you can use the shell <code>wait</code> command to wait for completion. If starting from isql, use the <code>wait_for_children;</code> command (see <a href="http://docs.openlinksw.com/virtuoso/isql.html" id="link-id0x1ae0f230">isql documentation</a> for details). </p>
<p>See the <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1d635820">BSBM disclosure report</a> for an example load script.</p>
</li>


<li>
<p>
    <b>What command should be used after non-transactional bulk load, to ensure a consistent persistent state on disk, like a log checkpoint or similar? Load and checkpoint will be timed separately, load being <a class="auto-href" href="http://dbpedia.org/resource/Central_processing_unit" id="link-id0x1e6f1000">CPU</a>-bound and checkpoint being I/O-bound. No roll-forward log or similar is required; the load does not have to recover if it fails before the checkpoint.</b>
  </p>

<p>Execute </p>
<blockquote>
    <code> CHECKPOINT;</code>
  </blockquote> 
<p>through a SQL client, e.g., <code>isql</code>. This is not a <a class="auto-href" href="http://dbpedia.org/resource/SPARQL" id="link-id0x1c2401d8">SPARQL</a> statement and cannot be executed over the SPARQL protocol.</p>
</li>


<li>
<p>
    <b>What settings should be used for trickle load of small triple sets into a pre-existing graph? This should be as transactional as supported; at least there should be a roll forward log, unlike the case for the bulk load.</b>
  </p>

<p>No special settings are needed for load testing; defaults will produce transactional behavior with a roll forward log. Default transaction isolation is <b><code>REPEATABLE READ</code></b>, but this may be altered via SQL session settings or at Virtuoso server start-up through <a href="http://docs.openlinksw.com/virtuoso/databaseadmsrv.html#ini_Parameters" id="link-id0x1a791b80">the <code>[Parameters]</code> section of the <code>virtuoso.ini</code> file</a>, with</p>
<blockquote>
   <b><code><a href="http://wikis.openlinksw.com/dataspace/owiki/wiki/VirtuosoWikiWeb/ChangeVirtuosoSDefaultTransactionIsolationLevel" id="link-id0x1e5536b8">DefaultIsolation</a> = 4</code>
   </b>
  </blockquote>
<p> Transaction isolation cannot be set over the SPARQL protocol.</p>
<p> NOTE: When testing full CRUD operations, other isolation settings may be preferable, due to <a class="auto-href" href="http://dbpedia.org/resource/ACID" id="link-id0x1ce6a310">ACID</a> considerations.  See answer #12, below, and detailed discussion in part 8 of this series, <a href="http://www.openlinksw.com/weblog/oerling/?id=1673" id="link-id0x1b7eb5f0">BSBM <i>Explore and Update</i></a>.</p>
</li>


<li>
<p>
    <b>What settings control allocation of memory for database caching? We will be running mostly from memory, so we need to make sure that there is enough memory configured. </b>
  </p>

<p>
    <a href="http://docs.openlinksw.com/virtuoso/databaseadmsrv.html#ini_Parameters" id="link-id0x1acd8fe8">In the <code>[Parameters]</code> section of the <code>virtuoso.ini</code> file</a>, <b><code>NumberOfBuffers</code></b> controls the amount of RAM used by Virtuoso to cache database files. One buffer caches an 8KB database page. In practice, count 10KB of memory per page. If &quot;swappiness&quot; on Linux is low (e.g., 2), two-thirds or more of physical memory can be used for database buffers. If swapping occurs, decrease the setting.</p>
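<p>As a worked example under assumed figures: on a hypothetical machine with 72GB of RAM, two-thirds of physical memory is 48GB, or 50,331,648KB. At roughly 10KB of memory per buffer, that allows about 5,000,000 buffers.</p>
<blockquote>
<pre>
[Parameters]
; 48GB at about 10KB per buffer, rounded down; decrease if swapping occurs
NumberOfBuffers = 5000000
</pre>
</blockquote>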
</li>


<li>
<p>
    <b>What command gives status on memory allocation (e.g., number of buffers, number of dirty buffers, etc.) so that we can verify that things are indeed in server memory and not, for example, being served from OS disk cache. If the cached format is different from the disk layout (e.g., decompression after disk read), is there a command for space statistics for database cache? </b>
  </p>

<p>In an <code>isql</code> session, execute </p>
<blockquote>
    <code>STATUS (&#39;&#39;);</code>
  </blockquote> 
<p>The second result paragraph gives counts of total, used, and dirty buffers. If used buffers is steady and less than total, and if the disk read count on the line below does not increase, the system is running from memory. The cached format is the same as the disk based format.</p>
</li>


<li>
<p>
    <b>What command gives <a class="auto-href" href="http://dbpedia.org/resource/Information" id="link-id0x1c185f28">information</a> on disk allocation for different things? We are looking for the total size of allocated database pages for quads (including table, indices, anything else associated with quads) and dictionaries for literals, IRI names, etc. If there is a text index on literals, what command gives space stats for this? We count used pages, excluding any preallocated unused pages or other gaps. There is one number for quads and another for the dictionaries or other such structures, optionally a third for text index.</b>
  </p>


<p>Execute on an <code>isql</code> session: </p>

<blockquote>
   <pre><code>CHECKPOINT;
SELECT TOP 20 * FROM sys_index_space_stats ORDER BY iss_pages DESC;
</code></pre>
  </blockquote>

<p>The <code>iss_pages</code> column is the total pages for each index, including blob pages. Pages are 8KB. Only used pages are reported; gaps and unused pages are not counted. The rows pertaining to <code>RDF_QUAD</code> are for quads; <code>RDF_IRI</code>, <code>RDF_PREFIX</code>, <code>RO_START</code>, <code>RDF_OBJ</code> are for dictionaries; <code>RDF_OBJ_RO_FLAGS_WORDS</code> and <code>VTLOG_DB_DBA_RDF_OBJ</code> are for the text index. </p>


</li>
<li>
<p>
    <b>If there is a choice between triples and quads, we will run with quads. How do we ascertain that the run is with quads? How do we find out the index scheme? Should we use an alternate index scheme? Most of the <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x1c573780">data</a> will be in a single big graph.</b>
  </p>

<p>The default scheme uses quads. The default index layout is <code>PSOG</code>, <code>POGS</code>, <code>GS</code>, <code>SP</code>, <code>OP</code>. To see the current index scheme, use an <code>isql</code> session to execute</p>
<blockquote>
    <code>STATISTICS DB.DBA.RDF_QUAD;</code>
  </blockquote>


</li>
<li>
<p>
    <b>For partitioned cluster settings, are there partitioning-related settings to control even distribution of data between partitions? For example, is there a way to set partitioning by <code>S</code> or <code>O</code> depending on which is first in key order for each index? </b>
  </p>

<p>The default partitioning settings are good, i.e., partitioning is on <code>O</code> or <code>S</code>, whichever is first in key order.</p>


</li>
<li>
<p>
    <b>For partitioned clusters, are there settings to control message batching or similar? What are the statistics available for checking interconnect operation, e.g. message counts, latencies, total aggregate throughput of interconnect?</b>
  </p>

<p> <a href="http://docs.openlinksw.com/virtuoso/clusteroperation.html#clusteroperationgeneralclusterinifields" id="link-id0x1ec6dff0">In the <code>[Cluster]</code> section of the <code>cluster.ini</code> file</a>, <b><code>ReqBatchSize</code></b> is the number of query states dispatched between cluster nodes per message round trip. This may be increased from the default of <code>10000</code> to <code>50000</code> or so if this is seen to be useful. </p>
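<p>In file form, the larger batch size mentioned above would be a one-line change. The value <code>50000</code> is only the upper figure suggested above, and whether it helps depends on the workload.</p>
<blockquote>
<pre>
[Cluster]
; default is 10000; raise only if measurements show a benefit
ReqBatchSize = 50000
</pre>
</blockquote>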

<p>To change this on the fly, the following can be issued through an <code>isql</code> session:</p>
<blockquote>
<code>cl_exec ( &#39; __dbf_set (&#39;&#39;cl_request_batch_size&#39;&#39;, 50000) &#39; ); </code>
  </blockquote>

<p>The commands below may be executed through an <code>isql</code> session to get a summary of CPU and message traffic for the whole cluster or process-by-process, respectively. The documentation <a href="http://docs.openlinksw.com/virtuoso/clusteroperation.html#clusteroperationadminstdispl" id="link-id0x1dfccec0">details the fields</a>. </p>
<blockquote>
   <pre><code>STATUS (&#39;cluster&#39;);      -- whole cluster
STATUS (&#39;cluster_d&#39;);    -- process-by-process</code></pre></blockquote>

</li>
<li>
<p>
    <b>Other settings - Are there settings for limiting query planning, when appropriate? For example, the BSBM <i>Explore</i> mix has a large component of unnecessary query optimizer time, since the queries themselves access almost no data. Any other relevant settings?</b>
  </p>

<ul>
<li>
      <p>For BSBM, needless query <a class="auto-href" href="http://dbpedia.org/resource/Program_optimization" id="link-id0x1f0ffab8">optimization</a> should be capped at Virtuoso server start-up through the <code>[Parameters]</code> section of the <code>virtuoso.ini</code>, with</p>
<blockquote>
     <b><code>StopCompilerWhenXOverRun = 1</code>
     </b>
      </blockquote> </li>
<li>
      <p>When testing full CRUD operations (not simply CREATE, i.e., load, as discussed in #5, above), it is essential to make queries run with transaction isolation of <code>READ COMMITTED</code>, to remove most lock contention.  Transaction isolation cannot be adjusted via SPARQL.  This can be changed through SQL session settings, or at Virtuoso server start-up <a href="http://docs.openlinksw.com/virtuoso/databaseadmsrv.html#ini_Parameters" id="link-id0x1f3a43c8">through the <code>[Parameters]</code> section of the <code>virtuoso.ini</code> file</a>, with</p>
<blockquote>
     <b><code><a href="http://wikis.openlinksw.com/dataspace/owiki/wiki/VirtuosoWikiWeb/ChangeVirtuosoSDefaultTransactionIsolationLevel" id="link-id0x1a5a51e0">DefaultIsolation</a> = 2</code>
     </b>
      </blockquote>
</li>
</ul>
</li>
</ol>



<h3>
<i>Benchmarks, Redux</i> Series</h3>
<ul>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1658" id="link-id0x1d6e5428">Benchmarks, Redux (part 1): On RDF Benchmarks</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x1c3ea770">Benchmarks, Redux (part 2): A Benchmarking Story</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1663" id="link-id0x1efeca30">Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore</a>
</li>
<li>
Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire <i>(this post)</i>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1667" id="link-id0x1bda5158">Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1669" id="link-id0x1ec74808">Benchmarks, Redux (part 6): BSBM and I/O, continued</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1671" id="link-id0x1ea253a0">Benchmarks, Redux (part 7): What Does BSBM Explore Measure?</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1673" id="link-id0x1b02d528">Benchmarks, Redux (part 8): BSBM Explore and Update </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1675" id="link-id0x1ae81fc0">Benchmarks, Redux (part 9): BSBM With Cluster</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1677" id="link-id0x197515c0">Benchmarks, Redux (part 10): LOD2 and the Benchmark Process</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1678" id="link-id0x1a78db90">Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1d32ae10">Benchmarks, Redux (part 12): Our Own BSBM Results Report</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1e8fcc18">Benchmarks, Redux (part 13): BSBM BI Modifications </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1ae95050">Benchmarks, Redux (part 14): BSBM BI Mix </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1dbf3158">Benchmarks, Redux (part 15): BSBM Test Driver Enhancements </a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-03-02#1664">
  <rss:title>Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-03-02T23:23:16Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">In this post I will summarize the figures for BSBM Load and Explore mixes at 100 Mt, 200 Mt, and 1000 Mt. (1 Mt = 1 Megatriple, or one million triples.) The measurements were made on a 72GB 2xXeon 5520 with 4 SSDs. The exact specifications and configurations are in the raw reports to follow. The load time in the recent Berlin report was measured with the wrong function, and so far as we can tell, without multiple threads. The intermediate cut of Virtuoso they tested also had broken SPARQL/Update (also known as SPARUL) features. We have fixed this since, and give here the right numbers. In the course of the discussion to follow, we talk about 3 different kinds of Virtuoso: 6 Single is the generally available single server configuration of Virtuoso. Whether this is open source or not does not make a difference. 6 Cluster is the generally available commercial only cluster-capable Virtuoso. 7 Single is the next generation single server Virtuoso, about to be released as a preview. To understand the numbers, we must explain how these differ from each other in execution: 6 Single has one thread-per-query, and operates on one state of the query at a time. 6 Cluster has one thread-per-query-per-process, and between processes it operates on batches of some tens-of-thousands of simultaneous query states. Within each node, these batches run through the execution pipeline one state at a time. Aggregation is distributed, and the query optimizer is generally smart about shipping colocated functions together. 7 Single has multiple threads-per-query and in all situations operates on batches of 10,000 or more simultaneous query states. This means, for example, that index lookups get large numbers of parameters which then are sorted to get an ascending search pattern which benefits from locality, so the n * log(n) index access for the batch becomes more like linear if the data accessed has any locality. 
Furthermore, if there are many operands to an operator, these can be split on multiple threads. Also, scans of consecutive rows can be split before the scan on multiple threads, each doing a range of the scan. These features are called vectored execution and query parallelization. These techniques will also be applied to the cluster variant in due time. The version 6 and 7 variants discussed here use the same physical storage layout with row-wise key compression. Additionally, there exists a column-wise storage option in 7 that can fit 4x the number of quads in the same space. This column store option is not used here because it still has some problems with random order inserts. We will first consider loading. Below are the load times and rates for 7 at each scale. 7 Single Scale Rate (quads per second) Load time (seconds) Checkpoint time (seconds) 100 Mt 261,366 301 82 200 Mt 216,000 802 123 1000 Mt 130,378 6641 1012 In each case the load was made on 8 concurrent streams, each reading a file from a pool of 80 files for the two smaller scales and 360 files for the larger scale. We also loaded the smallest data set with 6 Single using the same load script. 6 Single Scale Rate (quads per second) Load time (seconds) Checkpoint time (seconds) 100 Mt 74,713 1192 145 CPU time with 6 Single was 8047 seconds. We compare this to 4453 seconds of CPU for the same load on 7 Single. The CPU% during the run was on either side of 700% for 6 Single and 1300% for 7 Single. Note that high percentages involve core threads, not real cores. The difference is mostly attributable to vectoring and the introduction of a non-transactional insert. The 6 Single inserts transactionally but makes very frequent commits and writes no log, resulting in de facto non-transactional behavior but still there is a lock and commit cycle. Inserts in RDF load usually exhibit locality on all SPOG. 
Sorting by value gives ascending insert order and eliminates much of the lookup time for deciding where the next row will go. Contention on page read-write locks is less because the engine stays longer on a page, inserting multiple values in one go, instead of re-acquiring the read-write lock and possible transaction locks for each row. Furthermore, for single stream loading the non-transactional mode can serve one thread doing the parsing with many threads doing the inserting; hence, in practice the speed is bounded by the parsing speed. In multi-stream load this parallelization also happens but is less significant, as adding threads past the count of core threads is not useful. Writes are all in-place, and no delta-merge mechanism is involved. For transactional inserts, the uncommitted rows are not visible to read-committed readers, which do not block. Repeatable and serializable readers would block before an uncommitted insert. Now for the run (larger numbers indicate more queries executed, and are therefore better): 6 Single Throughput (QMpH, query mixes per hour) Scale Single User 16 User 100 Mt 7641 29433 200 Mt 6017 13335 1000 Mt 1770 2487 7 Single Throughput (QMpH, query mixes per hour) Scale Single User 16 User 100 Mt 11742 72278 200 Mt 10225 60951 1000 Mt 6262 24672 The 100 Mt and 200 Mt runs are entirely in memory; the 1000 Mt run is mostly in memory, with about a 1.6 MB/s trickle from SSD in steady state. Accordingly, the 1000 Mt run is longer, with 2000 query mixes in the timed period, preceded by a warm-up of 2000 mixes with a different seed. For the memory-only scales, we run 500 mixes twice, and take the timing of the second run. Looking at single user speeds, 6 Single and 7 Single are closest at the small end and drift farther apart at the larger scales. This comes from the increased opportunity to parallelize Q5, since this works on more data and is relatively more important as the scale gets larger. 
The 100 Mt run of 7 Single has about 130% CPU, and the 1000 Mt run has about 270%. This also explains why adding clients gives a larger boost at the smaller scale. Now let us look at the relative effects of parallelizing and vectoring in 7 Single. We run 50 mixes of Single User Explore: 6132 QMpH with both parallelizing and vectoring on; 2805 QMpH with execution limited to a single thread. Then we set the vector size to 1, meaning that the query pipeline runs one row at a time. This gets us 1319 QMpH which is a bit worse than 6 Single. This is to be expected since there is some overhead to running vectored with single-element vectors. Q5 on 7 Single with vectoring and a single thread runs at 1.9 qps; with single-element vectors, at 0.8 qps. The 6 Single engine runs Q5 at 1.13 qps. The 100 Mt scale 7 Single gains the most from adding clients; the 1000 Mt 6 Single gains the least. The reason for the latter is covered in detail in A Benchmarking Story. We note that while vectoring is primarily geared to better single-thread speed and better cache hit rates, it delivers a huge multithreaded benefit by eliminating the mutex contention at the index tree top which stops 6 Single dead at 1000 Mt. In conclusion, we see that even with a workload of short queries and little opportunity for parallelism, we get substantial benefits from query parallelization and vectoring. When moving to more complex workloads, the benefits become more pronounced. For a single user complex query load, we can get 7x speed-up from parallelism (8 core), plus up to 3x from vectoring. These numbers do not take into account the benefits of the column store; those will be analyzed separately a bit later. The full run details will be supplied at the end of this blog series. 
Benchmarks, Redux Series Benchmarks, Redux (part 1): On RDF Benchmarks Benchmarks, Redux (part 2): A Benchmarking Story Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore (this post) Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs Benchmarks, Redux (part 6): BSBM and I/O, continued Benchmarks, Redux (part 7): What Does BSBM Explore Measure? Benchmarks, Redux (part 8): BSBM Explore and Update Benchmarks, Redux (part 9): BSBM With Cluster Benchmarks, Redux (part 10): LOD2 and the Benchmark Process Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks Benchmarks, Redux (part 12): Our Own BSBM Results Report Benchmarks, Redux (part 13): BSBM BI Modifications Benchmarks, Redux (part 14): BSBM BI Mix Benchmarks, Redux (part 15): BSBM Test Driver Enhancements</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>In this post I will summarize the figures for <a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x1dcf58f8">BSBM</a> Load and <i>Explore</i> mixes at 100 Mt, 200 Mt, and 1000 Mt.  (1 Mt = 1 Megatriple, or one million triples.)  The measurements were made on a 72GB 2xXeon 5520 with 4 SSDs.  The exact specifications and configurations are in the raw reports to follow.</p>

<p>The load time in <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/V6/index.html" id="link-id0x1f3716d8">the recent Berlin report</a> was measured with <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/V6/index.html#resultsExplore" id="link-id0x1dd37f80">the wrong function</a>, and so far as we can tell, without multiple threads. The intermediate cut of <a class="auto-href" href="http://virtuoso.openlinksw.com" id="link-id0x1ddb0c90">Virtuoso</a> they tested also <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/V6/index.html#resultsExploreAndUpdate" id="link-id0x1e5fcf40"> had broken</a> <a class="auto-href" href="http://dbpedia.org/resource/SPARQL" id="link-id0x1e1d2b70">SPARQL</a>/<a class="auto-href" href="http://dbpedia.org/page/SPARUL" id="link-id0x1bfb00c0">Update</a> (also known as <a class="auto-href" href="http://dbpedia.org/page/SPARUL" id="link-id0x1e0d5fd8">SPARUL</a>) features.  We have fixed this since, and give <a href="http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/BenchmarksReduxSupportingFiles/results.zip" id="link-id0x1edf36b0">here the right numbers</a>.</p>

<p>In the course of the discussion to follow, we talk about 3 different kinds of Virtuoso:</p>

<ul>
 <li>
  <p>
    <i>6 Single</i> is the generally available single server configuration of Virtuoso.  Whether this is open source or not does not make a difference.</p>
 </li>
<li>
  <p>
    <i>6 Cluster</i> is the generally available, commercial-only, cluster-capable Virtuoso.</p>
</li>
<li>
  <p>
    <i>7 Single</i> is the next generation single server Virtuoso, about to be released as a preview.</p>
</li>
</ul>

<p>To understand the numbers, we must explain how these differ from each other in execution:</p>

<ul>
 <li>
  <p>
    <i>6 Single</i> has one thread-per-query, and operates on one state of the query at a time.</p>
 </li>

<li>
  <p>
    <i>6 Cluster</i> has one thread-per-query-per-process, and between processes it operates on batches of some tens-of-thousands of simultaneous query states.  Within each node, these batches run through the execution pipeline one state at a time. Aggregation is distributed, and the query optimizer is generally smart about shipping colocated functions together.</p>
</li>

<li>
  <p>
    <i>7 Single</i> has multiple threads-per-query and in all situations operates on batches of 10,000 or more simultaneous query states.  This means, for example, that index lookups get large numbers of parameters which then are sorted to get an ascending search pattern which benefits from locality, so the <code>n * log(n)</code> index access for the batch becomes more like linear if the <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x1ceca188">data</a> accessed has any locality. Furthermore, if there are many operands to an operator, these can be split on multiple threads.  Also, scans of consecutive rows can be split before the scan on multiple threads, each doing a range of the scan.  These features are called <i>vectored execution</i> and <i>query parallelization</i>.  These techniques will also be applied to the cluster variant in due time.</p>
</li>
</ul>
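
<p>As a rough illustration of why sorting a batch of lookup parameters helps, here is a minimal Python sketch. It is not Virtuoso code: the function names are made up for this example, and a flat sorted list stands in for the index. Once the batch is sorted, each search can start where the previous one ended, so the range left to search keeps shrinking.</p>

```python
from bisect import bisect_left

def lookup_one_at_a_time(index, keys):
    """Search the whole index for every key, one query state at a time."""
    return [bisect_left(index, k) for k in keys]

def lookup_vectored(index, keys):
    """Sort the batch first, then resume each search from the previous hit."""
    order = sorted(range(len(keys)), key=keys.__getitem__)
    positions = [0] * len(keys)
    lo = 0
    for i in order:
        # Keys are visited in ascending order, so nothing left of `lo` can match.
        lo = bisect_left(index, keys[i], lo)
        positions[i] = lo
    return positions

index = list(range(0, 1000, 3))       # a sorted "index" of key values
batch = [300, 12, 303, 15, 306, 9]    # lookup parameters with some locality
assert lookup_vectored(index, batch) == lookup_one_at_a_time(index, batch)
```

<p>The sketch only narrows a binary search. In a real B-tree the corresponding gain would come from staying on the same leaf page for consecutive keys instead of descending from the top of the tree each time.</p>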

<p>The version 6 and 7 variants discussed here use the same physical storage layout with row-wise <a class="auto-href" href="http://dbpedia.org/resource/Data_compression" id="link-id0x1e521fa0">key compression</a>.  Additionally, there exists a column-wise storage option in 7 that can fit 4x the number of quads in the same space.  This column store option is not used here because it still has some problems with random order inserts.</p>

<p> We will first consider loading.  Below are the load times and rates for 7 at each scale.</p>

<table border="1" cellspacing="2" cellpadding="2" align="center" width="90%">
	<tr>
		<th colspan="4" align="center">7 Single</th>
	</tr>
	<tr>
		<th align="center">Scale</th>
		<th align="center">Rate <br /> (quads per second)</th>
		<th align="center">Load time <br /> (seconds)</th>
		<th align="center">Checkpoint time <br /> (seconds)</th>
	</tr>
	<tr>
		<th align="center">100 Mt</th>
		<td align="center"> 261,366 </td>
		<td align="center"> 301 </td>
		<td align="center"> 82 </td>
	</tr>
	<tr>
		<th align="center">200 Mt</th>
		<td align="center"> 216,000 </td>
		<td align="center"> 802 </td>
		<td align="center"> 123 </td>
	</tr>
	<tr>
		<th align="center">1000 Mt</th>
		<td align="center"> 130,378 </td>
		<td align="center"> 6641 </td>
		<td align="center"> 1012 </td>
	</tr>
</table>

<p>In each case the load was made on 8 concurrent streams, each reading a file from a pool of 80 files for the two smaller scales and 360 files for the larger scale.</p>

<p>We also loaded the smallest data set with 6 Single using the same load script.</p>
<table border="1" cellspacing="2" cellpadding="2" align="center" width="90%">
	<tr>
		<th colspan="4" align="center">6 Single</th>
	</tr>
	<tr>
		<th align="center">Scale</th>
		<th align="center">Rate <br /> (quads per second)</th>
		<th align="center">Load time <br /> (seconds)</th>
		<th align="center">Checkpoint time <br /> (seconds)</th>
	</tr>
	<tr>
		<th align="center">100 Mt</th>
		<td align="center"> 74,713 </td>
		<td align="center"> 1192 </td>
		<td align="center"> 145 </td>
	</tr>
</table>
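
<p>A quick consistency check on the load tables above: the rate column looks as if it is taken over load time plus checkpoint time, since multiplying the rate by that sum gives roughly the nominal number of quads at each scale (about 100, 200, and 998 million). This reading is inferred from the published figures and is not stated in the reports. The helper name below is only for illustration.</p>

```python
# (configuration, scale) -> (quads per second, load seconds, checkpoint seconds)
loads = {
    ("7 Single", "100 Mt"): (261_366, 301, 82),
    ("7 Single", "200 Mt"): (216_000, 802, 123),
    ("7 Single", "1000 Mt"): (130_378, 6641, 1012),
    ("6 Single", "100 Mt"): (74_713, 1192, 145),
}

def implied_quads(rate, load_s, checkpoint_s):
    """Quads implied if the rate covers load plus checkpoint time."""
    return rate * (load_s + checkpoint_s)

for (config, scale), row in loads.items():
    millions = implied_quads(*row) / 1e6
    print(f"{config} {scale}: {millions:.1f} million quads")
```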


<p>
<a class="auto-href" href="http://dbpedia.org/resource/Central_processing_unit" id="link-id0x1132ad18">CPU</a> time with 6 Single was 8047 seconds, compared to 4453 seconds of CPU for the same load on 7 Single.  The CPU% during the run hovered around 700% for 6 Single and around 1300% for 7 Single.  Note that percentages this high count core threads (hardware threads), not physical cores. </p>
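
<p>Put as ratios, using only the figures quoted above, 7 Single finishes the 100 Mt load in about a quarter of the elapsed time while spending a bit more than half the CPU seconds. The variable names in this small calculation are illustrative.</p>

```python
load_seconds = {"6 Single": 1192, "7 Single": 301}
cpu_seconds = {"6 Single": 8047, "7 Single": 4453}

# How many times shorter the elapsed load is, and how many times less CPU it takes.
wall_speedup = load_seconds["6 Single"] / load_seconds["7 Single"]
cpu_ratio = cpu_seconds["6 Single"] / cpu_seconds["7 Single"]

print(f"elapsed load time: {wall_speedup:.2f}x shorter on 7 Single")
print(f"CPU seconds:       {cpu_ratio:.2f}x fewer on 7 Single")
```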

<p>The difference is mostly attributable to vectoring and the introduction of a non-transactional insert.  6 Single inserts transactionally but makes very frequent commits and writes no log; the result is <i>de facto</i> non-transactional behavior, but there is still a lock and commit cycle.  Inserts in <a class="auto-href" href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1c750368">RDF</a> load usually exhibit locality on all of S, P, O, and G.  Sorting by value gives ascending insert order and eliminates much of the lookup time for deciding where the next row will go.  Contention on page read-write locks is lower because the engine stays longer on a page, inserting multiple values in one go, instead of re-acquiring the read-write lock and possibly transaction locks for each row.</p>

<p>Furthermore, for single stream loading the non-transactional mode can serve one thread doing the parsing with many threads doing the inserting; hence, in practice the speed is bounded by the parsing speed.  In multi-stream load this parallelization also happens but is less significant, as adding threads past the count of core threads is not useful.  Writes are all in-place, and no delta-merge mechanism is involved.  For transactional inserts, the uncommitted rows are not visible to read-committed readers, which do not block.  Repeatable and serializable readers would block before an uncommitted insert.</p>



<p>Now for the run (larger numbers indicate more queries executed, and are therefore better):</p>

<table border="1" cellspacing="2" cellpadding="2" align="center" width="90%">
	<tr>
		<th colspan="3" align="center"> 6 Single Throughput <br /> (QMpH, query mixes per hour) </th>
	</tr>
	<tr>
		<th align="center">Scale</th>
		<th align="center"> Single User </th>
		<th align="center"> 16 User </th>
	</tr>
	<tr>
		<th align="center">100 Mt</th>
		<td align="center"> 7641 </td>
		<td align="center"> 29433 </td>
	</tr>
	<tr>
		<th align="center">200 Mt</th>
		<td align="center"> 6017 </td>
		<td align="center"> 13335 </td>
	</tr>
	<tr>
		<th align="center">1000 Mt</th>
		<td align="center"> 1770 </td>
		<td align="center"> 2487 </td>
	</tr>
</table>
<br />
<table border="1" cellspacing="2" cellpadding="2" align="center" width="90%">
	<tr>
		<th colspan="3" align="center"> 7 Single Throughput <br /> (QMpH, query mixes per hour) </th>
	</tr>
	<tr>
		<th align="center">Scale</th>
		<th align="center"> Single User </th>
		<th align="center"> 16 User </th>
	</tr>
	<tr>
		<th align="center">100 Mt</th>
		<td align="center"> 11742 </td>
		<td align="center"> 72278 </td>
	</tr>
	<tr>
		<th align="center">200 Mt</th>
		<td align="center"> 10225 </td>
		<td align="center"> 60951 </td>
	</tr>
	<tr>
		<th align="center">1000 Mt</th>
		<td align="center"> 6262 </td>
		<td align="center"> 24672 </td>
	</tr>
</table>

<p>The 100 Mt and 200 Mt runs are entirely in memory; the 1000 Mt run is mostly in memory, with about a 1.6 MB/s trickle from SSD in steady state.  Accordingly, the 1000 Mt run is longer, with 2000 query mixes in the timed period, preceded by a warm-up of 2000 mixes with a different seed.  For the memory-only scales, we run 500 mixes twice, and take the timing of the second run.</p>

<p>Looking at single user speeds, 6 Single and 7 Single are closest at the small end and drift farther apart at the larger scales. This comes from the increased opportunity to parallelize Q5, since this works on more data and is relatively more important as the scale gets larger. The 100 Mt run of 7 Single has about 130% CPU, and the 1000 Mt run has about 270%.  This also explains why adding clients gives a larger boost at the smaller scale. </p>

<p>Now let us look at the relative effects of parallelizing and vectoring in 7 Single.  We run 50 mixes of Single User <i>Explore</i>: 6132 QMpH with both parallelizing and vectoring on; 2805 QMpH with execution limited to a single thread.  Then we set the vector size to 1, meaning that the query pipeline runs one row at a time.  This gets us 1319 QMpH which is a bit worse than 6 Single.  This is to be expected since there is some overhead to running vectored with single-element vectors. Q5 on 7 Single with vectoring and a single thread runs at 1.9 qps; with single-element vectors, at 0.8 qps. The 6 Single engine runs Q5 at 1.13 qps.</p>
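
<p>The three measurements above can be split into two factors. The sketch below assumes that the vector-size-1 run was also limited to a single thread, which is how the text reads; the variable names are illustrative.</p>

```python
qmph_parallel_vectored = 6132   # parallelizing and vectoring both on
qmph_single_thread = 2805       # vectoring on, execution limited to one thread
qmph_vector_size_1 = 1319       # one thread, single-element vectors (assumed)

# Each factor compares two runs that differ in one setting only.
gain_from_threads = qmph_parallel_vectored / qmph_single_thread
gain_from_vectoring = qmph_single_thread / qmph_vector_size_1

# Q5 alone, single thread: 1.9 qps vectored vs 0.8 qps with single-element vectors.
q5_gain_from_vectoring = 1.9 / 0.8

print(f"threads:        {gain_from_threads:.2f}x")
print(f"vectoring:      {gain_from_vectoring:.2f}x")
print(f"vectoring (Q5): {q5_gain_from_vectoring:.1f}x")
```

<p>On this short-query mix, each of the two techniques contributes a bit more than a factor of two.</p>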

<p>The 100 Mt scale 7 Single gains the most from adding clients; the 1000 Mt 6 Single gains the least.  The reason for the latter is covered in detail in <a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x1b9ed390">A Benchmarking Story</a>.  We note that while vectoring is primarily geared to better single-thread speed and better <a class="auto-href" href="http://dbpedia.org/resource/Cache" id="link-id0x1ddc2f48">cache</a> hit rates, it delivers a huge multithreaded benefit by eliminating the mutex contention at the index tree top which stops 6 Single dead at 1000 Mt.</p>
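
<p>The gains from adding clients, and the single-user gap between the two versions, can be read directly off the throughput tables. The short calculation below uses only those published figures; the dictionary layout and names are chosen for this example.</p>

```python
# configuration -> scale -> (single-user QMpH, 16-user QMpH)
throughput = {
    "6 Single": {"100 Mt": (7641, 29433), "200 Mt": (6017, 13335), "1000 Mt": (1770, 2487)},
    "7 Single": {"100 Mt": (11742, 72278), "200 Mt": (10225, 60951), "1000 Mt": (6262, 24672)},
}

# Throughput multiplier when going from 1 to 16 clients.
client_gain = {
    (config, scale): multi / single
    for config, scales in throughput.items()
    for scale, (single, multi) in scales.items()
}

# Single-user throughput of 7 Single relative to 6 Single.
version_gain = {
    scale: throughput["7 Single"][scale][0] / throughput["6 Single"][scale][0]
    for scale in throughput["6 Single"]
}

best = max(client_gain, key=client_gain.get)
worst = min(client_gain, key=client_gain.get)
print("largest gain from clients: ", best, f"{client_gain[best]:.2f}x")
print("smallest gain from clients:", worst, f"{client_gain[worst]:.2f}x")
for scale, gain in version_gain.items():
    print(f"single user, 7 vs 6 at {scale}: {gain:.2f}x")
```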

<p>In conclusion, we see that even with a workload of short queries and little opportunity for parallelism, we get substantial benefits from query parallelization and vectoring.  When moving to more complex workloads, the benefits become more pronounced.  For a single user complex query load, we can get a 7x speed-up from parallelism (on 8 cores), plus up to 3x from vectoring.  These numbers do not take into account the benefits of the column store; those will be analyzed separately a bit later.</p>

<p>The full run details will be supplied at the end of this <a class="auto-href" href="http://dbpedia.org/resource/Blog" id="link-id0x1e9f6960">blog</a> series.</p>

<h3>
<i>Benchmarks, Redux</i> Series</h3>
<ul>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1658" id="link-id0x1d0bb988">Benchmarks, Redux (part 1): On RDF Benchmarks </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x155fc700">Benchmarks, Redux (part 2): A Benchmarking Story</a>
</li>
<li>Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore <i>(this post)</i>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1665" id="link-id0x1d96e218">Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1667" id="link-id0x1d7a5170">Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1669" id="link-id0x1def9ca0">Benchmarks, Redux (part 6): BSBM and I/O, continued</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1671" id="link-id0x1a7a7800">Benchmarks, Redux (part 7): What Does BSBM Explore Measure?</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1673" id="link-id0x1e9c6c68">Benchmarks, Redux (part 8): BSBM Explore and Update </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1675" id="link-id0x1e80c208">Benchmarks, Redux (part 9): BSBM With Cluster</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1677" id="link-id0x1dafd290">Benchmarks, Redux (part 10): LOD2 and the Benchmark Process</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1678" id="link-id0x1f34f7f8">Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1df24f50">Benchmarks, Redux (part 12): Our Own BSBM Results Report</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1f4b19c8">Benchmarks, Redux (part 13): BSBM BI Modifications </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1de90cf8">Benchmarks, Redux (part 14): BSBM BI Mix </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1ebefbe8">Benchmarks, Redux (part 15): BSBM Test Driver Enhancements </a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-02-28#1661">
  <rss:title>Benchmarks, Redux (part 2): A Benchmarking Story</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-02-28T21:12:28Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Caeterum censeo, benchmarks are for vendors... This is an edifying story about benchmarks and how databases work. I will show how one detail makes a 5+x difference, and how one really must understand how things work in order to make sense of benchmarks. We begin right after the publication of the recent Berlin report. This report gives us OK performance for queries and very bad performance for loading. Trickle updates were not measurable. This comes as a consequence of testing intermediate software cuts and having incomplete instructions for operating them. I will cover the whole BSBM matter and the general benchmarking question in forthcoming posts; for now, let&#39;s talk about specifics. In the course of the discussion to follow, we talk about 3 different kinds of Virtuoso: 6 Single is the generally available single-instance-server configuration of Virtuoso. Whether this is open source or not does not make a difference. 6 Cluster is the generally available, commercial-only, cluster-capable Virtuoso. 7 Single is the next-generation single-instance-server Virtuoso, about to be released as a preview. We began by running the various parts of BSBM at different scales with different Virtuoso variants. In so doing, we noticed that the BSBM Explore mix at one scale got better throughput as we added more clients, approximately as one would expect based on CPU usage and number of cores, while at another scale this was not so. At the 1-billion-triple scale (1000 Mt; 1 Mt = 1 Megatriple, or one million triples) we saw CPU going from 200% with 1 client to 1400% with 16 clients but throughput increased by less than 20%. When we ran the same scale with our shared-nothing 6 Cluster, running 8 processes on the same box, throughput increased normally with the client count. We have not previously tried BSBM with 6 Cluster simply because there is little to gain and a lot to lose by distributing this workload. 
But here we got a multiuser throughput with 6 Cluster that is easily 3 times that of the single server, even with a cluster-unfriendly workload. See, sometimes scaling out even within a shared memory multiprocessor pays! Still, what we saw was rather anomalous. Over the years we have looked at performance any number of times and have a lot of built-in meters. For cases of high CPU with no throughput, the prime suspect is contention on critical sections. Quite right, when building with the mutex meter enabled, counting how many times each mutex is acquired and how many times this results in a wait, we found a mutex which gets acquired 600M times in the run, of which an insane 450M result in a wait. One can count a microsecond of real time each time a mutex wait results in the kernel switching tasks. The run took 500 s or so, of which 450 s of real time were attributable to the overhead of waiting for this one mutex. Waiting for a mutex is a real train wreck. We have tried spinning a few times before it, which the OS does anyhow, but this does not help. Using spin locks is good only if waits are extremely rare; with any frequency of waiting, even for very short waits, a mutex is still a lot better. Now, the mutex in question happens to serialize the buffer cache for one specific page of data, one level down from the root of the index for RDF PSOG. By the luck of the draw, the Ps falling on that page are commonly accessed Ps pertaining to product features. In order to get any product feature value, one must pass via this page. At the smaller scale, the different properties web their different ways based on the index root. One might here ask why the problem is one level down from the root and not in the root. The index root is already handled specially, so the read-write locks for buffers usually apply only for the first level down. One might also ask why have a mutex in the first place. 
Well, unless one is read-only and all in memory, there simply must be a way to say that a buffer must not get written to by one thread while another is reading it. Same for cache replacement. Some in-memory people fork a whole copy of the database process to do a large query and so can forget about serialization. But one must have long queries for this and have all in memory. One can make writes less frequent by keeping deltas, but this does not remove the need to merge the deltas at some point, which cannot happen without serializing this with the readers. Most of the time the offending mutex is acquired for getting a property of a product in Q5, the one that looks for products with similar values of a numeric property. We retrieve this property for a number of products in one go, due to vectoring. Vectoring is supposed to save us from constantly hitting the index tree top when getting the next match. So how come there is contention in the index tree top? As it happens, the vectored index lookup checks for locality only when all search conditions on key parts are equalities. Here however there is equality on P and S and a range on O; hence, the lookup starts from the index root every time. So I changed this. The effect was Q5 getting over twice as fast, with the single user throughput at 1000 Mt going from 2000 to 5200 QMpH (Query Mixes per Hour) and the 16-user throughput going from 3800 to over 21000 QMpH. The previously &quot;good&quot; throughput of 40K QMpH at 100 Mt went to 66K QMpH. Vectoring can make a real difference. The throughputs for the same workload on 6 Single, without vectoring, thus unavoidably hitting the page with the crazy contention, are 1770 QMpH single user and 2487 QMpH with 16 users. The 6 Cluster throughput, avoiding the contention but without the increased locality from vectoring and with the increased latency of going out-of-process for most of the data, was about 11.5K QMpH with 16 users. 
Each partition had a page getting the hits but since the partitioning was on S and S was about-evenly distributed, each partition got 1/8 of the load; thus waiting on the mutex did not become a killer issue. We see how detailed analysis of benchmarks can lead to almost an order of magnitude improvements in a short time. This analysis is however both difficult and tedious. It is not readily delegable; one needs real knowledge of how things work and of how they ought to work in order to get anywhere with this. Experience tends to show that a competitive situation is needed in order to motivate one to go to the trouble. Unless something really sticks out in an obvious manner, one is most likely not going to look deep enough. Of course, this is seen in applications too but application optimization tends to stop at a point where the application is usable. Also stored procedures and specially-tweaked queries will usually help. In most application scenarios, we are not simultaneously looking at multiple different implementations, except maybe at the start of development but then this falls under benchmarking and evaluation. So, the usefulness of benchmarks is again confirmed. There is likely great unexplored space for improvement as we move to more interesting and diverse scenarios. Benchmarks, Redux Series Benchmarks, Redux (part 1): On RDF Benchmarks Benchmarks, Redux (part 2): A Benchmarking Story (this post) Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs Benchmarks, Redux (part 6): BSBM and I/O, continued Benchmarks, Redux (part 7): What Does BSBM Explore Measure? 
Benchmarks, Redux (part 8): BSBM Explore and Update Benchmarks, Redux (part 9): BSBM With Cluster Benchmarks, Redux (part 10): LOD2 and the Benchmark Process Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks Benchmarks, Redux (part 12): Our Own BSBM Results Report Benchmarks, Redux (part 13): BSBM BI Modifications Benchmarks, Redux (part 14): BSBM BI Mix Benchmarks, Redux (part 15): BSBM Test Driver Enhancements</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<blockquote>
<i>Caeterum censeo, benchmarks are for vendors...</i>
</blockquote>

<p>This is an edifying story about benchmarks and how databases work. I will show how one detail makes a 5+x difference, and how one really must understand how things work in order to make sense of benchmarks.</p>

<p>We begin right after the publication of the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/V6/index.html" id="link-id0x1df843f8">recent Berlin report</a>. This report gives us OK performance for queries and very bad performance for loading. Trickle updates were not measurable. This comes as a consequence of testing intermediate software cuts and having incomplete instructions for operating them. I will cover the whole <a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x1d0b6ea0">BSBM</a> matter and the general benchmarking question in forthcoming posts; for now, let&#39;s talk about specifics.</p>

<p>In the course of the discussion to follow, we talk about 3 different kinds of <a class="auto-href" href="http://virtuoso.openlinksw.com" id="link-id0x1e09ee88">Virtuoso</a>:</p>

<ul>
 <li>
  <p>
    <i>6 Single</i> is the generally available single-instance-server configuration of Virtuoso.  Whether this is open source or not does not make a difference.</p>
 </li>
<li>
  <p>
    <i>6 Cluster</i> is the generally available, commercial-only, cluster-capable Virtuoso.</p>
</li>
<li>
  <p>
    <i>7 Single</i> is the next-generation single-instance-server Virtuoso, about to be released as a preview.</p>
</li>
</ul>


<p>We began by running the various parts of BSBM at different scales with different Virtuoso variants. In so doing, we noticed that the BSBM <i>Explore</i> mix at one scale got better throughput as we added more clients, approximately as one would expect based on <a class="auto-href" href="http://dbpedia.org/resource/Central_processing_unit" id="link-id0x1c1b4860">CPU</a> usage and number of cores, while at another scale this was not so.</p>

<p>At the 1-billion-triple scale (1000 Mt; 1 Mt = 1 Megatriple, or one million triples) we saw CPU going from 200% with 1 client to 1400% with 16 clients but throughput increased by less than 20%. </p>

<p>When we ran the same scale with our shared-nothing 6 Cluster, running 8 processes on the same box, throughput increased normally with the client count. We have not previously tried BSBM with 6 Cluster simply because there is little to gain and a lot to lose by distributing this workload. But here we got a multiuser throughput with 6 Cluster that is easily 3 times that of the single server, even with a cluster-unfriendly workload. </p>

<p> See, sometimes scaling out even within a shared memory multiprocessor pays! Still, what we saw was rather anomalous.</p>

<p>Over the years we have looked at performance any number of times and have a lot of built-in meters. For cases of high CPU with no throughput, the prime suspect is contention on critical sections. Sure enough, when we built with the mutex meter enabled, which counts how many times each mutex is acquired and how many of those acquisitions result in a wait, we found a mutex that was acquired 600M times in the run, of which an insane 450M resulted in a wait. One can count a microsecond of real time each time a mutex wait results in the kernel switching tasks. The run took 500 s or so, of which 450 s of real time were attributable to the overhead of waiting for this one mutex.</p>
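
<p>To make the idea of a mutex meter concrete, here is a minimal Python sketch. It is not Virtuoso's meter: the class <code>MeteredMutex</code> and the helper <code>estimated_wait_seconds</code> are hypothetical names for this example. The lock tries a non-blocking acquire first and counts a wait whenever that fails; the helper then applies the one-microsecond-per-wait rule of thumb from the text.</p>

```python
import threading

class MeteredMutex:
    """A lock that counts acquisitions and how many of them had to wait."""

    def __init__(self):
        self._lock = threading.Lock()
        self.acquired = 0
        self.waits = 0

    def __enter__(self):
        # Try without blocking first; failure means another thread holds the lock.
        had_to_wait = not self._lock.acquire(blocking=False)
        if had_to_wait:
            self._lock.acquire()
        # The counters are only updated while the lock is held.
        self.acquired += 1
        self.waits += int(had_to_wait)
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        return False

def estimated_wait_seconds(waits, microseconds_per_wait=1):
    """Real time lost to waiting, at a fixed assumed cost per wait."""
    return waits * microseconds_per_wait / 1_000_000

# Four threads hammering one metered lock.
mutex = MeteredMutex()
totals = {"n": 0}

def worker():
    for _ in range(10_000):
        with mutex:
            totals["n"] += 1

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(mutex.acquired, "acquisitions,", mutex.waits, "waits")

# Figures from the run described above: 600M acquisitions, 450M waits.
print(estimated_wait_seconds(450_000_000), "seconds attributed to waiting")
print(450_000_000 / 600_000_000, "of acquisitions waited")
```

<p>The number of waits seen in the small threaded demo varies from run to run. The arithmetic on the published counts does not: 450M waits at a microsecond each account for 450 s of a run that took about 500 s.</p>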

<p>Waiting for a mutex is a real train wreck. We have tried spinning a few times before it, which the OS does anyhow, but this does not help. Using spin locks is good only if waits are extremely rare; with any frequency of waiting, even for very short waits, a mutex is still a lot better.</p>

<p>Now, the mutex in question happens to serialize the buffer <a class="auto-href" href="http://dbpedia.org/resource/Cache" id="link-id0x1e542088">cache</a> for one specific page of <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x17c853d8">data</a>, one level down from the root of the index for <a class="auto-href" href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1d64a658">RDF</a> PSOG. By the luck of the draw, the Ps falling on that page are commonly accessed Ps pertaining to product features. In order to get any product feature value, one must pass via this page. At the smaller scale, the different properties wend their different ways starting from the index root.</p>

<p>One might ask here why the problem is one level down from the root and not in the root. The index root is already handled specially, so the read-write locks for buffers usually apply only from the first level down. One might also ask why there is a mutex in the first place. Well, unless one is read-only and all in memory, there simply must be a way to say that a buffer must not get written to by one thread while another is reading it. The same goes for cache replacement. Some in-memory people fork a whole copy of the database process to do a large query and so can forget about serialization. But one must have long queries for this and have everything in memory. One can make writes less frequent by keeping deltas, but this does not remove the need to merge the deltas at some point, which cannot happen without serializing the merge with the readers.</p>

<p>Most of the time the offending mutex is acquired for getting a property of a product in Q5, the one that looks for products with similar values of a numeric property. We retrieve this property for a number of products in one go, due to vectoring. Vectoring is supposed to save us from constantly hitting the index tree top when getting the next match. So how come there is contention in the index tree top? As it happens, the vectored index lookup checks for locality only when all search conditions on key parts are equalities. Here however there is equality on P and S and a range on O; hence, the lookup starts from the index root every time.</p>
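
<p>The effect of the missing locality check can be modeled with a toy two-level index. This is a sketch under simplifying assumptions, not Virtuoso's data structures: <code>TwoLevelIndex</code> and its <code>check_locality</code> flag are hypothetical, and a visit counter stands in for acquiring the contended mutex on the upper page.</p>

```python
from bisect import bisect_right

class TwoLevelIndex:
    """Sorted keys split into fixed-size leaf pages under a single upper page."""

    def __init__(self, keys, page_size):
        keys = sorted(keys)
        self.leaves = [keys[i:i + page_size] for i in range(0, len(keys), page_size)]
        self.first_keys = [leaf[0] for leaf in self.leaves]
        self.upper_page_visits = 0   # stand-in for taking the contended mutex

    def _descend(self, key):
        """Find the leaf for a key by going through the upper page."""
        self.upper_page_visits += 1
        return max(bisect_right(self.first_keys, key) - 1, 0)

    def lookup_batch(self, keys, check_locality):
        """Return the leaf number for each key, visiting keys in sorted order."""
        found = {}
        leaf_no = None
        for key in sorted(keys):
            on_same_leaf = False
            if check_locality and leaf_no is not None:
                leaf = self.leaves[leaf_no]
                on_same_leaf = leaf[0] <= key <= leaf[-1]
            if not on_same_leaf:
                leaf_no = self._descend(key)
            found[key] = leaf_no
        return found

with_check = TwoLevelIndex(range(10_000), page_size=100)
without_check = TwoLevelIndex(range(10_000), page_size=100)
batch = list(range(4200, 4450, 5))   # 50 nearby keys spread over three leaves

same = with_check.lookup_batch(batch, check_locality=True)
fresh = without_check.lookup_batch(batch, check_locality=False)
assert same == fresh
print("upper page visits with locality check:   ", with_check.upper_page_visits)
print("upper page visits without locality check:", without_check.upper_page_visits)
```

<p>Both variants return the same leaves. The difference is how often the shared upper page is touched: once per leaf with the check, once per key without it.</p>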

<p>So I changed this. The effect was Q5 getting over twice as fast, with the single user throughput at 1000 Mt going from 2000 to 5200 QMpH (Query Mixes per Hour) and the 16-user throughput going from 3800 to over 21000 QMpH. The previously &quot;good&quot; throughput of 40K QMpH at 100 Mt went to 66K QMpH. </p>

<p>Vectoring can make a real difference. The throughputs for the same workload on 6 Single, without vectoring, thus unavoidably hitting the page with the crazy contention, are 1770 QMpH single user and 2487 QMpH with 16 users. The 6 Cluster throughput, avoiding the contention but without the increased locality from vectoring and with the increased latency of going out-of-process for most of the data, was about 11.5K QMpH with 16 users. Each partition had a page getting the hits but since the partitioning was on S and S was about-evenly distributed, each partition got 1/8 of the load; thus waiting on the mutex did not become a killer issue. </p>

<p>We see how detailed analysis of benchmarks can lead to improvements of almost an order of magnitude in a short time. This analysis is, however, both difficult and tedious. It is not readily delegable; one needs real <a class="auto-href" href="http://dbpedia.org/resource/Knowledge" id="link-id0x1e7249d0">knowledge</a> of how things work and of how they ought to work in order to get anywhere with this. Experience tends to show that a competitive situation is needed in order to motivate one to go to the trouble. Unless something really sticks out in an obvious manner, one is most likely not going to look deep enough. Of course, this is seen in applications too, but application <a class="auto-href" href="http://dbpedia.org/resource/Program_optimization" id="link-id0x1d429e80">optimization</a> tends to stop at a point where the application is usable. Also, stored procedures and specially tweaked queries will usually help. In most application scenarios, we are not simultaneously looking at multiple different implementations, except maybe at the start of development, but then this falls under benchmarking and evaluation.</p>

<p>So, the usefulness of benchmarks is again confirmed. There is likely great unexplored space for improvement as we move to more interesting and diverse scenarios.</p>

<h3>
<i>Benchmarks, Redux</i> Series</h3>
<ul>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1658" id="link-id0x1f619550">Benchmarks, Redux (part 1): On RDF Benchmarks</a>
</li>
<li>Benchmarks, Redux (part 2): A Benchmarking Story <i>(this post)</i>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1663" id="link-id0x1caa7cd8">Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1665" id="link-id0x1d8b7648">Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1667" id="link-id0x1f2a6ba8">Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1669" id="link-id0x17b425f0">Benchmarks, Redux (part 6): BSBM and I/O, continued</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1671" id="link-id0x1a7f6b30">Benchmarks, Redux (part 7): What Does BSBM Explore Measure?</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1673" id="link-id0x1ee5ec98">Benchmarks, Redux (part 8): BSBM Explore and Update </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1675" id="link-id0x1b7c5af8">Benchmarks, Redux (part 9): BSBM With Cluster</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1677" id="link-id0x1dad7588">Benchmarks, Redux (part 10): LOD2 and the Benchmark Process</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1678" id="link-id0x1c5520a0">Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1eb19bf8">Benchmarks, Redux (part 12): Our Own BSBM Results Report</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1eb2c398">Benchmarks, Redux (part 13): BSBM BI Modifications </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1fb6a118">Benchmarks, Redux (part 14): BSBM BI Mix </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1f160580">Benchmarks, Redux (part 15): BSBM Test Driver Enhancements </a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-02-28#1659">
  <rss:title>Benchmarks, Redux (part 1): On RDF Benchmarks</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-02-28T20:20:22Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">This post introduces a series on RDF benchmarking. In these posts I will cover the following: Correct misleading information about us in the recent Berlin report: The load rate is off-the wall and the update mix is missing. We supply the right numbers and explain how to load things so that one gets decent performance. Discuss configuration options for Virtuoso. Tell a story about multithreading and its perils and how vectoring and scale-out can save us. Analyze the run time behavior of Virtuoso 6 Single, 6 Cluster, and 7 Single. Look at the benefits of SSDs (solid-state storage devices) over HDDs (hard disk devices; spinning platters), and I/O matters in general. Talk in general about modalities of benchmark running, and how to reconcile vendors doing what they know best with the air of legitimacy of a third party. Whether to do things a la TPC or a la TREC? We will hopefully try a bit of both, at least so I have proposed to our partners in LOD2, the EU FP7 that also funded the recent Berlin report. Outline the desiderata for an RDF benchmark that is not just an RDF-ized relational workload, the Social Intelligence Benchmark. Talk about BSBM in specific. What does it measure? Discuss some experiments with the BI use case of BSBM. Document how the results mentioned here were obtained and suggest practices for benchmark running and disclosure. The background is that the LOD2 FP7 project is supposed to deliver a report about the state of the art and benchmark laboratory by March 1. The Berlin report is a part thereof. In the project proposal we talk about an ongoing benchmarking activity and about having up-to-date installations of the relevant RDF stores and RDBMS. Since this is taxpayer money for supposedly the common good, I see no reason why such a useful thing should be restricted to the project participants. 
On the other hand, running a display window of stuff for benchmarking, when in at least in some cases licenses prohibit unauthorized publishing of benchmark results might be seen to conflict with the spirit of the license if not its letter. We will see. For now, my take is that we want to run benchmarks of all interesting software, inviting the vendors to tell us how to do that if they will, and maybe even letting them perform those runs themselves. Then we promise not to disclose results without the vendor&#39;s permission. Access to the installations is limited to whoever operates the equipment. Configuration files and detailed hardware specs and such on the other hand will be made public. If a run is published, it will be with permission and in a format that includes full information for replicating the experiment. In the LOD2 proposal we also in so many words say that we will stretch the limits of the state of the art. This stretching is surely not limited to the project&#39;s own products but should also include the general benchmarking aspect. I will say with confidence that running single server benchmarks at a max 200 Mtriples of data is not stretching anything. So to ameliorate this situation, I thought to run the same at 10x the scale on a couple of large boxes we have access to. 1 and 2 billion triples are still comfortably single server scales. Then we could go for example to Giovanni&#39;s cluster at DERI and do 10 and 20 billion triples, this should fly reasonably on 8 or 16 nodes of the DERI gear. Or we might talk to SEALS who by now should have their own lab. Even Amazon EC2 might be an option, although not the preferred one. So I asked everybody about config instructions, which produced a certain amount of dismay as I might be said to be biased and to be skirting the edges of conflict of interest. The inquiry was not altogether negative though since Ontotext and Garlik provided some information. We will look into these this and next week. 
We will not publish any information without asking first. In this series of posts I will only talk about OpenLink Software. Benchmarks, Redux Series Benchmarks, Redux (part 1): On RDF Benchmarks (this post) Benchmarks, Redux (part 2): A Benchmarking Story Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs Benchmarks, Redux (part 6): BSBM and I/O, continued Benchmarks, Redux (part 7): What Does BSBM Explore Measure? Benchmarks, Redux (part 8): BSBM Explore and Update Benchmarks, Redux (part 9): BSBM With Cluster Benchmarks, Redux (part 10): LOD2 and the Benchmark Process Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks Benchmarks, Redux (part 12): Our Own BSBM Results Report Benchmarks, Redux (part 13): BSBM BI Modifications Benchmarks, Redux (part 14): BSBM BI Mix Benchmarks, Redux (part 15): BSBM Test Driver Enhancements</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>This post introduces a series on <a class="auto-href" href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1e724ae0">RDF</a> benchmarking. In these posts I will cover the following:</p>

<ul>
 <li>
  <p>Correct misleading <a class="auto-href" href="http://dbpedia.org/resource/Information" id="link-id0x1e325480">information</a> about us in the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/V6/index.html" id="link-id0x1ded41d0">recent Berlin report</a>: The load rate is off the wall and the update mix is missing. We supply the right numbers and explain how to load things so that one gets decent performance.</p>
 </li>

 <li>
  <p>Discuss configuration options for <a class="auto-href" href="http://virtuoso.openlinksw.com" id="link-id0x1e0a2548">Virtuoso</a>.</p>
</li>

 <li>
  <p>Tell a story about multithreading and its perils and how vectoring and scale-out can save us.</p>
</li>

 <li>
  <p>Analyze the run time behavior of Virtuoso 6 Single, 6 Cluster, and 7 Single.</p>
</li>

 <li>
  <p>Look at the benefits of SSDs (solid-state storage devices) over HDDs (hard disk devices; spinning platters), and I/O matters in general.</p>
</li>

 <li>
  <p>Talk in general about modalities of benchmark running, and how to reconcile vendors doing what they know best with the air of legitimacy of a third party. Should we do things a la <a class="auto-href" href="http://www.tpc.org/" id="link-id0x1e0ef4f0">TPC</a> or a la TREC? We will hopefully try a bit of both, at least so I have proposed to our partners in <a class="auto-href" href="http://lod2.eu/" id="link-id0x1e54d3d8">LOD2</a>, the EU FP7 project that also funded the recent Berlin report.</p>
</li>

 <li>
  <p>Outline the desiderata for an RDF benchmark that is not just an RDF-ized relational workload, the Social Intelligence Benchmark.</p>
</li>

 <li>
  <p>Talk about <a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x1e730bc8">BSBM</a> specifically. What does it measure?</p>
</li>

 <li>
  <p>Discuss some experiments with the BI use case of BSBM.</p>
</li>

 <li>
  <p>Document how the results mentioned here were obtained and suggest practices for benchmark running and disclosure.</p>
</li>
</ul>

<p>The background is that the LOD2 FP7 project is supposed to deliver a report about the state of the art and benchmark laboratory by March 1. The Berlin report is a part thereof. In the project proposal we talk about an ongoing benchmarking activity and about having up-to-date installations of the relevant RDF stores and <a class="auto-href" href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x1c1551e0">RDBMS</a>.</p>

<p>Since this is taxpayer money for supposedly the common good, I see no reason why such a useful thing should be restricted to the project participants. On the other hand, running a display window of stuff for benchmarking, when in at least some cases licenses prohibit unauthorized publishing of benchmark results, might be seen to conflict with the spirit of the license if not its letter. We will see.</p>

<p>For now, my take is that we want to run benchmarks of all interesting software, inviting the vendors to tell us how to do that if they will, and maybe even letting them perform those runs themselves. Then we promise not to disclose results without the vendor&#39;s permission. Access to the installations is limited to whoever operates the equipment. Configuration files and detailed hardware specs and such on the other hand will be made public. If a run is published, it will be with permission and in a format that includes full information for replicating the experiment.</p>

<p>In the LOD2 proposal we also say in so many words that we will stretch the limits of the state of the art. This stretching is surely not limited to the project&#39;s own products but should also include the general benchmarking aspect. I will say with confidence that running single server benchmarks at a maximum of 200 Mtriples of <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x11327f10">data</a> is not stretching anything.</p>

<p>So to ameliorate this situation, I thought to run the same at 10x the scale on a couple of large boxes we have access to. 1 and 2 billion triples are still comfortably single server scales. Then we could go, for example, to Giovanni&#39;s cluster at <a class="auto-href" href="http://dbpedia.org/resource/Digital_Enterprise_Research_Institute" id="link-id0x1bfaffa0">DERI</a> and do 10 and 20 billion triples; this should fly reasonably on 8 or 16 nodes of the DERI gear. Or we might talk to SEALS, who by now should have their own lab. Even Amazon <a class="auto-href" href="http://aws.amazon.com/ec2/" id="link-id0x1bfafef8">EC2</a> might be an option, although not the preferred one.</p>

<p>So I asked everybody about config instructions, which produced a certain amount of dismay as I might be said to be biased and to be skirting the edges of conflict of interest. The inquiry was not altogether negative though since <a class="auto-href" href="http://dbpedia.org/resource/Ontotext" id="link-id0x1eccc1e0">Ontotext</a> and <a class="auto-href" href="http://freebase.com/guid/9202a8c04000641f8000000005c908d6" id="link-id0x1eccc208">Garlik</a> provided some information. We will look into these this and next week. We will not publish any information without asking first.</p>

<p>In this series of posts I will only talk about <a class="auto-href" href="http://www.openlinksw.com/dataspace/organization/openlink#this" id="link-id0x1bfa4030">OpenLink Software</a>.</p>

<h3>
<i>Benchmarks, Redux</i> Series</h3>
<ul>
<li>Benchmarks, Redux (part 1): On RDF Benchmarks <i>(this post)</i>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1660" id="link-id0x1b668d10">Benchmarks, Redux (part 2): A Benchmarking Story</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1663" id="link-id0x1b3a0c08">Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1665" id="link-id0x1f9f1740">Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1667" id="link-id0x1ad929f8">Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1669" id="link-id0x1db437c0">Benchmarks, Redux (part 6): BSBM and I/O, continued</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1671" id="link-id0x17138c38">Benchmarks, Redux (part 7): What Does BSBM Explore Measure?</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1673" id="link-id0x1c0e74f8">Benchmarks, Redux (part 8): BSBM Explore and Update </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1675" id="link-id0x1f297d10">Benchmarks, Redux (part 9): BSBM With Cluster</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1677" id="link-id0x1e4994b8">Benchmarks, Redux (part 10): LOD2 and the Benchmark Process</a>
</li>
<li>
 <a href="http://www.openlinksw.com/weblog/oerling/?id=1678" id="link-id0x1ebea6d0">Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1d5c86c0">Benchmarks, Redux (part 12): Our Own BSBM Results Report</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1efec0e0">Benchmarks, Redux (part 13): BSBM BI Modifications </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1a9941f8">Benchmarks, Redux (part 14): BSBM BI Mix </a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=" id="link-id0x1ea26de8">Benchmarks, Redux (part 15): BSBM Test Driver Enhancements </a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2011-01-19#1650">
  <rss:title>Virtuoso Directions for 2011</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2011-01-19T16:29:37Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">At the start of 2010, I wrote that 2010 would be the year when RDF became performance- and cost-competitive with relational technology for data warehousing and analytics. More specifically, RDF would shine where data was heterogenous and/or where there was a high frequency of schema change. I will now discuss what we have done towards this end in 2010 and how you will gain by this in 2011. At the start of 2010, we had internally demonstrated 4x space efficiency gains from column-wise compression and 3x loop join speed gains from vectored execution. To recap, column-wise compression means a column-wise storage layout where values of consecutive rows of a single column are consecutive in memory/disk and are compressed in a manner that benefits from the homogenous data type and possible sort order of the column. Vectored execution means passing large numbers of query variable bindings between query operators and possibly sorting inputs to joins for improving locality. Furthermore, always operating on large sets of values gives extra opportunities for parallelism, from instruction level to threads to scale out. So, during 2010, we integrated these technologies into Virtuoso, for relational- and graph-based applications alike. Further, even if we say that RDF will be close to relational speed in Virtuoso, the point is moot if Virtuoso&#39;s relational speed is not up there with the best of analytics-oriented RDBMS. RDF performance does rest on the basis of general-purpose database performance; what is sauce for the goose is sauce for the gander. So we reimplemented HASH JOIN and GROUP BY, and fine-tuned many of the tricks required by TPC-H. TPC-H is not the sole final destination, but it is a step on the way and a valuable checklist for what a database ought to do. At the Semdata workshop of VLDB 2010 we presented some results of our column store applied to RDF and relational tasks. 
As noted in the paper, the implementation did demonstrate significant gains over the previous row-wise architecture but was not yet well optimized, so not ready to be compared with the best of the relational analytics world. A good part of the fall of 2010 went into optimizing the column store and completing functionality such as transaction support with columns. A lot of this work is not specifically RDF oriented, but all of this work is constantly informed by the specific requirements of RDF. For example, the general idea of vectored execution is to eliminate overheads and optimize CPU cache and other locality by doing single query operations on arrays of operands so that the whole batch runs more or less in CPU cache. Are the gains not lost if data is typed at run time, as in RDF? In fact, the cost of run-time-typing turns out to be small, since data in practice tends to be of homogenous type and with locality of reference in values. Virtuoso&#39;s column store implementation resembles in broad outline other column stores like Vertica or VectorWise, the main difference being the built-in support for run-time heterogenous types. The LOD2 EU FP 7 project started in September 2010. In this project OpenLink and the celebrated heroes of the column store, CWI of MonetDB and VectorWise fame, represent the database side. The first database task of LOD2 is making a survey of the state of the art and a round of benchmarking of RDF stores. The Berlin SPARQL Benchmark (BSBM) has accordingly evolved to include a business intelligence section and an update stream. Initial results from running these will become available in February/March, 2011. The specifics of this process merit another post; let it for now be said that benchmarking is making progress. In the end, it is our conviction that we need a situation where vendors may publish results as and when they are available and where there exists a well defined process for documenting and checking results. 
LOD2 will continue by linking the universe, as I half-facetiously put it on a presentation slide. This means alignment of anything from schema to instance identifiers, with and without supervision, and always with provenance, summarization, visualization, and so forth. In fact, putting it this way, this gets to sound like the old chimera of generating applications from data or allowing users to derive actionable intelligence from data of which they do not even know the structure. No, we are not that unrealistic. But we are moving toward more ad-hoc discovery and faster time to answer. And since we provide an infrastructure element under all this, we want to do away with the &quot;RDF tax,&quot; by which we mean any significant extra cost of RDF compared to an alternate technology. To put it another way, you ought to pay for unpredictable heterogeneity or complex inference only when you actually use them, not as a fixed up-front overhead. So much for promises. When will you see something? It is safe to say that we cannot very well publish benchmarks of systems that are not generally available in some form. This places an initial technology preview cut of Virtuoso 7 with vectored execution somewhere in January or early February. The column store feature will be built in, but more than likely the row-wise compressed RDF format of Virtuoso 6 will still be the default. Version 6 and 7 databases will be interchangeable unless column-store structures are used. For now, our priority is to release the substantial gains that have already been accomplished. After an initial preview cut, we will return to the agenda of making sure Virtuoso is up there with the best in relational analytics, and that the equivalent workload with an RDF data model runs as close as possible to relational performance. As a first step this means taking TPC-H as is, and then converting the data and queries to the trivially equivalent RDF and SPARQL and seeing how it goes. 
In the September paper we dabbled a little with the data at a small scale but now we must run the full set of queries at 100GB and 300GB scales, which come to about 14 billion and 42 billion triples, respectively. A well done analysis of the issues encountered, covering similarities and dissimilarities of the implementation of the workload as SQL and SPARQL, should make a good VLDB paper. Database performance is an entirely open-ended quest and the bag of potentially applicable tricks is as good as infinite. Having said this, it seems that the scales comfortably reached in the TPC benchmarks are more than adequate for pretty much anything one is likely to encounter in real world applications involving comparable workloads. Businesses getting over 6 million new order transactions per minute (the high score of TPC-C) or analyzing a warehouse of 60 billion orders shipped to 6 billion customers over 7 years (10000GB or 10TB TPC-H) are not very common if they exist at all. The real world frontier has moved on. Scaling up the TPC workloads remains a generally useful exercise that continues to contribute to the state of the art but the applications requiring this advance are changing. Someone once said that for a new technology to become mainstream, it needs to solve a new class of problem. Yes, while it is a preparatory step to run TPC-H translated to SPARQL without dying of overheads, there is little point in doing this in production since SQL is anyway likely better and already known, proven, and deployed. The new class of problem, as LOD2 sees it, is the matter of web-wide cross-organizational data integration. Web-wide does not necessarily mean crawling the whole web, but does tend to mean running into significant heterogeneity of sources, both in terms of modeling and in terms of usage of more-or-less standard data models. Around this topic we hear two messages. 
The database people say that inference beyond what you can express in SQL views is theoretically nice but practically not needed; on the other side, we hear that the inference now being standardized in efforts like RIF and OWL is not expressive enough for the real world. As one expert put it, if enterprise data integration in the 1980s was between a few databases, today it is more like between 1000 databases, which makes this matter similar to searching the web. How can one know in such a situation that the data being aggregated is in fact meaningfully aggregate-able? Add to this the prevalence of unstructured data in the world and the need to mine it for actionable intelligence. Think of combining data from CRM, worldwide media coverage of own and competitive brands, and in-house emails for assessing organizational response to events on the market. These are the actual use cases for which we need RDF at relational DW performance and scale. This is not limited to RDF and OWL profiles, since we fully believe that inference needs are more diverse. The reason why this is RDF and not SQL plus some extension of Datalog, is the widespread adoption of RDF and linked data as a data publishing format, with all the schema-last and open world aspects that have been there from the start. Stay tuned for more news later this month! Related Linked Data and Virtuoso in 2010 Linked Data &amp; The Year 2009 Retrospective and Outlook for 2008</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>
<a href="http://www.openlinksw.com/weblog/oerling/?id=1603" id="link-id0x1d584720">At the start of 2010, I wrote</a> that 2010 would be the year when <a class="auto-href" href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x2007b778">RDF</a> became performance- and cost-competitive with relational technology for <a class="auto-href" href="http://dbpedia.org/resource/Data" id="link-id0x7f5bf68">data</a> warehousing and analytics. More specifically, RDF would shine where data was heterogeneous and/or where there was a high frequency of <a class="auto-href" href="http://dbpedia.org/resource/Database_schema" id="link-id0x1ffa18b0">schema</a> change.</p>

<p>I will now discuss what we have done towards this end in 2010 and how you will gain by this in 2011.</p>

<p>At the start of 2010, we had internally demonstrated 4x space efficiency gains from column-wise compression and 3x loop join speed gains from vectored execution. To recap, <i>column-wise compression</i> means a column-wise storage layout where values of consecutive rows of a single column are consecutive in memory/disk and are compressed in a manner that benefits from the homogeneous data type and possible sort order of the column. <i>Vectored execution</i> means passing large numbers of query variable bindings between query operators and possibly sorting inputs to joins for improving locality. Furthermore, always operating on large sets of values gives extra opportunities for parallelism, from instruction level to threads to scale out.</p>
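
<p>As a rough illustration of the two ideas, here is a minimal sketch. Run-length encoding is used only as one simple compression scheme that benefits from sort order; which schemes Virtuoso actually applies is not stated here. The sample rows and the names <code>to_columns</code>, <code>rle_encode</code>, <code>rle_decode</code>, and <code>lookup_vectored</code> are hypothetical.</p>

```python
from itertools import groupby

# Hypothetical (P, S, O) rows, sorted on P then S.
rows = [
    ("price", 1, 10),
    ("price", 2, 12),
    ("price", 3, 12),
    ("type", 1, "Product"),
    ("type", 2, "Product"),
    ("type", 3, "Offer"),
]


def to_columns(rows):
    """Row-wise tuples -> one list per column (column-wise layout)."""
    return [list(col) for col in zip(*rows)]


def rle_encode(column):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


def lookup_vectored(prices, subjects):
    """One operator call takes a whole batch of bindings, sorted for locality."""
    return [(s, prices[s]) for s in sorted(subjects) if s in prices]


p_col, s_col, o_col = to_columns(rows)
p_runs = rle_encode(p_col)
print(p_runs)

prices = {s: o for p, s, o in rows if p == "price"}
batch = lookup_vectored(prices, [3, 1, 2])
print(batch)
```

<p>The sorted P column collapses to two runs, and the lookup operator receives its three bindings in one call rather than one call per binding. A real engine would of course work on far larger batches and on disk pages rather than Python lists.</p>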

<p>So, during 2010, we integrated these technologies into <a class="auto-href" href="http://virtuoso.openlinksw.com" id="link-id0x1fdf3f90">Virtuoso</a>, for relational- and graph-based applications alike. Further, even if we say that RDF will be close to relational speed in Virtuoso, the point is moot if Virtuoso&#39;s relational speed is not up there with the best of analytics-oriented <a class="auto-href" href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x7bf0d40">RDBMS</a>. RDF performance does rest on the basis of general-purpose database performance; what is sauce for the goose is sauce for the gander. So we reimplemented <code><a class="auto-href" href="http://dbpedia.org/resource/Hash_join" id="link-id0x7815c60">HASH JOIN</a></code> and <code>GROUP BY</code>, and fine-tuned many of the tricks required by <a class="auto-href" href="http://www.tpc.org/" id="link-id0x213d6de8">TPC</a>-H. <a class="auto-href" href="http://dbpedia.org/resource/TPC-H" id="link-id0x1fd92690">TPC-H</a> is not the sole final destination, but it is a step on the way and a valuable checklist for what a database ought to do.</p>

<p>At the Semdata workshop of <a class="auto-href" href="http://www.vldb2010.org/" id="link-id0x21178a50">VLDB 2010</a> <a href="http://www.openlinksw.com/weblog/oerling/?id=1632" id="link-id0x1de8fee8">we presented some results</a> of our column store applied to RDF and relational tasks. As noted in the paper, the implementation did demonstrate significant gains over the previous row-wise architecture but was not yet well optimized, so not ready to be compared with the best of the relational analytics world. A good part of the fall of 2010 went into optimizing the column store and completing functionality such as transaction support with columns.</p>

<p>A lot of this work is not specifically RDF oriented, but all of this work is constantly informed by the specific requirements of RDF. For example, the general idea of vectored execution is to eliminate overheads and optimize <a class="auto-href" href="http://dbpedia.org/resource/Central_processing_unit" id="link-id0x7ae0d58">CPU</a> <a class="auto-href" href="http://dbpedia.org/resource/Cache" id="link-id0x7bb7150">cache</a> and other locality by doing single query operations on arrays of operands so that the whole batch runs more or less in CPU cache. Are the gains not lost if data is typed at run time, as in RDF? In fact, the cost of run-time typing turns out to be small, since data in practice tends to be of homogeneous type and with locality of reference in values. Virtuoso&#39;s column store implementation resembles in broad outline other column stores like <a class="auto-href" href="http://www.vertica.com/" id="link-id0x7f61080">Vertica</a> or <a class="auto-href" href="http://www.ingres.com/vectorwise/" id="link-id0x2154ce38">VectorWise</a>, the main difference being the built-in support for run-time heterogeneous types.</p>

<p>The <a class="auto-href" href="http://lod2.eu/" id="link-id0x755e668">LOD2</a> EU FP7 project <a href="http://www.openlinksw.com/weblog/oerling/?id=1630" id="link-id0x1d8eaf28">started in September 2010</a>. In this project OpenLink and the celebrated heroes of the column store, <a class="auto-href" href="http://dbpedia.org/resource/National_Research_Institute_for_Mathematics_and_Computer_Science" id="link-id0x1feba470">CWI</a> of <a class="auto-href" href="http://dbpedia.org/resource/MonetDB" id="link-id0x223bbe70">MonetDB</a> and VectorWise fame, represent the database side.</p>

<p>The first database task of LOD2 is making a survey of the state of the art and a round of benchmarking of RDF stores. The <a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x20f50c20">Berlin SPARQL Benchmark</a> (<a class="auto-href" href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x780c430">BSBM</a>) has accordingly evolved to include a business intelligence section and an update stream. Initial results from running these will become available in February/March 2011. The specifics of this process merit another post; let it for now be said that benchmarking is making progress. In the end, it is our conviction that we need a situation where vendors may publish results as and when they are available and where there exists a well-defined process for documenting and checking results.</p>

<p>LOD2 will continue by <i>linking the universe,</i> as I half-facetiously put it on a presentation slide. This means alignment of anything from schema to instance identifiers, with and without supervision, and always with provenance, summarization, visualization, and so forth. In fact, putting it this way, this gets to sound like the old chimera of generating applications from data or allowing users to derive actionable intelligence from data of which they do not even know the structure. No, we are not that unrealistic. But we are moving toward more ad-hoc discovery and faster time to answer. And since we provide an infrastructure element under all this, we want to do away with the &quot;RDF tax,&quot; by which we mean any significant extra cost of RDF compared to an alternate technology. To put it another way, you ought to pay for unpredictable heterogeneity or complex inference only when you actually use them, not as a fixed up-front overhead.</p>

<p>So much for promises. When will you see something? It is safe to say that we cannot very well publish benchmarks of systems that are not generally available in some form. This places an initial technology preview cut of Virtuoso 7 with vectored execution somewhere in January or early February. The column store feature will be built in, but more than likely the row-wise compressed RDF format of Virtuoso 6 will still be the default. Version 6 and 7 databases will be interchangeable unless column-store structures are used.</p>

<p>For now, our priority is to release the substantial gains that have already been accomplished.</p>

<p>After an initial preview cut, we will return to the agenda of making sure Virtuoso is up there with the best in relational analytics, and that the equivalent workload with an RDF data model runs as close as possible to relational performance. As a first step this means taking TPC-H as is, and then converting the data and queries to the trivially equivalent RDF and <a class="auto-href" href="http://dbpedia.org/resource/SPARQL" id="link-id0x25716618">SPARQL</a> and seeing how it goes. In <a href="http://www.openlinksw.com/weblog/oerling/?id=1627" id="link-id0x1af60d40">the September paper</a> we dabbled a little with the data at a small scale but now we must run the full set of queries at 100GB and 300GB scales, which come to about 14 billion and 42 billion triples, respectively. A well done analysis of the issues encountered, covering similarities and dissimilarities of the implementation of the workload as <a class="auto-href" href="http://dbpedia.org/resource/SQL" id="link-id0x223b0a88">SQL</a> and SPARQL, should make a good VLDB paper.</p>

<p>Database performance is an entirely open-ended quest and the bag of potentially applicable tricks is as good as infinite. Having said this, it seems that the scales comfortably reached in the TPC benchmarks are more than adequate for pretty much anything one is likely to encounter in real world applications involving comparable workloads. Businesses getting over 6 million new order transactions per minute (the high score of TPC-C) or analyzing a warehouse of 60 billion orders shipped to 6 billion customers over 7 years (10000GB or 10TB TPC-H) are not very common if they exist at all.</p>

<p>The real world frontier has moved on. Scaling up the TPC workloads remains a generally useful exercise that continues to contribute to the state of the art but the applications requiring this advance are changing.</p>

<p>Someone once said that for a new technology to become mainstream, it needs to solve a new class of problem. Running TPC-H translated to SPARQL without dying of overheads is a necessary preparatory step, but there is little point in doing this in production, since SQL is likely better anyway and is already known, proven, and deployed.</p>

<p>The new class of problem, as LOD2 sees it, is the matter of web-wide cross-organizational data integration. Web-wide does not necessarily mean crawling the whole web, but does tend to mean running into significant heterogeneity of sources, both in terms of modeling and in terms of usage of more-or-less standard data models. Around this topic we hear two messages. The database people say that inference beyond what you can express in SQL views is theoretically nice but practically not needed; on the other side, we hear that the inference now being standardized in efforts like <a class="auto-href" href="http://dbpedia.org/resource/Rule_Interchange_Format" id="link-id0x22b3ad68">RIF</a> and <a class="auto-href" href="http://dbpedia.org/resource/Web_Ontology_Language" id="link-id0x22b3ad90">OWL</a> is not expressive enough for the real world. As one expert put it, <i>if enterprise data integration in the 1980s was between a few databases, today it is more like between 1000 databases,</i> which makes this matter similar to searching the web. How can one know in such a situation that the data being aggregated is in fact meaningfully aggregate-able?</p>

<p>Add to this the prevalence of unstructured data in the world and the need to mine it for actionable intelligence. Think of combining data from CRM, worldwide media coverage of own and competitive brands, and in-house emails for assessing organizational response to events on the market.</p>

<p>These are the actual use cases for which we need RDF at relational DW performance and scale. This is not limited to RDF and OWL profiles, since we fully believe that inference needs are more diverse. The reason why this is RDF and not SQL plus some extension of <a class="auto-href" href="http://dbpedia.org/resource/Datalog" id="link-id0x7ee5130">Datalog</a> is the widespread adoption of RDF and <a class="auto-href" href="http://dbpedia.org/resource/Linked_Data" id="link-id0x2111f968">linked data</a> as a data publishing format, with all the schema-last and <a class="auto-href" href="http://dbpedia.org/resource/Open_world_assumption" id="link-id0x2111f990">open world</a> aspects that have been there from the start.</p>

<p>Stay tuned for more news later this month!</p>

<h3>Related</h3>
<ul>
 <li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1603" id="link-id0x1de6b370">Linked Data and Virtuoso in 2010</a>
 </li>
 <li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1510" id="link-id0x1b031180">Linked Data &amp; The Year 2009</a>
 </li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1286" id="link-id0x1a582d10">Retrospective and Outlook for 2008</a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-09-22#1638">
  <rss:title>Virtuoso 6.2 brings New Features!</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2010-09-22T21:08:24Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Virtuoso 6.2 introduces a large number of enhancements to areas including... Linked Data Deployment Linked Data Middleware Data Virtualization Dynamic Data Exchange &amp; Data Replication Security Linked Data Deployment Feature Description Benefit Automatic Deployment Linked Data Pages are now automatically published for every Virtuoso Data Object; users need only load their data into the RDF Quad Store. Handcrafted URL-Rewrite Rules are no longer necessary. HTTP Metadata Enhancements HTTP Link: header is used to transfer vital metadata (e.g., relationships between a Descriptor Resource and its Subject) from HTTP Servers to User Agents. Enables HTTP-oriented tools to work with such relationships and other metadata. HTML Metadata Embedding HTML resource &lt;head /&gt; and &lt;link /&gt; elements and their @rel attributes are used to transfer vital metadata (e.g., relationships between a Descriptor Resource and its Subject) from HTTP Servers to User Agents. Enables HTML-oriented tools to work with such relationships and other metadata. Hammer Stack Auto-Discovery Patterns HTML resource &lt;head /&gt; section and &lt;link /&gt; elements, the HTTP Link: header, and XRD-based &quot;host-meta&quot; resources collectively provide structured metadata about Virtuoso hosts, associated Linked Data Spaces, and specific Data Items (Entities). Enables humans and machines to easily distinguish between Descriptor Resources and their Subjects, irrespective of URI scheme. Linked Data Middleware Feature Description Benefit New Sponger Cartridges New cartridges (data access and transformation drivers) for Twitter, Facebook, Amazon, eBay, LinkedIn, and others. Enable users and user agents to deal with the Sponged data spaces as though they were named graphs in a quad store, or tables in an RDBMS. New Descriptor Pages HTML-based descriptor pages are automatically generated. 
Descriptor subjects, and the constellation of navigable attribute-and-value pairs that constitute their descriptive representation, are clearly identified. Automatic Subject Identifier Generation De-referenceable data object identifiers are automatically created. Removes tedium and risk of error associated with nuance-laced manual construction of identifiers. Support for OData, JSON, RDFa Additional data representation and serialization formats associated with Linked Data. Increases flexibility and interoperability. Data Virtualization Feature Description Benefit Materialized RDF Views RDF Views over ODBC/JDBC Data Sources can now (optionally) keep the Quad Store in sync with the RDBMS data source. Enables high-performance Faceted Browsing while remaining sensitive to changes in the RDBMS data sources. CSV-to-RDF Transformation Wizard-based generation of RDF Linked Data from CSV files. Speeds deployment of data which may only exist in CSV form as Linked Data. Transparent Data Access Binding SPASQL (SPARQL Query Language integrated into SQL) is usable over ODBC, JDBC, ADO.NET, OLEDB, or XMLA connections. Enables Desktop Productivity Tools to transparently work with any blend of RDBMS and RDF data sources. Dynamic Data Exchange &amp; Data Replication Feature Description Benefit Quad Store to Quad Store Replication High-fidelity graph-data replication between one or more database instances. Enables a wide variety of deployment topologies. Delta Engine Automated generation of deltas at the named-graph-level, matches transactional replication offered by the Virtuoso SQL engine. Brings RDF replication on par with SQL replication. PubSubHubbub Support Deep integration within Quad Store as an optional mechanism for shipping deltas. Enables push-based data replication across a variety of topologies. 
Security Feature Description Benefit WebID support at the DBMS core Use WebID protocol for low-level ACL-based protection of database objects (RDF or Relational) and Web Services. Enables application of sophisticated security and data access policies to Web Services (e.g., SPARQL endpoint) and actual DBMS objects. Webfinger Supports using mailto: and acct: URIs in the context of WebID and other mechanisms, when domain holders have published necessary XRDS resources. Enables more intuitive identification of people and organizations. Fingerpoint Similar to Webfinger but does not require XRDS resources; instead, it works directly with SPARQL endpoints exposed using auto-discovery patterns in the &lt;head /&gt; section of HTML documents. Enables more intuitive identification of people and organizations.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Virtuoso 6.2 introduces a large number of enhancements to areas including...</p>
<ul> <li>
  Linked Data Deployment
</li> 
<li>
  Linked Data Middleware</li> 
<li>
  Data Virtualization</li> 
<li>
  Dynamic Data Exchange &amp; Data Replication
</li> 
<li>
  Security</li>
</ul>  

<p> </p>
<h3>
<a name="LinkedDataDeployment" id="LinkedDataDeployment"></a> Linked Data Deployment</h3>
 
<table class="data" border="1" cellspacing="2" cellpadding="2">
 <tr>
<th id="0" width="15%">Feature</th>
<th id="1" width="42%">Description</th>
<th id="2">Benefit</th>
 </tr>
 <tr>
  <td align="center"> <b>Automatic Deployment</b>  </td>
  <td> Linked Data Pages are now automatically published for every Virtuoso Data Object; users need only load their data into the RDF Quad Store.   </td>
  <td> Handcrafted URL-Rewrite Rules are no longer necessary.  </td>
 </tr>
 <tr>
  <td align="center"> <b>HTTP Metadata Enhancements</b>  </td>
  <td> HTTP <code>Link:</code> header is used to transfer vital metadata (e.g., relationships between a Descriptor Resource and its Subject) from HTTP Servers to User Agents.  </td>
  <td> Enables HTTP-oriented tools to work with such relationships and other metadata.  </td>
 </tr>
 <tr>
  <td align="center"> <b>HTML Metadata Embedding</b>  </td>
  <td> HTML resource <code>&lt;head /&gt;</code> and <code>&lt;link  /&gt;</code> elements and their <code>@rel</code> attributes are used to transfer vital metadata  (e.g., relationships between a Descriptor Resource and its Subject) from HTTP Servers to User Agents.  </td>
  <td> Enables HTML-oriented tools to work with such relationships and other metadata.  </td>
 </tr>
 <tr>
  <td align="center"> <b>Hammer Stack Auto-Discovery Patterns</b>  </td>
  <td> HTML resource <code>&lt;head /&gt;</code> section and <code>&lt;link  /&gt;</code> elements, the HTTP <code>Link:</code> header, and XRD-based <code>&quot;host-meta&quot;</code> resources collectively provide structured metadata about Virtuoso hosts, associated Linked Data Spaces, and specific Data Items (Entities). </td>
  <td> Enables humans and machines to easily distinguish between Descriptor Resources and their Subjects, irrespective of URI scheme.  </td>
 </tr>
</table>

<h3>
<a name="LinkedDataMiddleware" id="LinkedDataMiddleware"></a> Linked Data Middleware</h3>
 
<table class="data" border="1" cellspacing="2" cellpadding="2">
<tr>
  <th id="3" width="15%">Feature</th>
<th id="4" width="42%">Description</th>
<th id="5">Benefit</th>
</tr>
<tr>
  <td align="center"> <b>New Sponger Cartridges</b>  </td>
  <td> New cartridges (data access and transformation drivers) for Twitter, Facebook, Amazon, eBay, LinkedIn, and others.   </td>
  <td> Enable users and user agents to deal with the Sponged data spaces as though they were named graphs in a quad store, or tables in an RDBMS.  </td>
</tr>

<tr>
  <td align="center"> <b>New Descriptor Pages</b>  </td>
  <td> HTML-based descriptor pages are automatically generated.  </td>
  <td> Descriptor subjects, and the constellation of navigable attribute-and-value pairs that constitute their descriptive representation, are clearly identified.  </td>
</tr>
<tr>
  <td align="center"> <b>Automatic Subject Identifier Generation</b>  </td>
  <td> De-referenceable data object identifiers are automatically created.  </td>
  <td> Removes tedium and risk of error associated with nuance-laced manual construction of identifiers. </td>
</tr>
<tr>
  <td align="center">  <b>Support for OData, JSON, RDFa</b>  </td>
  <td> Additional data representation and serialization formats associated with Linked Data.  </td>
  <td> Increases flexibility and interoperability. </td>
</tr>

</table>
<h3>
<a name="DataVirtualization" id="DataVirtualization"></a> Data Virtualization</h3>
 
<table class="data" border="1" cellspacing="2" cellpadding="2">
<tr>
  <th id="6" width="15%">Feature</th>
<th id="7" width="42%">Description</th>
<th id="8">Benefit</th>
</tr>
<tr>
  <td align="center"> <b>Materialized RDF Views</b>  </td>
  <td> RDF Views over ODBC/JDBC Data Sources can now (optionally) keep the Quad Store in sync with the RDBMS data source.  </td>
  <td> Enables high-performance Faceted Browsing while remaining sensitive to changes in the RDBMS data sources.  </td>
</tr>

<tr>
  <td align="center"> <b>CSV-to-RDF Transformation</b>  </td>
  <td> Wizard-based generation of RDF Linked Data from CSV files.  </td>
  <td> Speeds deployment of data which may only exist in CSV form as Linked Data.  </td>
</tr>
<tr>
  <td align="center"> <b>Transparent Data Access Binding</b>  </td>
  <td> SPASQL (SPARQL Query Language integrated into SQL) is usable over ODBC, JDBC, ADO.NET, OLEDB, or XMLA connections.  </td>
  <td> Enables Desktop Productivity Tools to transparently work with any blend of RDBMS and RDF data sources. </td>
</tr>
</table>

<h3>
<a name="DynamicDataExchangeDataReplication" id="DynamicDataExchangeDataReplication"></a> Dynamic Data Exchange &amp; Data Replication</h3>
 
<table class="data" border="1" cellspacing="2" cellpadding="2">
<tr>
  <th id="9" width="15%">Feature</th>
<th id="10" width="42%">Description</th>
<th id="11">Benefit</th>
</tr>
<tr>
  <td align="center"> <b>Quad Store to Quad Store Replication</b>  </td>
  <td> High-fidelity graph-data replication between two or more database instances.  </td>
  <td> Enables a wide variety of deployment topologies.  </td>
</tr>

<tr>
  <td align="center"> <b>Delta Engine</b>  </td>
  <td> Automated generation of deltas at the named-graph level, matching the transactional replication offered by the Virtuoso SQL engine.  </td>
  <td> Brings RDF replication on par with SQL replication. </td>
</tr>
<tr>
  <td align="center">  <b>PubSubHubbub Support</b>  </td>
  <td> Deep integration within Quad Store as an optional mechanism for shipping deltas.  </td>
  <td> Enables push-based data replication across a variety of topologies. </td>
</tr>
</table>

<h3>
<a name="Security" id="Security"></a> Security</h3>
 
<table class="data" border="1" cellspacing="2" cellpadding="2">
<tr>
  <th id="12" width="15%">Feature</th>
<th id="13" width="42%">Description</th>
<th id="14">Benefit</th>
</tr>
<tr>
  <td align="center">  <b>WebID support at the DBMS core</b>  </td>
  <td> Use WebID protocol for low-level ACL-based protection of database objects (RDF or Relational) and Web Services.  </td>
  <td> Enables application of sophisticated security and data access policies to Web Services (e.g., SPARQL endpoint) and actual DBMS objects. </td>
</tr>

<tr>
  <td align="center"> <b>Webfinger</b>  </td>
  <td> Supports using <code>mailto:</code> and <code>acct:</code> URIs in the context of WebID and other mechanisms, when domain holders have published necessary XRDS resources.  </td>
  <td> Enables more intuitive identification of people and organizations. </td>
</tr>
<tr>
  <td align="center"> <b>Fingerpoint</b>  </td>
  <td> Similar to Webfinger but does not require XRDS resources; instead, it works directly with SPARQL endpoints exposed using auto-discovery patterns in the <code>&lt;head  /&gt;</code> section of HTML documents.  </td>
  <td> Enables more intuitive identification of people and organizations. </td>
</tr>

</table>
<p> </p>
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-09-22#1637">
  <rss:title>The Business of Semantically Linked Data (&quot;SemData&quot;)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2010-09-22T18:20:56Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">I had the opportunity the other day to converse about the semantic technology business proposition in terms of business development. My interlocutor was a business development consultant who had little prior knowledge of this technology but a background in business development inside a large diversified enterprise. I will here recap some of the points discussed, since these can be of broader interest. Why is there no single dominant vendor? The field is young. We can take the relational database industry as a historical precedent. From the inception of the relational database around 1970, it took 15 years for the relational model to become mainstream. &quot;Mainstream&quot; here does not mean dominant in installed base, but does mean something that one tends to include as a component in new systems. The figure of 15 years might repeat with RDF, from around 2000 for the first beginnings to 2015 for routine inclusion in new systems, where applicable. This does not necessarily mean that the RDF graph data model (or more properly, EAV+CR; Entity-Attribute-Value + Classes and Relationships) will take the place of the RDBMS as the preferred data backbone. This could mean that RDF model serialization formats will be supported as data exchange mechanisms, and that systems will integrate data extracted by semantic technology from unstructured sources. Some degree of EAV storage is likely to be common, but on-line transactional data is guaranteed to stay pure relational, as EAV is suboptimal for OLTP. Analytics will see EAV alongside relational especially in applications where in-house data is being combined with large numbers of outside structured sources or with other open sources such as information extracted from the web. EAV offerings will become integrated by major DBMS vendors, as is already the case with Oracle. Specialized vendors will exist alongside these, just as is the case with relational databases. 
Can there be a positive reinforcement cycle (e.g., building cars creates a need for road construction, and better roads drive demand for more cars)? Or is this an up-front infrastructure investment that governments make for some future payoff or because of science-funding policies? The Document Web did not start as a government infrastructure initiative. The infrastructure was already built, albeit first originating with the US defense establishment. The Internet became ubiquitous through the adoption of the Web. The general public&#39;s adoption of the Web was bootstrapped by all major business and media adopting the Web. They did not adopt the web because they particularly liked it, as it was essentially a threat to the position of media and to the market dominance of big players who could afford massive advertising in this same media. Adopting the web became necessary because of the prohibitive opportunity cost of not adopting it. A similar process may take place with open data. For example, in E-commerce, vendors do not necessarily welcome easy-and-automatic machine-based comparison of their offerings against those of their competitors. Publishing data will however be necessary in order to be listed at all. Also, in social networks, we have the identity portability movement which strives to open the big social network silos. Data exchange via RDF serializations, as already supported in many places, is the natural enabling technology for this. Will the web of structured data parallel the development of web 2.0? Web 2.0 was about the blogosphere, exposure of web site service APIs, creation of affiliate programs, and so forth. If the Document Web was like a universal printing press, where anybody could publish at will, Web 2.0 was a newspaper, bringing the democratization of journalism, creating the blogger, the citizen journalist. 
The Data Web will create the Citizen Analyst, the Mini Media Mogul (e.g., social-network-driven coops comprised of citizen journalists, analysts, and other content providers such as video and audio producers and publishers). As the blogosphere became an alternative news source to the big media, the web of data may create an ecosystem of alternative data products. Analytics is no longer a government or big business only proposition. Is there a specifically semantic market or business model, or will semantic technology be exploited under established business models and merged as a component technology into existing offerings? We have seen a migration from capital expenses to operating expenses in the IT sector in general, as exemplified by cloud computing&#39;s Platform as a Service (PaaS) and Software as a Service (SaaS). It is reasonable to anticipate that this trend will continue to Data as a Service (DaaS). Microsoft Odata and Dallas are early examples of this and go towards legitimizing the data as service concept. DaaS is not related to semantic technology per se, but since this will involve integration of data, RDF serializations will be attractive, especially given the takeoff of linked data in general. The data models in Odata are also much like RDF, as both stem from EAV+CR, which makes for easy translation and a degree of inherent interoperability. The integration of semantic technology into existing web properties and business applications will manifest to the end user as increased serendipity. The systems will be able to provide more relevant and better contextualized data for the user&#39;s situation. This applies equally to the consumer and business user cases. Identity virtualization in the forms of WebID and Webfinger — making first-class de-referenceable identifiers of mailto: and acct: schemes — is emerging as a new way to open social network and Web 2.0 data silos. 
On the software production side, especially as concerns data integration, the increased schema- and inference-flexibility of EAV will lead to a quicker time to answer in many situations. The more complex the task or the more diverse the data, the higher the potential payoff. Data in cyberspace is mirroring the complexity and diversity of the real world, where heterogeneity and disparity are simply facts of life, and such flexibility is becoming an inescapable necessity.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>I had the opportunity the other day to converse about the semantic technology business proposition in terms of business development.  My interlocutor was a business development consultant who had little prior <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x245787f0">knowledge</a> of this technology but a background in business development inside a large diversified enterprise.</p>

<p>I will here recap some of the points discussed, since these can be of broader interest.</p>

<b><i>Why is there no single dominant vendor?</i></b>

<p>The field is young.  We can take the relational database industry as a historical precedent.  From the inception of the relational database around 1970, it took 15 years for the relational model to become mainstream. &quot;Mainstream&quot; here does not mean dominant in installed base, but does mean something that one tends to include as a component in new systems. The figure of 15 years might repeat with <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x25147d40">RDF</a>, from around 2000 for the first beginnings to 2015 for routine inclusion in new systems, where applicable.</p>

<p>This does not necessarily mean that the RDF graph <a href="http://dbpedia.org/resource/Data" id="link-id0x25325290">data</a> model (or more properly, <a href="http://dbpedia.org/resource/Entity-attribute-value_model" id="link-id0x23dc23b0">EAV</a>+CR; <a href="http://dbpedia.org/resource/Entity-attribute-value_model" id="link-id0x25d7c238">Entity</a>-Attribute-Value + Classes and Relationships) will take the place of the <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x268cd248">RDBMS</a> as the preferred data backbone.  This could mean that RDF model serialization formats will be supported as data exchange mechanisms, and that systems will integrate data extracted by semantic technology from unstructured sources.  Some degree of EAV storage is likely to be common, but on-line transactional data is guaranteed to stay pure relational, as EAV is suboptimal for OLTP.  Analytics will see EAV alongside relational especially in applications where in-house data is being combined with large numbers of outside structured sources or with other open sources such as <a href="http://dbpedia.org/resource/Information" id="link-id0x23ce37d8">information</a> extracted from the web.</p>

<p>EAV offerings will become integrated by major DBMS vendors, as is already the case with <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id0x26a17030">Oracle</a>.  Specialized vendors will exist alongside these, just as is the case with relational databases.</p>

<p>
 <b><i>Can there be a positive reinforcement cycle (e.g., building cars creates a need for road construction, and better roads drive demand for more cars)?  Or is this an up-front infrastructure investment that governments make for some future payoff or because of science-funding policies?</i>
 </b>
</p>

<p>The Document Web did not start as a government infrastructure initiative.  The infrastructure was already built, albeit first originating with the US defense establishment. The <a href="http://dbpedia.org/resource/Internet" id="link-id0x2551cf60">Internet</a> became ubiquitous through the adoption of the Web.  The general public&#39;s adoption of the Web was bootstrapped by all major business and media adopting the Web. They did not adopt the web because they particularly liked it, as it was essentially a threat to the position of media and to the market dominance of big players who could afford massive advertising in this same media. Adopting the web became necessary because of the prohibitive opportunity cost of <i>not</i> adopting it.</p>

<p>A similar process may take place with open data.  For example, in E-commerce, vendors do not necessarily welcome easy-and-automatic machine-based comparison of their offerings against those of their competitors.  Publishing data will however be necessary in order to be listed at all.  Also, in social networks, we have the identity portability movement which strives to open the big social network silos.  Data exchange via RDF serializations, as already supported in many places, is the natural enabling technology for this.</p>

<p>
 <b><i>Will the web of structured data parallel the development of web 2.0?</i>
 </b>
</p>

<p>Web 2.0 was about the blogosphere, exposure of web site service APIs, creation of affiliate programs, and so forth.  If the Document Web was like a universal printing press, where anybody could publish at will, Web 2.0 was a newspaper, bringing the democratization of journalism, creating the blogger, the citizen journalist.  The Data Web will create the Citizen Analyst, the Mini Media Mogul (e.g., social-network-driven coops comprised of citizen journalists, analysts, and other content providers such as video and audio producers and publishers).  As the blogosphere became an alternative news source to the big media, the web of data may create an ecosystem of alternative data products. Analytics is no longer a government or big business only proposition.</p>

<p>
 <b><i>Is there a specifically semantic market or business model, or will semantic technology be exploited under established business models and merged as a component technology into existing offerings?</i>
 </b>
</p>

<p>We have seen a migration from capital expenses to operating expenses in the IT sector in general, as exemplified by cloud computing&#39;s Platform as a Service (PaaS) and Software as a Service (SaaS).  It is reasonable to anticipate that this trend will continue to Data as a Service (DaaS).  <a href="http://dbpedia.org/resource/Microsoft" id="link-id0x25382248">Microsoft</a> OData and Dallas are early examples of this and go towards legitimizing the data-as-a-service concept.  DaaS is not related to semantic technology <i>per se</i>, but since this will involve integration of data, RDF serializations will be attractive, especially given the takeoff of <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x24d9ac10">linked data</a> in general. The data models in OData are also much like RDF, as both stem from EAV+CR, which makes for easy translation and a degree of inherent interoperability.</p>

<p>The integration of semantic technology into existing web properties and business applications will manifest to the end user as increased serendipity.  The systems will be able to provide more relevant and better contextualized data for the user&#39;s situation.  This applies equally to the consumer and business user cases.</p>

<p>Identity virtualization in the forms of WebID and Webfinger — making first-class de-referenceable identifiers of <code>mailto:</code> and <code>acct:</code> schemes — is emerging as a new way to open social network and Web 2.0 data silos.</p>

<p>On the software production side, especially as concerns data integration, the increased <a href="http://dbpedia.org/resource/Database_schema" id="link-id0x252fd428">schema</a>- and inference-flexibility of EAV will lead to a quicker time to answer in many situations.  The more complex the task or the more diverse the data, the higher the potential payoff.  Data in <a href="http://dbpedia.org/resource/Cyberspace" id="link-id0x28ca9510">cyberspace</a> is mirroring the complexity and diversity of the real world, where heterogeneity and disparity are simply facts of life, and such flexibility is becoming an inescapable necessity.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-09-21#1635">
  <rss:title>VLDB Semdata Workshop</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2010-09-21T21:14:14Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">I will begin by extending my thanks to the organizers, in particular Reto Krummenacher of STI and Atanas Kiryakov of Ontotext, for inviting me to give a position paper at the workshop. Indeed, it is the builders of bridges, the pontiffs (pontifex) amongst us who shall be remembered by history. The idea of organizing a semantic data management workshop at VLDB is a laudable attempt at rapprochement between two communities to the advantage of all concerned. Franz, Ontotext, and OpenLink were the vendors present at the workshop. To summarize very briefly, Jans Aasman of Franz talked about the telco call center automation solution by Amdocs, where the AllegroGraph RDF store is integrated. On the technical side, AllegroGraph has JavaScript as a stored procedure language, which is certainly a good idea. Naso of Ontotext talked about the BBC FIFA World Cup site. The technical proposition was that materialization is good and data partitioning is not needed; a set of replicated read-only copies is good enough. I talked about making RDF cost competitive with relational for data integration and BI. The crux is space efficiency and column store techniques. One question that came up was that maybe RDF could approach relational in some things, but what about string literals being stored in a separate table? Or URI strings being stored in a separate table? The answer is that if one accesses a lot of these literals the access will be local and fairly efficient. If one accesses just a few, it does not matter. For user-facing reports, there is no point in returning a million strings that the user will not read anyhow. But then it turned out that there in fact exist reports in bioinformatics where there are 100,000 strings. Now take the worst abuse of SPARQL: a regexp over all literals in a property of a given class. With a column store this is a scan of the column; with RDF, a three-table join. 
The join is about 10x slower than the column scan. Quite OK, considering that a full text index is the likely solution for such workloads anyway. Besides, a sensible relational schema will also not use strings for foreign keys, and will therefore incur a similar burden from fetching the strings before returning the result. Another question was about whether the attitude was one of confrontation between RDF and relational and whether it would not be better to join forces. Well, as said in my talk, sauce for the goose is sauce for the gander and generally speaking relational techniques apply equally to RDF. There are a few RDB tricks that have no RDF equivalent, like clustering a fact table on dimension values, e.g., sales ordered by country, manufacturer, month. But by and large, column-store techniques apply. The execution engine can be essentially identical, just needing a couple of extra data types and some run-time typing and in some cases producing nulls instead of errors. Query optimization is much the same, except that RDB stats are not applicable as such; one needs to sample the data in the cost model. All in all, these adaptations to a RDB are not so large, even though they do require changes to source code. Another question was about combining data models, e.g., relational (rows and columns), RDF (graph), XML (tree), and full text. Here I would say that it is a fault of our messaging that we do not constantly repeat the necessity of this combining, as we take it for granted. Most RDF stores have a full text index on literal values. OWLIM and a CWI prototype even have it for URIs. XML is a valid data type for an RDF literal, even though this does not get used very much. So doing SPARQL to select the values, and then doing XPath and XSLT on the values, is entirely possible, at least in Virtuoso which has an XPath/XSLT engine built in. Same for invoking SPARQL from an XSLT sheet. 
Colocating a native RDBMS with local and federated SQL is what Virtuoso has always done. One can, for example, map tables in heterogeneous remote RDBs into tables in Virtuoso, then map these into RDF, and run SPARQL queries that get translated into SQL against the original tables, thereby getting SPARQL access without any materialization. Alongside this, one can ETL relational data into RDF via the same declarative mapping. Further, there are RDF extensions for geospatial queries in Virtuoso and AllegroGraph, and soon also in others. With all this cross-model operation, RDF is definitely not a closed island. We&#39;ll have to repeat this more. Of the academic papers, SpiderStore (the paper is not yet available at time of writing, but should be soon) and Webpie should be specially noted. Let us talk about SpiderStore first. SpiderStore The SpiderStore from the University of Innsbruck is a main-memory-only system that has a record for each distinct IRI. The IRI record has one array of pointers to all IRI records that are objects where the referencing record is the subject, and a similar array of pointers to all records where the referencing record is the object. Both sets of pointers are clustered based on the predicate labeling the edge. According to the authors (Robert Binna, Wolfgang Gassler, Eva Zangerle, Dominic Pacher, and Günther Specht), a distinct IRI is 5 pointers and each triple is 3 pointers. This would make about 4 pointers per triple, i.e., 32 bytes with 64-bit pointers. This is not particularly memory efficient, since one must count unused space after growing the lists, fragmentation, etc., which will make the space consumption closer to 40 bytes per triple. Should one add a graph to the mix, one would need another pointer per distinct predicate, adding another 1-4 bytes per triple. Supporting non-IRI types in the object position is not a problem, as long as all distinct values have a chunk of memory to them with a type tag. 
We get a few times better memory efficiency with column compressed quads, plus we are not limited to main memory. But SpiderStore has a point. Making the traversal of an edge in the graph into a pointer dereference is not such a bad deal, especially if the data set is not that big. Furthermore, compiling the queries into C procedures playing with the pointers alone would give performance to match or exceed any hard-coded graph traversal library and would not be very difficult. Supporting multithreaded updates would spoil much of the gain but allowing single-threaded updates and forking read-only copies for reading would be fine. SpiderStore as such is not attractive for what we intend to do, this being aggregating RDF quads in volumes far exceeding main memory and scaling to clusters. We note that SpiderStore hits problems with distributed memory, since SpiderStore executes depth first, which is manifestly impossible if significant latencies are involved. In other words, if there can be latency, one must amortize by having a lot of other possible work available. Running with long vectors of values is one way, as in MonetDB or Virtuoso Cluster. The other way is to have a massively multithreaded platform which favors code with few instructions but little memory locality. SpiderStore could be a good fit for massive multithreading, especially if queries were compiled to C, dramatically cutting down on the count of instructions to execute. We too could adopt some ideas from SpiderStore. Namely, if running vectored, one just in passing, without extra overhead, generates an array of links to the next IRI, a bit like the array that SpiderStore has for each predicate for the incoming and outgoing edges of a given IRI. Of course, here these would be persistent IDs and not pointers, but a hash from one to the other takes almost no time. So, while SpiderStore alone may not be what we are after for data warehousing, Spiderizing parts of the working set would not be so bad. 
This is especially so since the Spiderizable data structure almost gets made as a by-product of query evaluation. If an algorithm made several passes over a relatively small subgraph of the whole database, Spiderizing it would accelerate things. The memory overhead could have a fixed cap so as not to ruin the working set if locality happened not to hold. Running a SpiderStore-like execution model on vectors instead of single values would likely do no harm and might even result in better cache behavior. The exception is in the event of completely unpredictable patterns of connections which may only be amortized by massive multithreading. Webpie Webpie from VU Amsterdam and the LarKC EU FP 7 project is, as it were, the opposite of SpiderStore. This is a map-reduce-based RDFS and OWL Horst inference engine which is all about breadth-first passes over the data in a map-reduce framework with intermediate disk-based storage. Webpie is not however a database. After the inference result has been materialized, it must be loaded into a SPARQL engine in order to evaluate a query against the result. The execution plan of Webpie is made from the ontology whose consequences must be materialized. The steps are sorted and run until a fixed point is reached for each. This is similar to running SPARQL INSERT … SELECT statements until no new inserts are produced. The only requirement is that the INSERT statement should report whether new inserts were actually made. This is easy to do. In this way, a comparison between map-reduce plus memory-based joining and a parallel RDF database could be made. We have suggested such an experiment to the LarKC people. We will see.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>I will begin by extending my thanks to the organizers, in particular <a href="http://members.deri.at/~retok" id="link-id0x236ebfd0">Reto Krummenacher</a> of <a href="http://www.sti-innsbruck.at/" id="link-id0x2371aca8">STI</a> and Atanas Kiryakov of <a href="http://dbpedia.org/resource/Ontotext" id="link-id0x22e24190">Ontotext</a>, for inviting me to give a position paper at the workshop. Indeed, it is the builders of bridges, the pontiffs (pontifex) amongst us who shall be remembered by history. The idea of organizing a semantic <a href="http://dbpedia.org/resource/Data" id="link-id0x23781ba8">data</a> management workshop at VLDB is a laudable attempt at rapprochement between two communities to the advantage of all concerned.</p>

<p>
<a href="http://semanticweb.org/id/Franz_Inc" id="link-id0x22e09fa8">Franz</a>, Ontotext, and OpenLink were the vendors present at the workshop. To summarize very briefly, <a href="http://data.semanticweb.org/person/jans-aasman" id="link-id0x2380e7c8">Jans Aasman</a> of Franz talked about the telco call center automation solution by Amdocs, where the <a href="http://semanticweb.org/id/AllegroGraph" id="link-id0x237c9408">AllegroGraph</a> <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x236f96a8">RDF</a> store is integrated. On the technical side, AllegroGraph has JavaScript as a stored procedure language, which is certainly a good idea. Naso of Ontotext talked about the BBC FIFA World Cup site. The technical proposition was that materialization is good and data partitioning is not needed; a set of replicated read-only copies is good enough.</p>

<p>I talked about making RDF cost competitive with relational for data integration and BI. The crux is space efficiency and column store techniques.</p>

<p>One question that came up was that maybe RDF could approach relational in some things, but what about string literals being stored in a separate table? Or <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id0x22ff2c78">URI</a> strings being stored in a separate table?</p>

<p>The answer is that if one accesses a lot of these literals the access will be local and fairly efficient. If one accesses just a few, it does not matter. For user-facing reports, there is no point in returning a million strings that the user will not read anyhow. But then it turned out that there in fact exist reports in bioinformatics where there are 100,000 strings. Now take the worst abuse of <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x236e43f8">SPARQL</a>: a regexp over all literals in a property of a given class. With a column store this is a scan of the column; with RDF, a three-table join. The join is about 10x slower than the column scan. Quite OK, considering that a full text index is the likely solution for such workloads anyway. Besides, a sensible relational <a href="http://dbpedia.org/resource/Database_schema" id="link-id0x22e31050">schema</a> will also not use strings for foreign keys, and will therefore incur a similar burden from fetching the strings before returning the result.</p>

<p>Another question was about whether the attitude was one of confrontation between RDF and relational and whether it would not be better to join forces. Well, as I said in my talk, sauce for the goose is sauce for the gander and generally speaking relational techniques apply equally to RDF. There are a few RDB tricks that have no RDF equivalent, like clustering a fact table on dimension values, e.g., sales ordered by country, manufacturer, month. But by and large, column-store techniques apply. The execution engine can be essentially identical, just needing a couple of extra data types and some run-time typing and in some cases producing nulls instead of errors. Query <a href="http://dbpedia.org/resource/Program_optimization" id="link-id0x237d76e0">optimization</a> is much the same, except that RDB stats are not applicable as such; one needs to sample the data in the cost model. All in all, these adaptations to an RDB are not so large, even though they do require changes to source code.</p>

<p>Another question was about combining data models, e.g., relational (rows and columns), RDF (graph), <a href="http://dbpedia.org/resource/XML" id="link-id0x23845418">XML</a> (tree), and full text. Here I would say that it is a fault of our messaging that we do not constantly repeat the necessity of this combining, as we take it for granted. Most RDF stores have a full text index on literal values. OWLIM and a <a href="http://dbpedia.org/resource/National_Research_Institute_for_Mathematics_and_Computer_Science" id="link-id0x22feefa0">CWI</a> prototype even have it for URIs. XML is a valid data type for an RDF literal, even though this does not get used very much. So doing SPARQL to select the values, and then doing <a href="http://dbpedia.org/resource/XPath" id="link-id0x235b5890">XPath</a> and XSLT on the values, is entirely possible, at least in <a href="http://virtuoso.openlinksw.com" id="link-id0x237f6428">Virtuoso</a>, which has an XPath/XSLT engine built in. The same goes for invoking SPARQL from an XSLT sheet. Colocating a native <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x238265a8">RDBMS</a> with local and federated <a href="http://dbpedia.org/resource/SQL" id="link-id0x236f7bc8">SQL</a> is what Virtuoso has always done. One can, for example, map tables in heterogeneous remote RDBs into tables in Virtuoso, then map these into RDF, and run SPARQL queries that get translated into SQL against the original tables, thereby getting SPARQL access without any materialization. Alongside this, one can ETL relational data into RDF via the same declarative mapping.</p>

<p>Further, there are RDF extensions for geospatial queries in Virtuoso and AllegroGraph, and soon also in others.</p>

<p>With all this cross-model operation, RDF is definitely not a closed island. We&#39;ll have to repeat this more.</p>

<p>Of the academic papers, SpiderStore (the <a href="http://dbis-informatik.uibk.ac.at/5-1-Publications.html" id="link-id0x19ecd3f0">paper</a> is not yet available at time of writing, but should be soon) and <a href="http://www.few.vu.nl/~jui200/webpie.html" id="link-id0x1d60a498">Webpie</a> should be specially noted.</p>

<p>Let us talk about SpiderStore first.</p>

<h2>SpiderStore</h2>

<p>The SpiderStore from the University of Innsbruck is a main-memory-only system that has a record for each distinct IRI. The IRI record has one array of pointers to all IRI records that are objects where the referencing record is the subject, and a similar array of pointers to all records where the referencing record is the object. Both sets of pointers are clustered based on the predicate labeling the edge.</p>
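<p>As a rough illustration of the layout just described, here is a minimal sketch in Python. It is not the authors' implementation; the class and attribute names (<code>Node</code>, <code>Graph</code>, <code>outgoing</code>, <code>incoming</code>) are made up for this example, and Python object references stand in for raw pointers.</p>

```python
from collections import defaultdict


class Node:
    """One record per distinct IRI (hypothetical layout, for illustration only)."""

    def __init__(self, iri):
        self.iri = iri
        # predicate -> list of Nodes that are objects where this node is the subject
        self.outgoing = defaultdict(list)
        # predicate -> list of Nodes that are subjects where this node is the object
        self.incoming = defaultdict(list)


class Graph:
    def __init__(self):
        self.nodes = {}  # IRI string -> Node

    def node(self, iri):
        # Create the record on first use, then always return the same one.
        if iri not in self.nodes:
            self.nodes[iri] = Node(iri)
        return self.nodes[iri]

    def add(self, s, p, o):
        # Adding a triple links the two records in both directions,
        # grouped (clustered) by the predicate labeling the edge.
        subj, obj = self.node(s), self.node(o)
        subj.outgoing[p].append(obj)
        obj.incoming[p].append(subj)


g = Graph()
g.add("ex:alice", "ex:knows", "ex:bob")
g.add("ex:alice", "ex:knows", "ex:carol")

# Traversing an edge is just following a reference.
friends = [n.iri for n in g.node("ex:alice").outgoing["ex:knows"]]
```

<p>Following <code>outgoing</code> or <code>incoming</code> here plays the role of the pointer dereference discussed below; a real main-memory store would use packed arrays rather than Python lists.</p>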

<p>According to the authors (Robert Binna, Wolfgang Gassler, Eva Zangerle, Dominic Pacher, and Günther Specht), a distinct IRI is 5 pointers and each triple is 3 pointers. This would make about 4 pointers per triple, i.e., 32 bytes with 64-bit pointers.</p>
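<p>The arithmetic can be spelled out as follows. The ratio of about five triples per distinct IRI is my assumption, chosen because it reproduces the four-pointers figure; it is not a number given by the authors.</p>

```python
POINTER_BYTES = 8          # 64-bit pointers
POINTERS_PER_IRI = 5       # per distinct IRI, as reported by the authors
POINTERS_PER_TRIPLE = 3    # per triple, as reported by the authors


def bytes_per_triple(n_triples, n_distinct_iris):
    """Average memory per triple, ignoring slack space and fragmentation."""
    total_pointers = (n_triples * POINTERS_PER_TRIPLE
                      + n_distinct_iris * POINTERS_PER_IRI)
    return total_pointers * POINTER_BYTES / n_triples


# Assumed data set shape: five triples for every distinct IRI.
estimate = bytes_per_triple(n_triples=5_000_000, n_distinct_iris=1_000_000)
```

<p>With that ratio the estimate comes to 4 pointers, or 32 bytes, per triple; a data set with more distinct IRIs per triple would cost more.</p>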

<p>This is not particularly memory efficient, since one must count unused space after growing the lists, fragmentation, etc., which will make the space consumption closer to 40 bytes per triple. Should one add a graph to the mix, one would need another pointer per distinct predicate, adding another 1-4 bytes per triple. Supporting non-IRI types in the object position is not a problem, as long as all distinct values have a chunk of memory to them with a type <a href="http://dbpedia.org/resource/Tag" id="link-id0x236fe4d0">tag</a>.</p>

<p>We get a few times better memory efficiency with column compressed quads, plus we are not limited to main memory.</p>

<p>But SpiderStore has a point. Making the traversal of an edge in the graph into a pointer dereference is not such a bad deal, especially if the data set is not that big. Furthermore, compiling the queries into <a href="http://dbpedia.org/resource/C%2B%2B" id="link-id0x235a2228">C</a> procedures playing with the pointers alone would give performance to match or exceed any hard-coded graph traversal library and would not be very difficult. Supporting multithreaded updates would spoil much of the gain, but allowing single-threaded updates and forking read-only copies for reading would be fine.</p>

<p>SpiderStore as such is not attractive for what we intend to do, this being aggregating RDF quads in volumes far exceeding main memory and scaling to clusters. We note that SpiderStore hits problems with distributed memory, since SpiderStore executes depth first, which is manifestly impossible if significant latencies are involved. In other words, if there can be latency, one must amortize by having a lot of other possible work available. Running with long vectors of values is one way, as in <a href="http://dbpedia.org/resource/MonetDB" id="link-id0x236e14a0">MonetDB</a> or Virtuoso Cluster. The other way is to have a massively multithreaded platform which favors code with few instructions but little memory locality. SpiderStore could be a good fit for massive multithreading, especially if queries were compiled to C, dramatically cutting down on the count of instructions to execute.</p>

<p>We too could adopt some ideas from SpiderStore. Namely, if running vectored, one just in passing, without extra overhead, generates an array of links to the next IRI, a bit like the array that SpiderStore has for each predicate for the incoming and outgoing edges of a given IRI. Of course, here these would be persistent IDs and not pointers, but a hash from one to the other takes almost no time. So, while SpiderStore alone may not be what we are after for data warehousing, Spiderizing parts of the working set would not be so bad. This is especially so since the Spiderizable data structure almost gets made as a by-product of query evaluation.</p>

<p>If an algorithm made several passes over a relatively small subgraph of the whole database, Spiderizing it would accelerate things. The memory overhead could have a fixed cap so as not to ruin the working set if locality happened not to hold.</p>

<p>Running a SpiderStore-like execution model on vectors instead of single values would likely do no harm and might even result in better <a href="http://dbpedia.org/resource/Cache" id="link-id0x237eb508">cache</a> behavior. The exception is in the event of completely unpredictable patterns of connections which may only be amortized by massive multithreading.</p>

<h2>Webpie</h2>

<p>Webpie from <a href="http://www.vu.nl/" id="link-id0x23811bf8">VU Amsterdam</a> and the <a href="http://www.larkc.eu/" id="link-id0x22ff8fe8">LarKC</a> EU FP 7 project is, as it were, the opposite of SpiderStore. This is a map-reduce-based RDFS and <a href="http://dbpedia.org/resource/Web_Ontology_Language" id="link-id0x238482a0">OWL</a> Horst inference engine which is all about breadth-first passes over the data in a map-reduce framework with intermediate disk-based storage.</p>

<p>Webpie is not, however, a database. After the inference result has been materialized, it must be loaded into a SPARQL engine in order to evaluate a query against the result.</p>

<p>The execution plan of Webpie is made from the ontology whose consequences must be materialized. The steps are sorted and run until a fixed point is reached for each. This is similar to running SPARQL <code>INSERT … SELECT</code> statements until no new inserts are produced. The only requirement is that the <code>INSERT</code> statement should report whether new inserts were actually made. This is easy to do. In this way, a comparison between map-reduce plus memory-based joining and a parallel RDF database could be made.</p>
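<p>The fixed-point loop can be sketched as below. This is only an illustration of the idea, not Webpie's code or a Virtuoso API: the triples live in a Python set, the single rule is transitivity of <code>rdfs:subClassOf</code>, and the function names are invented for the example. The rule function stands in for one <code>INSERT … SELECT</code> and reports how many triples it actually added.</p>

```python
SUBCLASS = "rdfs:subClassOf"


def insert_subclass_transitivity(triples):
    """One pass of a single rule; returns the number of triples actually inserted."""
    derived = {
        (a, SUBCLASS, c)
        for (a, p1, b) in triples if p1 == SUBCLASS
        for (b2, p2, c) in triples if p2 == SUBCLASS and b2 == b
    }
    new = derived - triples   # keep only triples that were not already present
    triples |= new            # the "insert"
    return len(new)


def run_to_fixed_point(triples, rule):
    """Re-run the rule until a pass inserts nothing; returns the number of productive passes."""
    passes = 0
    while rule(triples):
        passes += 1
    return passes


data = {
    ("ex:A", SUBCLASS, "ex:B"),
    ("ex:B", SUBCLASS, "ex:C"),
    ("ex:C", SUBCLASS, "ex:D"),
}
productive_passes = run_to_fixed_point(data, insert_subclass_transitivity)
```

<p>The loop stops on the first pass that reports zero new inserts, which is exactly why the insert needs to report whether anything was added.</p>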

<p>We have suggested such an experiment to the LarKC people. We will see.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-09-21#1634">
  <rss:title>Suggested Extensions to the BSBM</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2010-09-21T21:13:39Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Below is a list of possible extensions to the Berlin SPARQL Benchmark. Our previous critique of BSBM consists of: The queries touch very little data, to the point where compilation is a large fraction of execution time. This is not representative of the data integration/analytics orientation of RDF. Most queries are logarithmic to scale factor, but some are linear. The linear ones come to dominate the metric at larger scales. An update stream would make the workload more realistic. We could rectify this all with almost no changes to the data generator or test driver by adding one or two more metrics. So I am publishing the below as a starting point for discussion. BSBM Analytics Mix Below is a set of business questions that can be answered with the BSBM data set. These are more complex and touch a greater percentage of the data than the initial mix. Their evaluation is between linear and n * log(n) to the data size. The TPC-H rules can be used for a power (single user) and a throughput (multi-user, where each submits queries from the mix with different parameters and in different order). The TPC-H score formula and executive summary formats are directly applicable. This can be a separate metric from the &quot;restricted&quot; BSBM score. Restricted means &quot;without a full scan with regexp&quot; which will dominate the whole metric at larger scales. Vendor specific variations in syntax will occur, hence these are allowed but disclosure of specific query text should accompany results. Hints for JOIN order and the like are not allowed; queries must be declarative. We note that both SPARQL and SQL implementations of the queries are possible. The queries are ordered so that the first ones fill the cache. Running the analytics mix immediately after backup post initial load is allowed, resulting in semi-warm cache. Steady-state rules will be defined later, seeing the characteristics of the actual workload. 
For each country, list the top 10 product categories, ordered by the count of reviews from the country. Product with the most reviews during its first month on the market 10 products most similar to X, with similarity score based on the count of features in common Top 10 reviewers of category X Product with largest increase in reviews in month X compared to month X-minus-1. Product of category X with largest change in mean price in the last month Most active American reviewer of Japanese cameras last year Correlation of price and average review Features with greatest impact on price — for features occurring in category X, find the top 10 features where the mean price with the feature is most above the mean price without the feature Country with greatest popularity of products in category X — reviews of category X from country Y divided by total reviews Leading product of category X by country, mentioning mean price in each country and number of offers, sort by number of offers Fans of manufacturer — find top reviewers who score manufacturer above their mean score Products sold only in country X BSBM IR Since RDF stores often implement a full text index, and since a full scan with regexp matching would never be used in an online E-commerce portal, it is meaningful to extend the benchmark to have some full text queries. For the SPARQL implementation, text indexing should be enabled for all string-valued literals even though only some of them will be queried in the workload. Q6 from the original mix, now allowing use of text index. Reviews of products of category X where the review contains the names of 1 to 3 product features that occur in said category of products; e.g., MP3 players with support for mp4 and ogg. ibid but now specifying review author. The intent is that structured criteria are here more selective than text. Difference in the frequency of use of &quot;awesome&quot;, &quot;super&quot;, and &quot;suck(s)&quot; by American vs. European vs. 
Asian review authors. Changes to Test Driver For full text queries, the search terms have to be selected according to a realistic distribution. DERI has offered to provide a definition and possibly an implementation for this. The parameter distribution for the analytics queries will be defined when developing the queries; the intent is that one run will touch 90% of the values in the properties mentioned in the queries. The result report will have to be adapted to provide a TPC-H executive summary-style report and appropriate metrics. Changes to Data Generation For supporting the IR mix, reviews should, in addition to random text, contain the following: For each feature in the product concerned, add the label of said feature to 60% of the reviews. Add the names of review author, product, product category, and manufacturer. The review score should be expressed in the text by adjectives (e.g., awesome, super, good, dismal, bad, sucky). Every 20th word can be an adjective from the list correlating with the score in 80% of uses of the word and random in 20%. For 90% of adjectives, pick the adjectives from lists of idiomatic expressions corresponding to the country of the reviewer. In 10% of cases, use a random list of idioms. Skew the review scores so that comparatively expensive products have a smaller chance for a bad review. Update Stream During the benchmark run: 1% of products are added; 3% of initial offers are deleted and 3% are added; and 5% of reviews are added. Updates may be divided into transactions and run in series or in parallel in a manner specified by the test sponsor. The code for loading the update stream is vendor specific but must be disclosed. The initial bulk load does not have to be transactional in any way. Loading the update stream must be transactional, guaranteeing that all information pertaining to a product or an offer constitutes a transaction. Multiple offers or products may be combined in a transaction. 
Queries should run at least in READ COMMITTED isolation, so that half-inserted products or offers are not seen. Full text indices do not have to be updated transactionally; the update can lag up to 2 minutes behind the insertion of the literal being indexed. The test data generator generates the update stream together with the initial data. The update stream is a set of files containing Turtle-serialized data for the updates, with all triples belonging to a transaction in consecutive order. The possible transaction boundaries are marked with a comment distinguishable from the text. The test sponsor may implement a special load program if desired. The files must be loaded in sequence but a single file may be loaded on any number of parallel threads. The data generator should generate multiple files for the initial dump in order to facilitate parallel loading. The same update stream can be used during all tests, starting each run from a backup containing only the initial state. In the original run, the update stream is applied starting at the measurement interval, after the SUT is in steady state.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Below is a list of possible extensions to the <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x236e9d38">Berlin SPARQL Benchmark</a>. 
Our previous critique of <a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x237b63c0">BSBM</a> consists of:</p>
<ol>
 <li>
  <p>The queries touch very little <a href="http://dbpedia.org/resource/Data" id="link-id0x23845418">data</a>, to the point where compilation is a large fraction of execution time. This is not representative of the data integration/analytics orientation of <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x237a80a0">RDF</a>. </p>
 </li>
<li>
  <p>Most queries are logarithmic to scale factor, but some are linear. The linear ones come to dominate the metric at larger scales.</p>
</li>
<li>
  <p>An update stream would make the workload more realistic.</p>
</li>
</ol>

<p>We could rectify all of this with almost no changes to the data generator or test driver by adding one or two more metrics.</p>

<p>So I am publishing the below as a starting point for discussion.</p>

<h2>BSBM Analytics Mix</h2>

<p>Below is a set of business questions that can be answered with the BSBM data set. These are more complex and touch a greater percentage of the data than the initial mix. Their evaluation is between linear and <i>n * log(n)</i> to the data size. The <a href="http://www.tpc.org/" id="link-id0x2381e420">TPC</a>-<a href="http://dbpedia.org/resource/TPC-H" id="link-id0x2380e7c8">H</a> rules can be used for a power metric (single user) and a throughput metric (multi-user, where each user submits queries from the mix with different parameters and in a different order). The TPC-H score formula and executive summary formats are directly applicable.</p>

<p>This can be a separate metric from the &quot;restricted&quot; BSBM score. Restricted means &quot;without a full scan with regexp&quot; which will dominate the whole metric at larger scales.</p>

<p>Vendor-specific variations in syntax will occur; these are allowed, but the specific query text should be disclosed with the results. Hints for <code>JOIN</code> order and the like are not allowed; queries must be declarative. We note that both <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x2380dc10">SPARQL</a> and <a href="http://dbpedia.org/resource/SQL" id="link-id0x237c98e8">SQL</a> implementations of the queries are possible.</p>

<p>The queries are ordered so that the first ones fill the <a href="http://dbpedia.org/resource/Cache" id="link-id0x236f8170">cache</a>. Running the analytics mix immediately after the backup that follows the initial load is allowed, resulting in a semi-warm cache. Steady-state rules will be defined later, once the characteristics of the actual workload are known.</p>

<ol>
 <li>
  <p>For each country, list the top 10 product categories, ordered by the count of reviews from the country.</p>
 </li>
<li>
  <p>Product with the most reviews during its first month on the market</p>
</li>
<li>
  <p>10 products most similar to X, with similarity score based on the count of features in common</p>
</li>
<li>
  <p>Top 10 reviewers of category X</p>
</li>
<li>
  <p>Product with largest increase in reviews in month X compared to month X-minus-1.</p>
</li>
<li>
  <p>Product of category X with largest change in mean price in the last month </p>
</li>
<li>
  <p>Most active American reviewer of Japanese cameras last year</p>
</li>
<li>
  <p>Correlation of price and average review</p>
</li>
<li>
  <p>Features with greatest impact on price — for features occurring in category X, find the top 10 features where the mean price with the feature is most above the mean price without the feature</p>
</li>
<li>
  <p>Country with greatest popularity of products in category X — reviews of category X from country Y divided by total reviews</p>
</li>
<li>
  <p>Leading product of category X by country, mentioning mean price in each country and number of offers, sort by number of offers</p>
</li>
<li>
  <p>Fans of manufacturer — find top reviewers who score manufacturer above their mean score</p>
</li>
<li>
  <p>Products sold only in country X</p>
</li>
</ol>
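
<p>To make the intent of one of these questions concrete, here is a minimal Python sketch of question 3 (the 10 products most similar to X by count of shared features). It works over an in-memory map from product to feature set; the data layout and the function name are assumptions for illustration only, and an actual run would express this in SPARQL or SQL against the store.</p>

<pre>
```python
from collections import Counter


def most_similar(product_features, target, limit=10):
    """Rank other products by the number of features shared with `target`.

    product_features: dict mapping product id to a set of feature ids.
    Returns a list of (product, shared_feature_count), best first.
    """
    target_features = product_features[target]
    scores = Counter()
    for product, features in product_features.items():
        if product == target:
            continue
        shared = len(target_features.intersection(features))
        if shared:
            scores[product] = shared
    # Highest score first; ties broken by product id for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]
```
</pre>

<p>Every product sharing at least one feature with X is visited, which is why a query of this kind touches far more data than the lookups in the original mix.</p>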

<h2>BSBM IR</h2>

<p>Since RDF stores often implement a full text index, and since a full scan with regexp matching would never be used in an online E-commerce portal, it is meaningful to extend the benchmark to have some full text queries.</p>

<p>For the SPARQL implementation, text indexing should be enabled for all string-valued literals even though only some of them will be queried in the workload.</p>

<ul>
 <li>
  <p>Q6 from the original mix, now allowing use of text index.</p>
 </li>
<li>
  <p>Reviews of products of category X where the review contains the names of 1 to 3 product features that occur in said category of products; e.g., MP3 players with support for mp4 and ogg.</p>
</li>
<li>
  <p>The same as the previous query, but now also specifying the review author. The intent is that the structured criteria are here more selective than the text condition.</p>
</li>
<li>
  <p>Difference in the frequency of use of &quot;awesome&quot;, &quot;super&quot;, and &quot;suck(s)&quot; by American vs. European vs. Asian review authors.</p>
</li>
</ul>
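
<p>The last question in this list is essentially a grouped word count. A minimal Python sketch, assuming the reviews have already been reduced to (region, text) pairs and that the rate is measured as occurrences per review, could look as follows. The term table and the function name are hypothetical; a store with a text index would answer this from the index rather than by scanning.</p>

<pre>
```python
import re
from collections import Counter

# Map each surface form to the label used in the report.
TERMS = {"awesome": "awesome", "super": "super", "suck": "suck(s)", "sucks": "suck(s)"}


def term_rates_by_region(reviews):
    """reviews: iterable of (region, text) pairs.

    Returns {region: {label: occurrences per review}}.
    """
    hits = {}
    review_counts = Counter()
    for region, text in reviews:
        review_counts[region] += 1
        region_hits = hits.setdefault(region, Counter())
        for word in re.findall(r"[a-z]+", text.lower()):
            label = TERMS.get(word)
            if label is not None:
                region_hits[label] += 1
    rates = {}
    for region, region_hits in hits.items():
        total = review_counts[region]
        # A Counter yields 0 for labels that never occurred in this region.
        rates[region] = {label: region_hits[label] / total
                         for label in sorted(set(TERMS.values()))}
    return rates
```
</pre>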

<h2>Changes to Test Driver</h2>

<p>For full text queries, the search terms have to be selected according to a realistic distribution. <a href="http://dbpedia.org/resource/Digital_Enterprise_Research_Institute" id="link-id0x2383bd48">DERI</a> has offered to provide a definition and possibly an implementation for this.</p>

<p>The parameter distribution for the analytics queries will be defined when developing the queries; the intent is that one run will touch 90% of the values in the properties mentioned in the queries.</p>

<p>The result report will have to be adapted to provide a TPC-H executive summary-style report and appropriate metrics.</p>

<h2>Changes to Data Generation</h2>

<p>For supporting the IR mix, reviews should, in addition to random text, contain the following:</p>

<ul>
 <li>
  <p>For each feature in the product concerned, add the label of said feature to 60% of the reviews.</p>
 </li>
<li>
  <p>Add the names of review author, product, product category, and manufacturer.</p>
</li>
<li>
  <p>The review score should be expressed in the text by adjectives (e.g., awesome, super, good, dismal, bad, sucky). Every 20th word can be an adjective from the list correlating with the score in 80% of uses of the word and random in 20%. For 90% of adjectives, pick the adjectives from lists of idiomatic expressions corresponding to the country of the reviewer. In 10% of cases, use a random list of idioms.</p>
</li>
<li>
  <p>Skew the review scores so that comparatively expensive products have a smaller chance for a bad review.</p>
</li>
</ul>
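
<p>The adjective rule above can be sketched in Python as follows. This is only one reading of the rule, not the generator's code: the layout of the idiom table, the split of scores into &quot;good&quot; and &quot;bad&quot; at 6 on an assumed 1 to 10 scale, and the function names are all assumptions made for the example.</p>

<pre>
```python
import random


def pick_adjective(score, country, idioms, rng):
    """Pick one adjective following the 90/10 and 80/20 rules.

    idioms: {country: {"good": [words], "bad": [words]}} (hypothetical layout).
    """
    # 90% of adjectives come from the reviewer's own country list,
    # 10% from a randomly chosen list.
    use_own_list = 0.9 > rng.random()
    source = country if use_own_list else rng.choice(sorted(idioms))
    # 80% of uses correlate with the score, 20% are random.
    correlated = 0.8 > rng.random()
    if correlated:
        polarity = "good" if score >= 6 else "bad"  # threshold is an assumption
    else:
        polarity = rng.choice(["good", "bad"])
    return rng.choice(idioms[source][polarity])


def review_text(filler_words, score, country, idioms, rng, every=20):
    """Replace every 20th filler word with an adjective."""
    words = []
    for position, word in enumerate(filler_words, start=1):
        if position % every == 0:
            words.append(pick_adjective(score, country, idioms, rng))
        else:
            words.append(word)
    return " ".join(words)
```
</pre>

<p>Passing in a seeded <code>random.Random</code> keeps the generated text reproducible between runs, which matters if results from different systems are to be compared on identical data.</p>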

<h2>Update Stream</h2>

<p>During the benchmark run:</p>

<ul>
 <li>
  <p>1% of products are added;</p>
 </li>
<li>
  <p>3% of initial offers are deleted and 3% are added; and </p>
</li>
<li>
  <p>5% of reviews are added.</p>
</li>
</ul>

<p>Updates may be divided into transactions and run in series or in parallel in a manner specified by the test sponsor. The code for loading the update stream is vendor specific but must be disclosed.</p>

<p>The initial bulk load does not have to be transactional in any way.</p>

<p>Loading the update stream must be transactional, guaranteeing that all <a href="http://dbpedia.org/resource/Information" id="link-id0x236f20f0">information</a> pertaining to a product or an offer constitutes a transaction. Multiple offers or products may be combined in a transaction. Queries should run at least in <code>READ COMMITTED</code> isolation, so that half-inserted products or offers are not seen.</p>

<p>Full text indices do not have to be updated transactionally; the update can lag up to 2 minutes behind the insertion of the literal being indexed.</p>

<p>The test data generator generates the update stream together with the initial data. The update stream is a set of files containing Turtle-serialized data for the updates, with all triples belonging to a transaction in consecutive order. The possible transaction boundaries are marked with a comment distinguishable from the text. The test sponsor may implement a special load program if desired. The files must be loaded in sequence but a single file may be loaded on any number of parallel threads.</p>
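
<p>A load program for such files mainly needs to cut the stream at the boundary comments. The Python sketch below shows one way to do that; the marker text is hypothetical, since the actual comment format is not defined here, and committing each batch to the store is left to the vendor-specific loader.</p>

<pre>
```python
# Hypothetical boundary comment; the real generator would define its own.
BOUNDARY = "#__TXN_BOUNDARY__"


def split_transactions(lines, boundary=BOUNDARY):
    """Yield lists of Turtle lines, one list per possible transaction."""
    batch = []
    for line in lines:
        stripped = line.strip()
        if stripped == boundary:
            # Close the current batch; consecutive markers yield nothing.
            if batch:
                yield batch
                batch = []
        elif stripped:
            batch.append(stripped)
    # Emit whatever follows the last marker.
    if batch:
        yield batch
```
</pre>

<p>Because the boundaries are only possible commit points, a loader may also merge several consecutive batches into one transaction, as allowed above for multiple offers or products.</p>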

<p>The data generator should generate multiple files for the initial dump in order to facilitate parallel loading.</p>

<p>The same update stream can be used during all tests, starting each run from a backup containing only the initial state. In the original run, the update stream is applied starting at the measurement interval, after the SUT is in steady state.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-09-21#1633">
  <rss:title>LOD2 Kick Off</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2010-09-21T21:13:03Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">The LOD2 kick off meeting was held in Leipzig on Sept 6-8. I will here talk about OpenLink plans as concerns LOD2; hence this is not to be taken as representative of the whole project. I will first discuss the immediate and conclude with the long term. As concerns OpenLink specifically, we have two short term activities, namely publishing the initial LOD2 repository in December and publishing a set of RDB and RDF benchmarks in February. The LOD2 repository is a fusion of the OpenLink LOD Cloud Cache (which includes data from URIBurner and PingTheSemanticWeb) and Sindice, both hosted at DERI. The value-add compared to Sindice or the Virtuoso-based LOD Cloud Cache alone is the merger of the timeliness and ping-ping crawling of Sindice with the SPARQL of Virtuoso. Further down the road, after we migrate the system to the Virtuoso column store, we will also see gains in performance, primarily due to much better working set, as data is many times more compact than with the present row-wise key compression. Still further, but before next September, we will have dynamic repartitioning; the time of availability is set as this is part of the LOD2 project roadmap. The operational need for this is pushed back somewhat by the compression gains from column-wise storage. As for benchmarks, I just compiled a draft of suggested extensions to the BSBM (Berlin SPARQL Benchmark). I talked about this with Peter Boncz and Chris Bizer, to the effect that some extensions of BSBM could be done but that the time was a bit short for making a RDF-specific benchmark. We do recall that BSBM is fully feasible with a relational schema and that RDF offers no fundamental edge for the workload. There was a graph benchmark talk at the TPC workshop at VLDB 2010. There too, the authors were suggesting a social network use case for benchmarking anything from RDF stores to graph libraries. 
The presentation did not include any specification of test data, so it may be that some cooperation is possible there. The need for such a benchmark is well acknowledged. The final form of this is not yet set but LOD2 will in time publish results from such. We did informally talk about a process for publishing with our colleagues from Franz and Ontotext at VLDB 2010. The idea is that vendors tune their own systems and do the runs and that the others check on this, preferably all using the same hardware. Now, the LOD2 benchmarks will also include relational-to-RDF comparisons, for example TPC-H in SQL and SPARQL. The SQL will be Virtuoso, MonetDB, and possibly VectorWise and others, depending on what legal restrictions apply at the time. This will give an RDF-to-SQL comparison of TPC-H at least on Virtuoso, later also on MonetDB, depending on the schedule for a MonetDB SPARQL front-end. In the immediate term, this of course focuses our efforts on productizing the Virtuoso column store extension and the optimizations that go with it. LOD2 is however about much more than database benchmarks. Over the longer term, we plan to apply suitable parts of the ground-breaking database research done at CWI to RDF use cases. This involves anything from adaptive indexing, to reuse and caching of intermediate results, to adaptive execution. This is however more than just mapping column store concepts to RDF. New challenges are posed by running on clusters and dealing with more expressive queries than just SQL, in specific queries with Datalog-like rules and recursion. LOD2 is principally about integration and alignment, from the schema to the instance level. This involves complex batch processing, close to the data, on large volumes of data. Map-reduce is not the be-all-end-all of this. Of course, a parallel database like Virtuoso, Greenplum, or Vertica can do map-reduce style operations under control of the SQL engine. 
After all, the SQL engine needs to do map-reduce and a lot more to provide good throughput for parallel, distributed SQL. Something like the Berkeley Orders Of Magnitude (BOOM) distributed Datalog implementation (Overlog, Dedalus, Bloom) could be a parallel computation framework that would subsume any map-reduce-style functionality under a more elegant declarative framework while still leaving control of execution to the developer for the cases where this is needed. From our viewpoint, the project&#39;s gains include: Significant narrowing of the RDB to RDF performance gap. RDF will be an option for large scale warehousing, cutting down on time to integration by providing greater schema flexibility. Ready to use toolbox for data integration, including schema alignment and resolution of coreference. Data discovery, summarization and visualization Integrating this into a relatively unified stack of tools is possible, since these all cluster around the task of linking the universe with RDF and linked data. In this respect the integration of results may be stronger than often seen in European large scale integrating projects. The use cases fit the development profile well: Wolters Kluwer will develop an application for integrating resources around law, from the actual laws to court cases to media coverage. The content is modeled in a fine grained legal ontology. Exalead will implement the linked data enterprise, addressing enterprise search and any typical enterprise data integration plus generating added value from open sources. The Open Knowledge Foundation will create a portal of all government published data for easy access by citizens. In all these cases, the integration requirements of schema alignment, resolution of identity, information extraction, and efficient storage and retrieval play a significant role. 
The end user interfaces will be task-specific but developer interfaces around integration tools and query formulation may be quite generic and suited for generic RDF application development.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>The <a href="http://lod2.eu/" id="link-id0x22e06810">LOD2</a> <a href="http://lod2.eu/BlogPost/9-press-release-lod2-project-launch.html" id="link-id0x18c0c770">kick off meeting</a> was held in Leipzig on Sept 6-8. I will here talk about OpenLink plans as concerns LOD2; hence this is not to be taken as representative of the whole project. I will first discuss the immediate and conclude with the long term.</p>

<p>As concerns OpenLink specifically, we have two short term activities, namely publishing the initial LOD2 repository in December and publishing a set of RDB and <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x22f9ba70">RDF</a> benchmarks in February.</p>

<p>The LOD2 repository is a fusion of the OpenLink <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x2378d288">LOD</a> <a href="http://lod.openlinksw.com/" id="link-id0x23908828">Cloud</a> <a href="http://dbpedia.org/resource/Cache" id="link-id0x2378e6c8">Cache</a> (which includes <a href="http://dbpedia.org/resource/Data" id="link-id0x237d7d20">data</a> from <a href="http://uriburner.com/" id="link-id0x237c9408">URIBurner</a> and <a href="http://www.pingthesemanticweb.com/" id="link-id0x235b03b0">PingTheSemanticWeb</a>) and <a href="http://sindice.com/" id="link-id0x22e24190">Sindice</a>, both hosted at <a href="http://dbpedia.org/resource/Digital_Enterprise_Research_Institute" id="link-id0x237b80f8">DERI</a>. The value-add compared to Sindice or the <a href="http://virtuoso.openlinksw.com" id="link-id0x237b63c0">Virtuoso</a>-based LOD Cloud Cache alone is the merger of the timeliness and ping-ping crawling of Sindice with the <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x237f7568">SPARQL</a> of Virtuoso.</p>

<p>Further down the road, after we migrate the system to the Virtuoso column store, we will also see gains in performance, primarily due to a much better working set, as data is many times more compact than with the present row-wise <a href="http://dbpedia.org/resource/Data_compression" id="link-id0x235b0c38">key compression</a>.</p>

<p>Still further, but before next September, we will have dynamic repartitioning; the time of availability is set as this is part of the LOD2 project roadmap. The operational need for this is pushed back somewhat by the compression gains from column-wise storage.</p>

<p>As for benchmarks, I just compiled <a href="http://www.openlinksw.com/weblogs/oerling/" id="link-id0x1c29e720">a draft of suggested extensions to the BSBM</a> (<a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html" id="link-id0x22e31050">Berlin SPARQL Benchmark</a>). I talked about this with <a href="http://nl.linkedin.com/in/peterboncz" id="link-id0x237c90b0">Peter Boncz</a> and <a href="http://data.semanticweb.org/person/christian-bizer" id="link-id0x23813340">Chris Bizer</a>, to the effect that some extensions of BSBM could be done but that the time was a bit short for making an RDF-specific benchmark. We do recall that BSBM is fully feasible with a relational <a href="http://dbpedia.org/resource/Database_schema" id="link-id0x236f7ef8">schema</a> and that RDF offers no fundamental edge for the workload.</p>

<p>There was a graph benchmark talk at the <a href="http://www.tpc.org/" id="link-id0x236f8170">TPC</a> workshop at <a href="http://www.vldb2010.org/" id="link-id0x235c6b90">VLDB 2010</a>. There too, the authors were suggesting a social network use case for benchmarking anything from RDF stores to graph libraries. The presentation did not include any specification of test data, so it may be that some cooperation is possible there. The need for such a benchmark is well acknowledged. Its final form is not yet set, but LOD2 will in time publish results from such a benchmark.</p>

<p>We did informally talk about a process for publishing with our colleagues from <a href="http://semanticweb.org/id/Franz_Inc" id="link-id0x23781d28">Franz</a> and <a href="http://dbpedia.org/resource/Ontotext" id="link-id0x23782740">Ontotext</a> at VLDB 2010. The idea is that vendors tune their own systems and do the runs and that the others check on this, preferably all using the same hardware.</p>

<p>Now, the LOD2 benchmarks will also include relational-to-RDF comparisons, for example TPC-<a href="http://dbpedia.org/resource/TPC-H" id="link-id0x235a3568">H</a> in <a href="http://dbpedia.org/resource/SQL" id="link-id0x22e67370">SQL</a> and SPARQL. The SQL will be Virtuoso, <a href="http://dbpedia.org/resource/MonetDB" id="link-id0x22e70db0">MonetDB</a>, and possibly <a href="http://www.ingres.com/vectorwise/" id="link-id0x2378f750">VectorWise</a> and others, depending on what legal restrictions apply at the time. This will give an RDF-to-SQL comparison of TPC-H at least on Virtuoso, later also on MonetDB, depending on the schedule for a MonetDB SPARQL front-end.</p>

<p>In the immediate term, this of course focuses our efforts on productizing the Virtuoso column store extension and the optimizations that go with it.</p>

<p>LOD2 is however about much more than database benchmarks. Over the longer term, we plan to apply suitable parts of the ground-breaking database research done at <a href="http://dbpedia.org/resource/National_Research_Institute_for_Mathematics_and_Computer_Science" id="link-id0x23911830">CWI</a> to RDF use cases.</p>

<p>This involves anything from adaptive indexing, to reuse and caching of intermediate results, to adaptive execution. This is however more than just mapping column store concepts to RDF. New challenges are posed by running on clusters and dealing with more expressive queries than just SQL, specifically queries with Datalog-like rules and recursion.</p>

<p>LOD2 is principally about integration and alignment, from the schema to the instance level. This involves complex batch processing, close to the data, on large volumes of data. Map-reduce is not the be-all-end-all of this. Of course, a parallel database like Virtuoso, <a href="http://dbpedia.org/resource/Greenplum" id="link-id0x22feb520">Greenplum</a>, or <a href="http://www.vertica.com/" id="link-id0x237f7428">Vertica</a> can do map-reduce style operations under control of the SQL engine. After all, the SQL engine needs to do map-reduce and a lot more to provide good throughput for parallel, distributed SQL. Something like the <a href="http://www.eecs.berkeley.edu/Research/Projects/Data/105733.html" id="link-id0x235c2e28">Berkeley Orders Of Magnitude</a> (<a href="http://www.eecs.berkeley.edu/Research/Projects/Data/105733.html" id="link-id0x2380e7c8">BOOM</a>) distributed Datalog implementation (Overlog, Dedalus, Bloom) could be a parallel computation framework that would subsume any map-reduce-style functionality under a more elegant declarative framework while still leaving control of execution to the developer for the cases where this is needed.</p>

<p>From our viewpoint, the project&#39;s gains include:</p>

<ul>
 <li>
  <p>Significant narrowing of the RDB to RDF performance gap. RDF will be an option for large scale warehousing, cutting down on time to integration by providing greater schema flexibility.</p>
 </li>
<li>
  <p>Ready to use toolbox for data integration, including schema alignment and resolution of coreference.</p>
</li>
<li>
  <p>Data discovery, summarization and visualization</p>
</li>
</ul>

<p>Integrating this into a relatively unified stack of tools is possible, since these all cluster around the task of linking the universe with RDF and <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x236e14a0">linked data</a>. In this respect the integration of results may be stronger than often seen in European large scale integrating projects.</p>

<p>The use cases fit the development profile well: </p>
<ul>
 <li>
  <p>
    <a href="http://dbpedia.org/resource/Wolters_Kluwer" id="link-id0x23820568">Wolters Kluwer</a> will develop an application for integrating resources around law, from the actual laws to court cases to media coverage. The content is modeled in a fine grained legal ontology.</p>
 </li>
<li>
  <p>
    <a href="http://dbpedia.org/resource/Exalead" id="link-id0x22e50ba0">Exalead</a> will implement the linked data enterprise, addressing enterprise search and any typical enterprise data integration plus generating added value from open sources.</p>
</li>
<li>
  <p>The Open <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x236fb248">Knowledge</a> Foundation will create a portal of all government published data for easy access by citizens.</p>
</li>
</ul>

<p>In all these cases, the integration requirements of schema alignment, resolution of identity, <a href="http://dbpedia.org/resource/Information" id="link-id0x2381ebb0">information</a> extraction, and efficient storage and retrieval play a significant role. The end user interfaces will be task-specific but developer interfaces around integration tools and query formulation may be quite generic and suited for generic RDF application development.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-09-13#1629">
  <rss:title>Perseus, Andromeda, and RDF</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2010-09-13T22:10:12Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">It has been several months since my last blog post. In this day and age of the attention economy, what gives me the insolence so to neglect my duty to mindshare? Well, Perseus wasn&#39;t blogging or checking his email either, when he went to fetch the Gorgon&#39;s head. As Joseph Campbell puts it, the hero breaks into a world separate from the ordinary in order to bring back a blessing which will revitalize the community. Thus, I deliberately withdrew from the public conversation, in faith that it would take care of itself and that I would still not be altogether forgotten. As it happens, I was confirmed in this when recently invited to submit a talk for the Semdata workshop at VLDB 2010. Great deeds are not only personal accomplishments but also play a role in a broader context. The quest may appear remote and difficult to execute but its outcome can be quite tangible: Andromeda needed no elaborate sales pitch to convince her of the advantages of not being eaten by the sea serpent. Thus right after the meeting in Sofia last March, I followed the vertical treasure map into the realm of first principles. As Perseus received advice from Athena, so was I informed by the Platonic ideas of locality and concurrency. The great quests have an outer and inner aspect. Likewise here, bringing the ideas to physical reality gave me a great deal of material on cognitive function itself. For human and computer alike, it appears that the main reason why anything at all works is cache. Locality and parallelism again. Maybe I will say something more about memory, attention, interface, and paradigm some other time. On the other hand, such material is bound to be unpopular even if valid. By now, you may ask yourself what I am talking about. We remember that Andromeda&#39;s fix was due to her mother, Cassiopeia, having claimed greater beauty than the daughters of the sea-god Poseidon. 
To transpose the archetype into the present, it is like Tim B-L saying that OWLs (by the way sacred to Athena) are more semantic than Codd&#39;s brainchild. Yet the relational community sees RDF as something not quite serious. A matter of scale(s) — just think of the sea serpent. So, I am talking about what I alluded to in the 2010 New Year&#39;s statement on this blog: RDF as a viable alternative to relational for big data. This means that RDF is no longer a specialty niche where, due to the hopeless task of bringing everything into a relational model, the fact of everything taking several times both the time and space is tolerated because there is no real alternative. The value proposition is that for any current RDF user, the present assets will go four times farther than before with the next release of Virtuoso. For a prospective RDF user, the cost of keeping an ETLed RDF integration warehouse is now in the same ballpark as the relational cost, except that schema is now flexible, and the time to integrate and answer is accordingly shorter. For users of analytics-oriented RDBMS, the next Virtuoso is a full cluster-capable SQL column store. Its merits compared to others in this space will be published later with benchmarks like TPC-H. As an extra bonus for such users, Virtuoso brings SQL federation and a growth path to RDF, should this become interesting. This is accomplished by introducing a new column-wise compressed-storage engine with corresponding changes to query execution. The general principles are explained in Daniel Abadi&#39;s famous Ph.D. thesis. The compression is tuned by the data itself, without user intervention. Further, our implementation remains capable of run-time-typing, thus the column-store advantages to RDF are obtained without going to a task-specific schema. But since data types, even if determined at run-time, are still in practice repetitive, the advantages of running on homogenous vectors are not lost. 
When storing an RDF extraction of TPC-H data, we get a storage usage of 6.3 bytes per quad. If you do not care about queries where the predicate is unspecified, the storage requirement drops to 4.7 bytes per quad. Whether storing the data as RDF quads or as Vertica-style multicolumn projections, the working set is about the same. Since having enough of the data in memory is the sine qua non prerequisite of flexible querying, the point is made. QED. In Virtuoso also, relational remains a bit faster but a penalty of 1.3x or so for RDF is quite tolerable, considering that a priori schema is no longer needed. This means that we are coming into an age where the warehouse becomes an ad hoc asset, to be filled with RDF, without the need to develop an a priori universal schema for all data one may ever wish to integrate, now or in the future. The data can be stored as RDF and projected from there into any form that may be needed at any time, whether the target format is more RDF or a task-specific relational schema. Availability is planned for late 2010, first as a Virtuoso Open Source preview.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>It has been several months since <a href="http://www.openlinksw.com/weblog/oerling/?id=1622" id="link-id0x1c86c3a0">my last</a> <a href="http://dbpedia.org/resource/Blog" id="link-id0x15fa418">blog</a> post. In this day and age of the attention economy, what gives me the insolence so to neglect my duty to mindshare?</p>

<p>Well, Perseus wasn&#39;t blogging or checking his email either, when he went to fetch the Gorgon&#39;s head. As Joseph Campbell puts it, the hero breaks into a world separate from the ordinary in order to bring back a blessing which will revitalize the community.</p>

<p>Thus, I deliberately withdrew from the public conversation, in faith that it would take care of itself and that I would still not be altogether forgotten. As it happens, I was confirmed in this when recently invited to submit a talk for the <a href="http://semdata.org/events/2010/vldb" id="link-id0x1c319338">Semdata workshop</a> at <a href="http://www.vldb2010.org" id="link-id0x1b334640">VLDB 2010</a>.</p>

<p>Great deeds are not only personal accomplishments but also play a role in a broader context. The quest may appear remote and difficult to execute but its outcome can be quite tangible: Andromeda needed no elaborate sales pitch to convince her of the advantages of not being eaten by the sea serpent.</p>

<p>Thus right after <a href="http://www.openlinksw.com/weblog/oerling/?id=1614" id="link-id0x15fd6968">the meeting in Sofia last March</a>, I followed the vertical treasure map into the realm of first principles. As Perseus received advice from Athena, so was I informed by the Platonic ideas of locality and concurrency.</p>

<p>The great quests have an outer and inner aspect. Likewise here, bringing the ideas to physical reality gave me a great deal of material on cognitive function itself. For human and computer alike, it appears that the main reason why anything at all works is <a href="http://dbpedia.org/resource/Cache" id="link-id0x140f1dc0">cache</a>. Locality and parallelism again. Maybe I will say something more about memory, attention, interface, and paradigm some other time. On the other hand, such material is bound to be unpopular even if valid.</p>

<p>By now, you may ask yourself what I am talking about.</p>

<p>We remember that Andromeda&#39;s fix was due to her mother, Cassiopeia, having claimed greater beauty than the daughters of the sea-god Poseidon. To transpose the archetype into the present, it is like Tim B-L saying that OWLs (by the way sacred to Athena) are more semantic than Codd&#39;s brainchild. Yet the relational community sees <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1d241648">RDF</a> as something not quite serious. A matter of scale(s) — just think of the sea serpent.</p>

<p>So, I am talking about what I alluded to in the <a href="http://www.openlinksw.com/weblog/oerling/?id=1603" id="link-id0x19182b88">2010 New Year&#39;s statement on this blog</a>: RDF as a viable alternative to relational for big <a href="http://dbpedia.org/resource/Data" id="link-id0x1b896350">data</a>. This means that RDF is no longer a specialty niche where, due to the hopeless task of bringing everything into a relational model, the fact of everything taking several times both the time and space is tolerated because there is no real alternative.</p>

<p>The value proposition is that for any current RDF user, the present assets will go four times farther than before with the next release of <a href="http://virtuoso.openlinksw.com" id="link-id0x6a223b0">Virtuoso</a>. For a prospective RDF user, the cost of keeping an ETLed RDF integration warehouse is now in the same ballpark as the relational cost, except that <a href="http://dbpedia.org/resource/Database_schema" id="link-id0x15f8ed8">schema</a> is now flexible, and the time to integrate and answer is accordingly shorter. For users of analytics-oriented <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x8bc44d8">RDBMS</a>, the next Virtuoso is a full cluster-capable <a href="http://dbpedia.org/resource/SQL" id="link-id0x127faf40">SQL</a> column store. Its merits compared to others in this space will be published later with benchmarks like <a href="http://www.tpc.org/" id="link-id0x6af7ae0">TPC</a>-<a href="http://dbpedia.org/resource/TPC-H" id="link-id0x1d46f230">H</a>. As an extra bonus for such users, Virtuoso brings SQL federation and a growth path to RDF, should this become interesting.</p>

<p>This is accomplished by introducing a new column-wise compressed-storage engine with corresponding changes to query execution. The general principles are explained in <a href="http://cs-www.cs.yale.edu/homes/dna/papers/abadiphd.pdf" id="link-id0x1d259a88">Daniel Abadi&#39;s famous Ph.D. thesis</a>. The compression is tuned by the data itself, without user intervention. Further, our implementation remains capable of run-time typing, thus the column-store advantages to RDF are obtained without going to a task-specific schema. But since data types, even if determined at run-time, are still in practice repetitive, the advantages of running on homogeneous vectors are not lost.</p>

<p>When storing an RDF extraction of TPC-H data, we get a storage usage of 6.3 bytes per quad. If you do not care about queries where the predicate is unspecified, the storage requirement drops to 4.7 bytes per quad. Whether storing the data as RDF quads or as Vertica-style multicolumn projections, the working set is about the same. Since having enough of the data in memory is the <i>sine qua non</i> of flexible querying, the point is made. QED.</p>

<p>In Virtuoso too, relational remains a bit faster, but a penalty of 1.3x or so for RDF is quite tolerable, considering that an <i>a priori</i> schema is no longer needed.</p>

<p>This means that we are coming into an age where the warehouse becomes an <i>ad hoc</i> asset, to be filled with RDF, without the need to develop an <i>a priori</i> universal schema for all data one may ever wish to integrate, now or in the future. The data can be stored as RDF and projected from there into any form that may be needed at any time, whether the target format is more RDF or a task-specific relational schema.</p>

<p>Availability is planned for late 2010, first as a Virtuoso Open Source preview.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-09-13#1628">
  <rss:title>VLDB Semdata Workshop - The New Frontier of Semdata</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2010-09-13T22:09:24Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">This is a revised version of the talk I will be giving at the Semdata workshop at VLDB 2010. The paper shows how we store TPC-H data as RDF with relational-level efficiency and how we query both RDF and relational versions in comparable time. We also compare row-wise and column-wise storage formats as implemented in Virtuoso. A question that has come up a few times during the Semdata initiative is how semantic data will avoid the fate of other would-be database revolutions like OODBMS and deductive databases. The need and opportunity are driven by the explosion of data in quantity and diversity of structure. The competition consists of analytics RDBMS, point solutions done with map-reduce or the like, and lastly in some cases from key-value stores with relaxed schema but limited querying. The benefits of RDF are the ever expanding volume of data published in it, reuse of vocabulary, and well-defined semantics. The downside is efficiency. This is not so much a matter of absolute scalability — you can run an RDF database on a cluster — but a question of relative cost as opposed to alternatives. The baseline is that for relational-style queries, one should get relational performance or close enough. We outline in the paper how RDF reduces to a run-time-typed relational column-store, and gets all the compression and locality advantages traditionally associated with such. After memory is no longer the differentiator, the rest is engineering. So much for the scalability barrier to adoption. I do not need to talk here about the benefits of linked data and more or less ad hoc integration per se. But again, to make these practical, there are logistics to resolve: How to keep data up to date? How to distribute it incrementally? How to monetize freshness? We propose some solutions for these, looking at diverse-RDF replication and RDB-to-RDF replication in Virtuoso. 
But to realize the ultimate promise of RDF/Linked Data/Semdata, whatever we call it, we must look farther into the landscape of what is being done with big data. Here we are no longer so much running against the RDBMS, but against map-reduce and key-value stores. Given the psychology of geekdom, the charm of map-reduce is understandable: One controls what is going on, can work in the usual languages, can run on big iron without being picked to pieces by the endless concurrency and timing and order-of-events issues one gets when programming a cluster. Tough for the best, and unworkable for the rest. The key-value store has some of the same appeal, as it is the DBMS laid bare, so to speak, made understandable, without the again intractably-complex questions of fancy query planning and distributed ACID transactions. The psychological rewards of the sense of control are there, never mind the complex query; one can always hard code a point solution for the business question, if one really must — maybe even in map-reduce. Besides, for some things that go beyond SQL (for example, with graph structures), there really isn&#39;t a good solution. Now, enter Vertica, Greenplum, VectorWise (a MonetDB project derivative from Ingres) and Virtuoso, maybe others, who all propose some combination of SQL- and explicit map-reduce-style control structures. This is nice but better is possible. Here we find the next frontier of Semdata. Take Joe Hellerstein et al&#39;s work on declarative logic for the data centric data center. We have heard it many times — when the data is big, the logic must go to it. We can take declarative, location-conscious rules, à la BOOM and BLOOM, and combine these with the declarative query, well-defined semantics, parallel-database capability of the leading RDF stores. Merge this with locality compression and throughput from the best analytics DBMS. 
Here we have a data infrastructure that subsumes map-reduce as a special case of arbitrary distributed-parallel control flow, can send the processing to the data, and has flexible queries and schema-last capability. Further, since RDF more or less reduces to relational columns, the techniques of caching and reuse and materialized joins and demand-driven indexing, à la MonetDB, are applicable with minimal if any adaptation. Such a hybrid database-fusion frontier is relevant because it addresses heterogeneous, large-scale data, with operations that are not easy to reduce to SQL, still without loss of the advantages of SQL. Apply this to anything from enhancing the business intelligence process by faster integration, including integration with linked open data, to the map-reduce bulk processing of today. Do it with strong semantics and inference close to the data. In short, RDF stays relevant by tackling real issues, with scale second to none, and decisive advantages in time-to-integrate and expressive power. Last week I was at the LOD2 kick off and a LarKC meeting. The capabilities envisioned in this and the following post mirror our commitments to the EU co-funded LOD2 project. This week is VLDB and the Semdata workshop. I will talk more about how these trends are taking shape within the Virtuoso product development roadmap in future posts.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>This is a revised version of the talk I will be giving at the <a href="http://semdata.org/events/2010/vldb" id="link-id0x1d137fe0">Semdata workshop</a> at <a href="http://www.vldb2010.org/" id="link-id0x2533b280">VLDB 2010</a>.</p>

<p>
<a href="http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtDirectionsChallengesSemdata" id="link-id0x1cff6678">The paper</a> shows how we store <a href="http://www.tpc.org/" id="link-id0x244a65c0">TPC</a>-<a href="http://dbpedia.org/resource/TPC-H" id="link-id0x25136af8">H</a> <a href="http://dbpedia.org/resource/Data" id="link-id0x259a6460">data</a> as <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x268767b0">RDF</a> with relational-level efficiency and how we query both RDF and relational versions in comparable time. We also compare row-wise and column-wise storage formats as implemented in <a href="http://virtuoso.openlinksw.com" id="link-id0x2596dbc8">Virtuoso</a>.</p>

<p>A question that has come up a few times during the Semdata initiative is how semantic data will avoid the fate of other would-be database revolutions like OODBMS and deductive databases.</p>

<p>The need and opportunity are driven by the explosion of data in quantity and diversity of structure. The competition consists of analytics <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x2681bb10">RDBMS</a>, point solutions done with map-reduce or the like, and lastly in some cases from key-value stores with relaxed <a href="http://dbpedia.org/resource/Database_schema" id="link-id0x2493ca50">schema</a> but limited querying.</p>

<p>The benefits of RDF are the ever expanding volume of data published in it, reuse of vocabulary, and well-defined semantics. The downside is efficiency. This is not so much a matter of absolute scalability — you can run an RDF database on a cluster — but a question of relative cost as opposed to alternatives.</p>

<p>The baseline is that for relational-style queries, one should get relational performance or close enough. We outline in the paper how RDF reduces to a run-time-typed relational column-store, and gets all the compression and locality advantages traditionally associated with such. After memory is no longer the differentiator, the rest is engineering. So much for the scalability barrier to adoption.</p>

<p>I do not need to talk here about the benefits of <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x245f72e8">linked data</a> and more or less <i>ad hoc</i> integration <i>per se</i>. But again, to make these practical, there are logistics to resolve: How to keep data up to date? How to distribute it incrementally? How to monetize freshness? We propose some solutions for these, looking at diverse-RDF replication and RDB-to-RDF replication in Virtuoso.</p>

<p>But to realize the ultimate promise of RDF/Linked Data/Semdata, whatever we call it, we must look farther into the landscape of what is being done with big data. Here we are no longer so much running against the RDBMS, but against map-reduce and key-value stores.</p>

<p>Given the psychology of geekdom, the charm of map-reduce is understandable: One controls what is going on, can work in the usual languages, can run on big iron without being picked to pieces by the endless concurrency and timing and order-of-events issues one gets when programming a cluster. Tough for the best, and unworkable for the rest.</p>

<p>The key-value store has some of the same appeal, as it is the DBMS laid bare, so to speak, made understandable, without the again intractably-complex questions of fancy query planning and distributed <a href="http://dbpedia.org/resource/ACID" id="link-id0x25c9a008">ACID</a> transactions. The psychological rewards of the sense of control are there, never mind the complex query; one can always hard code a point solution for the business question, if one really must — maybe even in map-reduce.</p>

<p>Besides, for some things that go beyond <a href="http://dbpedia.org/resource/SQL" id="link-id0x25149078">SQL</a> (for example, with graph structures), there really isn&#39;t a good solution.</p>

<p>Now, enter <a href="http://www.vertica.com/" id="link-id0x268ecb90">Vertica</a>, <a href="http://dbpedia.org/resource/Greenplum" id="link-id0x25954eb8">Greenplum</a>, <a href="http://www.ingres.com/vectorwise/" id="link-id0x28cac500">VectorWise</a> (a <a href="http://dbpedia.org/resource/MonetDB" id="link-id0x28c239f8">MonetDB</a> project derivative from <a href="http://dbpedia.org/resource/Ingres" id="link-id0x24a2f498">Ingres</a>) and Virtuoso, maybe others, who all propose some combination of SQL- and explicit map-reduce-style control structures. This is nice but better is possible.</p>

<p>Here we find the next frontier of Semdata. Take <a href="http://dbpedia.org/resource/Joseph_M._Hellerstein" id="link-id0x257db7c0">Joe Hellerstein</a> et al&#39;s work on <a href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-90.html" id="link-id0x1c64ba98">declarative logic for the data centric data center</a>.</p>

<p>We have heard it many times — when the data is big, the logic must go to it. We can take declarative, location-conscious rules, <i>à la</i> <a href="http://www.eecs.berkeley.edu/Research/Projects/Data/105733.html" id="link-id0x29affc18">BOOM</a> and BLOOM, and combine these with the declarative query, well-defined semantics, parallel-database capability of the leading RDF stores. Merge this with locality compression and throughput from the best analytics DBMS.</p>

<p>Here we have a data infrastructure that subsumes map-reduce as a special case of arbitrary distributed-parallel control flow, can send the processing to the data, and has flexible queries and schema-last capability.</p>

<p>Further, since RDF more or less reduces to relational columns, the techniques of caching and reuse and materialized joins and demand-driven indexing, <i>à la</i> MonetDB, are applicable with minimal if any adaptation.</p>

<p>Such a hybrid database-fusion frontier is relevant because it addresses heterogeneous, large-scale data, with operations that are not easy to reduce to SQL, still without loss of the advantages of SQL. Apply this to anything from enhancing the business intelligence process by faster integration, including integration with <a href="http://community.linkeddata.org/dataspace/organization/lod#this" id="link-id0x268168c8">linked open data</a>, to the map-reduce bulk processing of today. Do it with strong semantics and inference close to the data.</p>

<p>In short, RDF stays relevant by tackling real issues, with scale second to none, and decisive advantages in time-to-integrate and expressive power.</p>

<p>Last week I was at the <a href="http://lod2.eu/" id="link-id0x29e23be8">LOD2</a> <a href="http://lod2.eu/BlogPost/9-press-release-lod2-project-launch.html" id="link-id0x1aec1c10">kick off</a> and a <a href="http://www.larkc.eu/" id="link-id0x245f1168">LarKC</a> meeting. The capabilities envisioned in this and the following post mirror our commitments to the EU co-funded LOD2 project. This week is VLDB and the Semdata workshop. I will talk more about how these trends are taking shape within the Virtuoso product development roadmap in future posts.</p>
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-04-14#1623">
  <rss:title>Transactional High Availability in Virtuoso Cluster Edition</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2010-04-14T22:21:52Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Introduction This post discusses the technical specifics of how we accomplish smooth transactional operation in a database server cluster under different failure conditions. (A higher-level short version was posted last week.) The reader is expected to be familiar with the basics of distributed transactions. Someone on a cloud computing discussion list called two-phase commit (2PC) the &quot;anti-availability protocol.&quot; There is indeed a certain anti-SQL and anti-2PC sentiment out there, with key-value stores and &quot;eventual consistency&quot; being talked about a lot. Indeed, if we are talking about wide-area replication over high-latency connections, then 2PC with synchronously-sharp transaction boundaries over all copies is not really workable. For multi-site operations, a level of eventual consistency is indeed quite unavoidable. Exactly what the requirements are depends on the application, so I will focus here on operations inside one site. The key-value store culture seems to focus on workloads where a record is relatively self-contained. The record can be quite long, with repeating fields, different selections of fields in consecutive records, and so forth. Such a record would typically be split over many tables of a relational schema. In the RDF world, such a record would be split even wider, with the information needed to reconstitute the full record almost invariably split over many servers. This comes from the mapping between the text of URIs and their internal IDs being partitioned in one way, and the many indices on the RDF quads each in yet another way. So it comes to pass that in the data models we are most interested in, the application-level entity (e.g., a user account in a social network) is not a contiguous unit with a single global identifier. 
The social network user account, that the key-value store would consider a unit of replication mastering and eventual consistency, will be in RDF or SQL a set of maybe hundreds of tuples, each with more than one index, nearly invariably spanning multiple nodes of the database cluster. So, before we can talk about wide-area replication and eventual consistency with application-level semantics, we need a database that can run on a fair-sized cluster and have cast-iron consistency within its bounds. If such a cluster is to be large and is to operate continuously, it must have some form of redundancy to cover for hardware failures, software upgrades, reboots, etc., without interruption of service. This is the point of the design space we are tackling here. Non Fault-Tolerant Operation There are two basic modes of operation we cover: bulk load, and online transactions. In the case of bulk load, we start with a consistent image of the database; load data; and finish by making another consistent image. If there is a failure during load, we lose the whole load, and restart from the initial consistent image. This is quite simple and is not properly transactional. It is quicker for filling a warehouse but is not to be used for anything else. In the remainder, we will only talk about online transactions. When all cluster nodes are online, operation is relatively simple. Each entry of each index belongs to a partition that is determined by the values of one or more partitioning columns of said index. There are no tables separate from indices; the relational row is on the index leaf of its primary key. Secondary indices reference the row by including the primary key. Blobs are in the same partition as the row which contains the blob. 
Each partition is then stored on a &quot;cluster node.&quot; In non fault-tolerant operations, each such cluster node is a single process with exclusive access to its own permanent storage, consisting of database files and logs; i.e., each node is a single server instance. It does not matter if the storage is local or on a SAN, the cluster node is still the only one accessing it. When things are not fault tolerant, transactions work as follows: When there are updates, two-phase commit is used to guarantee a consistent result. Each transaction is coordinated by one cluster node, which issues the updates in parallel to all cluster nodes concerned. Sending two update messages instead of one does not significantly impact latency. The coordinator of each transaction is the primary authority for the transaction&#39;s outcome. If the coordinator of the transaction dies between the phases of the commit, the transaction branches stay in the prepared state until the coordinator is recovered and can be asked again about the outcome of the transaction. Likewise, if a non-coordinating cluster node with a transaction branch dies between the phases, it will do a roll-forward and ask the coordinator for the outcome of the transaction. If cluster nodes occasionally crash and then recover relatively quickly, without ever losing transaction logs or database files, this is resilient enough. Everything is symmetrical; there are no cluster nodes with special functions, except for one master node that has the added task of resolving distributed deadlocks. I suppose our anti-SQL person called 2PC &quot;anti-availability&quot; because in the above situation we have the following problems: if any one cluster node is offline, it is quite likely that no transaction can be committed. This is so unless the data is partitioned on a key with application semantics, and all data touched by a transaction usually stays within a single partition. 
Then operations could proceed on most of the data while one cluster node was recovering. But, especially with RDF, this is never the case, since keys are partitioned in ways that have nothing to do with application semantics. Further, if one uses XA or Microsoft DTC with the monitor on a single box, this box can become a bottleneck and/or a single point of failure. (Among other considerations, this is why Virtuoso does not rely on any such monitor.) Further, if a cluster node dies never to be heard of again, leaving prepared but uncommitted transaction branches, the rest of the system has no way of telling what to do with them, again unless relying on a monitor that is itself liable to fail. If transactions have a real world counterpart, it is possible, at least in theory, to check the outcome against the real world state: One can ask a customer if an order was actually placed or a shipment delivered. But when a transaction has to do with internal identifiers of things, for example whether mailto://plaidskirt@hotdate.com has internal ID 0xacebabe, such a check against external reality is not possible. Fault-Tolerant Operation In a fault tolerant setting, we introduce the following extra elements: Cluster nodes are comprised of &quot;quorums&quot; of mutually-mirroring server instances. Each such quorum holds a partition of the data. Such a quorum typically consists of two server instances, but may have three for extra safety. If all server instances in the quorum are offline, then the cluster node is offline, and the cluster is not fully operational. If at least one server instance in a quorum is online, then the cluster node is online, and the cluster is operational and can process new transactions. We designate one cluster node (i.e., one quorum of 2 or 3 server instances) to act as a master node, and we set an order of precedence among its member instances. 
In addition to arbitrating distributed deadlocks, the master instance on duty will handle reports of server instance failures, and answer questions about any transactions left hanging in prepared state by a dead transaction coordinator. If the master on duty fails, the next master in line will either notice this itself in the line of normal business or get a complaint from another server instance about not being able to contact the previous master. There is no global heartbeat messaging per se, but since connections between server instances are reused long-term, a dropped connection will be noticed and the master on duty will be notified. If all masters are unavailable, that entire quorum (i.e., the master node) is offline and thus (as with any entire node going offline) most operations will fail anyway, unless by chance they do not hit any data managed by that failed quorum. When it receives a notice of unavailability, the master instance on duty tries to contact the unavailable server instance and if it fails, it will notify all remaining instances that that server instance is removed from the cluster. The effect is that the remaining server instances will stop attempting to access the failed instance. Updates to the partitions managed by the failed server instance are no longer sent to it, which results in updates to this data succeeding, as they are made against the other server instances in that quorum. Updates to the data of the failed server instance will fail in the window of time between the actual failure and the removal, which is typically well under a second. The removal of a failed server instance is delegated to a central authority in order not to have everybody get in each other&#39;s way when trying to effect the removal. If the failed server instance left prepared uncommitted transactions behind, the server instances having such branches will in due order contact the transaction coordinator to ask what should be done. 
This is a normal procedure for dealing with possibly dropped commit or rollback messages. When they discover that the coordinator has been removed, the master on duty will be contacted instead. Each prepare message of a transaction lists all the server instances participating in the transaction; thus the master can check whether each has received the prepare. If all have the prepare and none has an abort, the transaction is committed. The dead coordinator may not know this or may indeed not have the transaction logged, since it sends the prepares before logging its own prepare. The recovery will handle this though. We note that of the remaining branches, there is at least one copy of the branch with the failed server instance, or else we would have a whole quorum failed. In cases where there are branches participating in an unresolved transaction where all the quorum members have failed, the system cannot decide the outcome, and will periodically retry until at least one member of the failed quorum becomes available. The most complex part of the protocol is the recovery of a failed server instance. The recovery starts with a normal roll forward from the local transaction log. After this, the server instance will contact the master on duty to ask for its state. Typically, the master will reply that the recovering server instance had been removed and is out of date. When this is established, the recovering server instance will contact a live member of its quorum and ask for sync. The failed server instance has an approximate timestamp of its last received transaction. It knows this from the roll forward, where time markers are interspersed now and then between transaction records. The live partner then sends its transaction log(s) covering the time from a few seconds before the last transaction of the failed partner up to the present. 
A few transactions may get rolled forward twice but this does no harm, since these records have absolute values and no deltas and the second insert of a key is simply ignored. When the sender of the log reaches its last committed log entry, it asks the recovering server instance to confirm successful replay of the log so far. Having the confirmation, the sender will abort all unprepared transactions affecting it and will not accept any new ones until the sync is completed. If new transactions were committed between sending the last of the log and killing the uncommitted new transactions, these too are shipped to the recovering server instance in their committed or prepared state. When these are also confirmed replayed, the recovering server instance is in exact sync up to the transaction. The sender then notifies the rest of the cluster that the sync is complete and that the recovered server instance will be included in any updates of its slice of the data. The time between freeze and re-enable of transactions is the time to replay what came in between the first sync and finishing the freeze. Typically nothing came in, so the time is in milliseconds. If an application got its transaction killed in this maneuver, it will be seen as a deadlock. If the recovering server instance received transactions in prepared state, it will ask about their outcome as a part of the periodic sweep through pending transactions. One of these transactions could have been one originally prepared by itself, where the prepares had gone out before it had time to log the transaction. Thus, this eventuality too is covered and has a consistent outcome. Failures can interrupt the recovery process. The recovering server instance will have logged as far as it got, and will pick up from this point onward. Real time clocks on the host nodes of the cluster will have to be in approximate sync, within a margin of a minute or so. This is not a problem in a closely connected network. 
For simultaneous failure of an entire quorum of server instances (i.e., a set of mutually-mirroring partners; a cluster node), the rule is that the last one to fail must be the first to come back up. In order to have uninterrupted service across arbitrary double failures, one must store things in triplicate; statistically, however, most double failures will not hit cluster nodes of the same group. The protocol for recovery of failed server instances of the master quorum (i.e., the master cluster node) is identical, except that a recovering master will have to ask the other master(s) which one is more up to date. If the recovering master has a log entry of having excluded all other masters in its quorum from the cluster, it can come back online without asking anybody. If there is no such entry, it must ask the other master(s). If all had failed at the exact same instant, none has an entry of the other(s) being excluded and all will know that they are in the same state since any update to one would also have been sent to the other(s). Failure of Storage Media When a server instance fails, its permanent storage may or may not survive. Especially with mirrored disks, storage most often survives a failure. However, the survival of the database does not depend on any single server instance retaining any permanent storage over failure. If storage is left in place, as in the case of an OS reboot or replacing a faulty memory chip, rejoining the cluster is done based on the existing copy of the database on the server instance. If there is no existing copy, a copy can be taken from any surviving member of the same quorum. This consists of the following steps: First, a log checkpoint is forced on the surviving instance. Normally log checkpoints are done at regular intervals, independently on each server instance. The log checkpoint writes a consistent state of the database to permanent storage. 
The disk pages forming this consistent image will not be written to until the next log checkpoint. Therefore copying the database file is safe and consistent as long as a log checkpoint does not take place between the start and end of copy. Thus checkpoints are disabled right after the initial checkpoint. The copy can take a relatively long time; consider 20s per gigabyte on a 1GbE network a good day. At the end of copy, checkpoints are re-enabled on the surviving cluster node. The recovering database starts without a log, sees the timestamp of the checkpoint in the database, and asks for transactions from just before this time up to present. The recovery then proceeds as outlined above. Network Failures The CAP theorem states that Consistency, Availability, and Partition-tolerance do not mix. &quot;Partition&quot; here means the split of a network. It is trivially true that if the network splits so that on both sides there is a copy of each partition of the data, both sides will think themselves the live copy left online after the other died, and each will thus continue to accumulate updates. Such an event is not very probable within one site where all machines are redundantly connected to two independent switches. Most servers have dual 1GbE on the motherboard, and both ports should be used for cluster interconnect for best performance, with each attached to an independent switch. Both switches would have to fail in such a way as to split their respective network for a single-site network split to happen. Of course, the likelihood of a network split in multi-site situations is higher. One way of guarding against network splits is to require that at least one partition of the data have all copies online. Additionally, the master on duty can request each cluster node or server instance it expects to be online to connect to every other node or instance, and to report which they could reach. If the reports differ, there is a network problem. 
This procedure can be performed using both interfaces or only the first or second interface of each server to determine if one of the switches selectively blocks some paths. These simple sanity checks protect against arbitrary network errors. Using TCP for inter-cluster-node communication in principle protects against random message loss, but the Virtuoso cluster protocols do not rely on this. Instead, there are protocols for retry of any transaction messages and for using keep-alive messages on any long-running functions sent across the cluster. Failure to get a keep-alive message within a certain period will abort a query even if the network connections look OK. Backups, and Recovery from Loss of Entire Site For a constantly-operating distributed system, it is hard to define what exactly constitutes a consistent snapshot. The checkpointed state on each cluster node is consistent as far as this cluster node is concerned (i.e., it contains no uncommitted data), but the checkpointed states on all the cluster nodes are not from exactly the same moment in time. The complete state of a cluster is the checkpoint state of each cluster node plus the current transaction log of each. If the logs were shipped in real time to off-site storage, a consistent image could be reconstructed from them. Since such shipping cannot be synchronous due to latency considerations, some transactions could be received only in part in the event of a failure of the off-site link. Such partial transactions can however be detected at reconstruction time because each record contains the list of all participants of the transaction. If some piece is found missing, the whole can be discarded. In this way integrity is guaranteed but it is possible that a few milliseconds worth of transactions get lost. In these cases, the online client will almost certainly fail to get the final success message and will recheck the status after recovery. 
For business continuity purposes, a live feed of transactions can be constantly streamed off-site, for example to a cloud infrastructure provider. One low-cost virtual machine on the cloud will typically be enough for receiving the feed. In the event of long-term loss of the whole site, replacement servers can be procured on the cloud; thus, capital is not tied up in an aging inventory of spare servers. The cloud-based substitute can be maintained for the time it takes to rebuild an owned infrastructure, which is still at present more economical than a cloud-only solution. Switching a cluster from an owned site to the cloud could be accomplished in a few hours. The prerequisite of this is that there are reasonably recent snapshots of the database files, so that replay of logs does not take too long. The bulk of the time taken by such a switch would be in transferring the database snapshots from S3 or similar to the newly provisioned machines, formatting the newly provisioned virtual disks, etc. Rehearsing such a maneuver beforehand is quite necessary for predictable execution. We do not presently have a productized set of tools for such a switch, but can advise any interested parties on implementing and testing such a disaster recovery scheme. Conclusions In conclusion, we have shown how we can have strong transactional guarantees in a database cluster without single points of failure or performance penalties when compared with a non fault-tolerant cluster. Operator intervention is not required for anything short of hardware failure. Recovery procedures are simple, at most consisting of installing software and copying database files from a surviving cluster node. Unless permanent storage is lost in the failure, not even this is required. Real-time off-site log shipment can easily be added to these procedures to protect against site-wide failures. 
Future work may be directed toward concurrent operation of geographically-distributed data centers with eventual consistency. Such a setting would allow for migration between sites in the event of whole-site failures, and for reconciliation between inconsistent histories of different halves of a temporarily split network. Such schemes are likely to require application-level logic for reconciliation and cannot consist of an out-of-the-box DBMS alone. All techniques discussed here are application-agnostic and will work equally well for Graph Model (e.g., RDF) and Relational Model (e.g., SQL) workloads. Glossary Virtuoso Cluster (VC) -- a collection of Virtuoso Cluster Nodes on one or more machines, working in parallel as part of a Virtuoso Cluster. Virtuoso Cluster Node (VCN) -- a Virtuoso Server Instance (Non Fault-Tolerant Operations), or a Quorum of Server Instances (Fault Tolerant Operations), which is a member of a collection of Virtuoso Cluster Nodes working in parallel as part of a Virtuoso Cluster. Virtuoso Host Cluster (VHC) -- a collection of machines, each hosting one or more Virtuoso Server Instances, making up a Virtuoso Cluster. Virtuoso Host Cluster Node (VHCN) -- a machine hosting one or more Virtuoso Server Instances that are members of a Virtuoso Cluster. Virtuoso Server Instance (VSI) -- a single Virtuoso process with exclusive access to its own permanent storage, consisting of database files and logs. May comprise an entire Virtuoso Cluster Node (Non Fault-Tolerant Operations), or be one member of a quorum which comprises a Virtuoso Cluster Node (Fault Tolerant Operations). Also see Special Relativity and the Problem of Database Scalability (PDF), by James Starkey of NimbusDB, Inc.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<h2>Introduction</h2>

<p>This post discusses the technical specifics of how we accomplish smooth transactional operation in a database server cluster under different failure conditions. (<a href="http://www.openlinksw.com/weblog/oerling/?id=1621" id="link-id0x198e8e68">A higher-level short version</a> was posted last week.) The reader is expected to be familiar with the basics of <a href="http://dbpedia.org/resource/Distributed_transaction" id="link-id0x25088028">distributed transactions</a>.</p>

<p>Someone on a cloud computing discussion list called <a href="http://dbpedia.org/resource/Two-phase_commit_protocol" id="link-id0x21addd50">two-phase commit</a> (<a href="http://dbpedia.org/resource/Two-phase_commit_protocol" id="link-id0x1eb6bc90">2PC</a>) the &quot;anti-availability protocol.&quot; There is indeed a certain anti-<a href="http://dbpedia.org/resource/SQL" id="link-id0x28e1dbd0">SQL</a> and anti-2PC sentiment out there, with key-value stores and &quot;eventual consistency&quot; being talked about a lot. Indeed, if we are talking about wide-area replication over high-latency connections, then 2PC with synchronously-sharp transaction boundaries over all copies is not really workable.</p>

<p>For multi-site operations, a level of <i>eventual</i> consistency is indeed quite unavoidable. Exactly what the requirements are depends on the application, so I will focus here on operations inside one site.</p>

<p>The key-value store culture seems to focus on workloads where a record is relatively self-contained. The record can be quite long, with repeating fields, different selections of fields in consecutive records, and so forth. Such a record would typically be split over many tables of a relational <a href="http://dbpedia.org/resource/Database_schema" id="link-id0x216479f8">schema</a>. In the <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x672d740">RDF</a> world, such a record would be split even wider, with the <a href="http://dbpedia.org/resource/Information" id="link-id0x72c8ec0">information</a> needed to reconstitute the full record almost invariably split over many servers. This comes from the mapping between the text of URIs and their internal IDs being partitioned in one way, and the many indices on the RDF quads each in yet another way.</p>

<p>So it comes to pass that in the <a href="http://dbpedia.org/resource/Data" id="link-id0x216c6280">data</a> models we are most interested in, the application-level <a href="http://dbpedia.org/resource/Entity" id="link-id0x224444c0">entity</a> (<i>e.g.,</i> a user account in a social network) is not a contiguous unit with a single global identifier. The social network user account, that the key-value store would consider a unit of replication mastering and eventual consistency, will be in RDF or SQL a set of maybe hundreds of tuples, each with more than one index, nearly invariably spanning multiple nodes of the database cluster.</p>

<p>So, before we can talk about wide-area replication and eventual consistency with application-level semantics, we need a database that can run on a fair-sized cluster and have cast-iron consistency within its bounds. If such a cluster is to be large and is to operate continuously, it must have some form of redundancy to cover for hardware failures, software upgrades, reboots, etc., without interruption of service.</p>

<p>This is the point of the design space we are tackling here.</p>

<h2>Non Fault-Tolerant Operation</h2>

<p>There are two basic modes of operation we cover: bulk load, and online transactions.</p>

<p>In the case of bulk load, we start with a consistent image of the database; load data; and finish by making another consistent image. If there is a failure during load, we lose the whole load, and restart from the initial consistent image. This is quite simple and is not properly transactional. It is quicker for filling a warehouse but is not to be used for anything else. In the remainder, we will only talk about online transactions.</p>

<p>When all cluster nodes are online, operation is relatively simple. Each entry of each index belongs to a partition that is determined by the values of one or more partitioning columns of said index. There are no tables separate from indices; the relational row is on the index leaf of its primary key. Secondary indices reference the row by including the primary key. Blobs are in the same partition as the row which contains the blob. Each partition is then stored on a &quot;cluster node.&quot; In non fault-tolerant operations, each such cluster node is a single process with exclusive access to its own permanent storage, consisting of database files and logs; <i>i.e.,</i> each node is a single server instance. It does not matter if the storage is local or on a SAN, the cluster node is still the only one accessing it.</p>

<p>When things are not fault tolerant, transactions work as follows:</p>

<p>When there are updates, two-phase commit is used to guarantee a consistent result. Each transaction is coordinated by one cluster node, which issues the updates in parallel to all cluster nodes concerned. Sending two update messages instead of one does not significantly impact latency. The coordinator of each transaction is the primary authority for the transaction&#39;s outcome. If the coordinator of the transaction dies between the phases of the commit, the transaction branches stay in the prepared state until the coordinator is recovered and can be asked again about the outcome of the transaction. Likewise, if a non-coordinating cluster node with a transaction branch dies between the phases, it will do a roll-forward and ask the coordinator for the outcome of the transaction.</p>
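
<p>As a rough illustration of the commit flow just described, here is a minimal in-process sketch in Python. The <code>Participant</code> class and the <code>two_phase_commit</code> function are hypothetical names invented for this example and are not Virtuoso APIs; a real cluster sends these messages over the network, in parallel, and logs each step to disk.</p>

```python
class Participant:
    """Hypothetical stand-in for one cluster node holding a transaction branch."""

    def __init__(self, name, can_prepare=True):
        self.name = name
        self.can_prepare = can_prepare
        self.state = "active"

    def prepare(self):
        # Phase 1: promise to commit if asked (a real node would log first).
        self.state = "prepared" if self.can_prepare else "aborted"
        return self.can_prepare

    def finish(self, commit):
        # Phase 2: apply the coordinator's decision.
        self.state = "committed" if commit else "aborted"


def two_phase_commit(participants):
    """Return True if every branch prepared and the transaction committed."""
    # Every participant is asked, even after a "no" vote.
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    for p in participants:
        p.finish(decision)
    return decision
```

<p>The coordinator alone computes <code>decision</code>, which is why it is the primary authority for the outcome and why branches must wait in the prepared state if it disappears between the two phases.</p>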

<p>If cluster nodes occasionally crash and then recover relatively quickly, without ever losing transaction logs or database files, this is resilient enough. Everything is symmetrical; there are no cluster nodes with special functions, except for one master node that has the added task of resolving distributed deadlocks.</p>

<p>I suppose our anti-SQL person called 2PC &quot;anti-availability&quot; because in the above situation we have the following problems: if any one cluster node is offline, it is quite likely that no transaction can be committed. This is so unless the data is partitioned on a key with application semantics, and all data touched by a transaction usually stays within a single partition. Then operations could proceed on most of the data while one cluster node was recovering. But, especially with RDF, this is never the case, since keys are partitioned in ways that have nothing to do with application semantics. Further, if one uses XA or <a href="http://dbpedia.org/resource/Microsoft" id="link-id0x785bc50">Microsoft</a> DTC with the monitor on a single box, this box can become a bottleneck and/or a single point of failure. (Among other considerations, this is why <a href="http://virtuoso.openlinksw.com" id="link-id0x72a1ea8">Virtuoso</a> does not rely on any such monitor.) Further, if a cluster node dies never to be heard of again, leaving prepared but uncommitted transaction branches, the rest of the system has no way of telling what to do with them, again unless relying on a monitor that is itself liable to fail.</p>

<p>If transactions have a real world counterpart, it is possible, at least in theory, to check the outcome against the real world state: One can ask a customer if an order was actually placed or a shipment delivered. But when a transaction has to do with internal identifiers of things, for example whether <b><code>mailto://plaidskirt@hotdate.com</code></b> has internal ID <b><code>0xacebabe</code></b>, such a check against external reality is not possible.</p>

<h2>Fault-Tolerant Operation</h2>

<p>In a fault-tolerant setting, we introduce the following extra elements: Cluster nodes consist of &quot;quorums&quot; of mutually-mirroring server instances. Each such quorum holds a partition of the data. Such a quorum typically consists of two server instances, but may have three for extra safety. If all server instances in the quorum are offline, then the cluster node is offline, and the cluster is not fully operational. If at least one server instance in every quorum is online, then every cluster node is online, and the cluster is operational and can process new transactions.</p>
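
<p>The availability rule can be written down in a few lines. This is only a sketch of the rule as stated above; the function names and the dictionary layout are assumptions made for the example, not part of any Virtuoso configuration format.</p>

```python
def node_online(instances_up):
    """A cluster node (quorum) is online if any member instance is up."""
    return any(instances_up)


def cluster_operational(quorums):
    """The cluster is operational only if every quorum has a live member.

    `quorums` maps a hypothetical node name to a list of booleans, one per
    server instance in that quorum (True means the instance is online).
    """
    return all(node_online(members) for members in quorums.values())
```

<p>For example, <code>cluster_operational({&quot;A&quot;: [True, False], &quot;B&quot;: [False, True, False]})</code> holds, while losing both members of any one quorum makes it false.</p>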

<p>We designate one cluster node (<i>i.e.,</i> one quorum of 2 or 3 server instances) to act as a master node, and we set an order of precedence among its member instances. In addition to arbitrating distributed deadlocks, the master instance on duty will handle reports of server instance failures, and answer questions about any transactions left hanging in prepared state by a dead transaction coordinator. If the master on duty fails, the next master in line will either notice this itself in the line of normal business or get a complaint from another server instance about not being able to contact the previous master.</p>

<p>There is no global heartbeat messaging <i>per se,</i> but since connections between server instances are reused long-term, a dropped connection will be noticed and the master on duty will be notified. If all masters are unavailable, that entire quorum (<i>i.e.,</i> the master node) is offline and thus (as with any entire node going offline) most operations will fail anyway, unless by chance they do not hit any data managed by that failed quorum.</p>

<p>When it receives a notice of unavailability, the master instance on duty tries to contact the unavailable server instance and if it fails, it will notify all remaining instances that that server instance is removed from the cluster. The effect is that the remaining server instances will stop attempting to access the failed instance. Updates to the partitions managed by the failed server instance are no longer sent to it, which results in updates to this data succeeding, as they are made against the other server instances in that quorum. Updates to the data of the failed server instance <i>will</i> fail in the window of time between the actual failure and the removal, which is typically well under a second. The removal of a failed server instance is delegated to a central authority in order not to have everybody get in each other&#39;s way when trying to effect the removal.</p>

<p>If the failed server instance left prepared uncommitted transactions behind, the server instances having such branches will in due order contact the transaction coordinator to ask what should be done. This is a normal procedure for dealing with possibly dropped commit or rollback messages. When they discover that the coordinator has been removed, the master on duty will be contacted instead. Each prepare message of a transaction lists all the server instances participating in the transaction; thus the master can check whether each has received the prepare. If all have the prepare and none has an abort, the transaction is committed. The dead coordinator may not know this or may indeed not have the transaction logged, since it sends the prepares before logging its own prepare. The recovery will handle this though. We note that of the remaining branches, there is at least one copy of the branch with the failed server instance, or else we would have a whole quorum failed. In cases where there are branches participating in an unresolved transaction where all the quorum members have failed, the system cannot decide the outcome, and will periodically retry until at least one member of the failed quorum becomes available.</p>
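
<p>The decision the master on duty makes for such a hanging transaction might be sketched as follows. The state names and the function are hypothetical. The text above only states the commit rule and the undecidable case; treating a missing prepare like an abort is an assumption of this sketch.</p>

```python
def resolve_in_doubt(branch_states):
    """Decide the outcome of a transaction whose coordinator was removed.

    `branch_states` maps each participant listed in the prepare message to
    one of: "prepared", "aborted", "missing" (reachable, but it never got
    the prepare), or "unreachable" (its whole quorum is offline).
    """
    states = list(branch_states.values())
    if "unreachable" in states:
        # Cannot decide yet; ask again once a quorum member is back.
        return "retry-later"
    if "aborted" in states or "missing" in states:
        # Assumption of this sketch: anything short of "all prepared" rolls back.
        return "rollback"
    return "commit"
```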

<p>The most complex part of the protocol is the recovery of a failed server instance. The recovery starts with a normal roll forward from the local transaction log. After this, the server instance will contact the master on duty to ask for its state. Typically, the master will reply that the recovering server instance had been removed and is out of date. When this is established, the recovering server instance will contact a live member of its quorum and ask for sync. The failed server instance has an approximate timestamp of its last received transaction. It knows this from the roll forward, where time markers are interspersed now and then between transaction records. The live partner then sends its transaction log(s) covering the time from a few seconds before the last transaction of the failed partner up to the present. A few transactions may get rolled forward twice but this does no harm, since these records have absolute values and no deltas and the second insert of a key is simply ignored. When the sender of the log reaches its last committed log entry, it asks the recovering server instance to confirm successful replay of the log so far. Having the confirmation, the sender will abort all unprepared transactions affecting it and will not accept any new ones until the sync is completed. If new transactions were committed between sending the last of the log and killing the uncommitted new transactions, these too are shipped to the recovering server instance in their committed or prepared state. When these are also confirmed replayed, the recovering server instance is in exact sync up to the transaction. The sender then notifies the rest of the cluster that the sync is complete and that the recovered server instance will be included in any updates of its slice of the data. The time between freeze and re-enable of transactions is the time to replay what came in between the first sync and finishing the freeze. 
Typically nothing came in, so the time is in milliseconds. If an application got its transaction killed in this maneuver, it will be seen as a deadlock.</p>
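
<p>The reason a doubly-replayed transaction does no harm can be shown with a toy log. This is a sketch only: the dictionary stands in for the database and the <code>(key, value)</code> tuples for log records, neither of which reflects the actual Virtuoso log format.</p>

```python
def replay(store, log_records):
    """Apply (key, value) records that hold absolute values, not deltas."""
    for key, value in log_records:
        # Writing the same absolute value a second time changes nothing.
        store[key] = value
    return store


log = [("a", 1), ("b", 2), ("a", 3)]
once = replay({}, log)
# Replay the whole log, then replay an overlapping tail of it again.
twice = replay(replay({}, log), log[1:])
```

<p>Both <code>once</code> and <code>twice</code> end up in the same state. Had the records been deltas (for instance &quot;add 2 to a&quot;), the overlap of a few seconds would have corrupted the result.</p>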

<p>If the recovering server instance received transactions in prepared state, it will ask about their outcome as a part of the periodic sweep through pending transactions. One of these transactions could have been one originally prepared by itself, where the prepares had gone out before it had time to log the transaction. Thus, this eventuality too is covered and has a consistent outcome. Failures can interrupt the recovery process. The recovering server instance will have logged as far as it got, and will pick up from this point onward. Real time clocks on the host nodes of the cluster will have to be in approximate sync, within a margin of a minute or so. This is not a problem in a closely connected network.</p>

<p>For simultaneous failure of an entire quorum of server instances (<i>i.e.,</i> a set of mutually-mirroring partners; a cluster node), the rule is that the last one to fail must be the first to come back up. In order to have uninterrupted service across arbitrary double failures, one must store things in triplicate; statistically, however, most double failures will not hit cluster nodes of the same group.</p>

<p>The protocol for recovery of failed server instances of the master quorum (<i>i.e.,</i> the master cluster node) is identical, except that a recovering master will have to ask the other master(s) which one is more up to date. If the recovering master has a log entry of having excluded all other masters in its quorum from the cluster, it can come back online without asking anybody. If there is no such entry, it must ask the other master(s). If all had failed at the exact same instant, none has an entry of the other(s) being excluded and all will know that they are in the same state since any update to one would also have been sent to the other(s).</p>
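
<p>The start-up rule for a recovering master reduces to a set comparison. The function and argument names below are hypothetical, chosen only to illustrate the rule.</p>

```python
def master_may_start_alone(excluded_in_log, other_masters):
    """True if the local log shows that every other master was excluded.

    `excluded_in_log` is the set of masters this instance logged as removed
    from the cluster; `other_masters` lists the rest of its quorum.
    """
    return set(other_masters).issubset(excluded_in_log)
```

<p>If this returns false, the recovering master has to ask its partners which of them is more up to date before coming online.</p>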

<h2>Failure of Storage Media</h2>

<p>When a server instance fails, its permanent storage may or may not survive. Especially with mirrored disks, storage most often survives a failure. However, the survival of the database does not depend on any single server instance retaining any permanent storage over failure. If storage is left in place, as in the case of an OS reboot or replacing a faulty memory chip, rejoining the cluster is done based on the existing copy of the database on the server instance. If there is no existing copy, a copy can be taken from any surviving member of the same quorum. This consists of the following steps: First, a log checkpoint is forced on the surviving instance. Normally log checkpoints are done at regular intervals, independently on each server instance. The log checkpoint writes a consistent state of the database to permanent storage. The disk pages forming this consistent image will not be written to until the next log checkpoint. Therefore copying the database file is safe and consistent as long as a log checkpoint does not take place between the start and end of copy. Thus checkpoints are disabled right after the initial checkpoint. The copy can take a relatively long time; consider 20s per gigabyte on a 1GbE network a good day. At the end of copy, checkpoints are re-enabled on the surviving cluster node. The recovering database starts without a log, sees the timestamp of the checkpoint in the database, and asks for transactions from just before this time up to present. The recovery then proceeds as outlined above.</p>
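
<p>The ordering of these steps matters more than their details, so here is a small sketch of it. <code>RecordingSurvivor</code> and its three methods are hypothetical stand-ins that merely record calls; they do not correspond to actual Virtuoso commands.</p>

```python
class RecordingSurvivor:
    """Hypothetical stand-in that records the calls made against it."""

    def __init__(self):
        self.calls = []

    def checkpoint(self):
        self.calls.append("checkpoint")

    def disable_checkpoints(self):
        self.calls.append("disable")

    def enable_checkpoints(self):
        self.calls.append("enable")


def copy_from_survivor(survivor, copy_files):
    """Copy database files while the checkpointed image is held stable."""
    survivor.checkpoint()           # write a consistent image to disk
    survivor.disable_checkpoints()  # keep that image untouched during the copy
    try:
        copy_files()
    finally:
        # Re-enable checkpoints even if the copy fails part-way.
        survivor.enable_checkpoints()
```

<p>At the 20 s per gigabyte quoted above, copying 300 GB of database files would take about 6,000 s, or 100 minutes, which is the window during which the partition runs on a single copy.</p>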

<h2>Network Failures</h2>

<p>The CAP theorem states that Consistency, Availability, and Partition-tolerance do not mix. &quot;Partition&quot; here means the split of a network.</p>

<p>It is trivially true that if the network splits so that on both sides there is a copy of each partition of the data, both sides will think themselves the live copy left online after the other died, and each will thus continue to accumulate updates. Such an event is not very probable within one site where all machines are redundantly connected to two independent switches. Most servers have dual 1GbE on the motherboard, and both ports should be used for cluster interconnect for best performance, with each attached to an independent switch. Both switches would have to fail in such a way as to split their respective network for a single-site network split to happen. Of course, the likelihood of a network split in multi-site situations is higher.</p>

<p>One way of guarding against network splits is to require that at least one partition of the data have all copies online. Additionally, the master on duty can request each cluster node or server instance it expects to be online to connect to every other node or instance, and to report which they could reach. If the reports differ, there is a network problem. This procedure can be performed using both interfaces or only the first or second interface of each server to determine if one of the switches selectively blocks some paths. These simple sanity checks protect against arbitrary network errors. Using TCP for inter-cluster-node communication in principle protects against random message loss, but the Virtuoso cluster protocols do not rely on this. Instead, there are protocols for retry of any transaction messages and for using keep-alive messages on any long-running functions sent across the cluster. Failure to get a keep-alive message within a certain period will abort a query even if the network connections look OK. </p>
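
<p>Comparing the reachability reports could look roughly like this. The report format (a mapping from instance name to the set of peers it reached) is an assumption made for the example.</p>

```python
def find_unreachable_pairs(reports):
    """Return sorted (reporter, peer) pairs where reporter failed to reach peer.

    `reports` maps each instance name to the set of peers it could reach.
    Every instance is expected to reach every other instance in `reports`.
    """
    expected = set(reports)
    missing = []
    for reporter, reached in reports.items():
        for peer in expected - {reporter} - set(reached):
            missing.append((reporter, peer))
    return sorted(missing)
```

<p>An empty result means all reports agree. Running the same check once per network interface would show whether only one of the two switches is dropping paths.</p>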

<h2>Backups, and Recovery from Loss of Entire Site</h2>

<p>For a constantly-operating distributed system, it is hard to define what exactly constitutes a consistent snapshot. The checkpointed state on each cluster node is consistent as far as this cluster node is concerned (<i>i.e.,</i> it contains no uncommitted data), but the checkpointed states on all the cluster nodes are not from exactly the same moment in time. The complete state of a cluster is the checkpoint state of each cluster node plus the current transaction log of each. If the logs were shipped in real time to off-site storage, a consistent image could be reconstructed from them. Since such shipping cannot be synchronous due to latency considerations, some transactions could be received only in part in the event of a failure of the off-site link. Such partial transactions can however be detected at reconstruction time because each record contains the list of all participants of the transaction. If some piece is found missing, the whole can be discarded. In this way integrity is guaranteed but it is possible that a few milliseconds worth of transactions get lost. In these cases, the online client will almost certainly fail to get the final success message and will recheck the status after recovery.</p>
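
<p>Detecting partially received transactions at reconstruction time can be sketched as below. The record layout <code>(txn_id, node, participants)</code> is hypothetical; the only property taken from the text is that each record carries the full participant list.</p>

```python
def complete_transactions(records):
    """Return ids of transactions for which every participant's record arrived.

    Each record is (txn_id, node, participants), where `participants` is the
    full list of nodes that took part in the transaction.
    """
    seen = {}
    expected = {}
    for txn_id, node, participants in records:
        seen.setdefault(txn_id, set()).add(node)
        expected[txn_id] = set(participants)
    # Keep only transactions whose shipped pieces cover all participants.
    return sorted(t for t in seen if seen[t] == expected[t])
```

<p>Any transaction id not in the returned list has a missing piece and is discarded as a whole, which is what keeps the reconstructed image consistent.</p>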

<p>For business continuity purposes, a live feed of transactions can be constantly streamed off-site, for example to a cloud infrastructure provider. One low-cost virtual machine on the cloud will typically be enough for receiving the feed. In the event of long-term loss of the whole site, replacement servers can be procured on the cloud; thus, capital is not tied up in an aging inventory of spare servers. The cloud-based substitute can be maintained for the time it takes to rebuild an owned infrastructure, which is still at present more economical than a cloud-only solution.</p>

<p>Switching a cluster from an owned site to the cloud could be accomplished in a few hours. The prerequisite of this is that there are reasonably recent snapshots of the database files, so that replay of logs does not take too long. The bulk of the time taken by such a switch would be in transferring the database snapshots from S3 or similar to the newly provisioned machines, formatting the newly provisioned virtual disks, etc.</p>

<p>Rehearsing such a maneuver beforehand is quite necessary for predictable execution. We do not presently have a productized set of tools for such a switch, but can advise any interested parties on implementing and testing such a disaster recovery scheme.</p>

<h2>Conclusions</h2>

<p>In conclusion, we have shown how we can have strong transactional guarantees in a database cluster without single points of failure or performance penalties when compared with a non fault-tolerant cluster. Operator intervention is not required for anything short of hardware failure. Recovery procedures are simple, at most consisting of installing software and copying database files from a surviving cluster node. Unless permanent storage is lost in the failure, not even this is required. Real-time off-site log shipment can easily be added to these procedures to protect against site-wide failures.</p>

<p>Future work may be directed toward concurrent operation of geographically-distributed data centers with eventual consistency. Such a setting would allow for migration between sites in the event of whole-site failures, and for reconciliation between inconsistent histories of different halves of a temporarily split network. Such schemes are likely to require application-level logic for reconciliation and cannot consist of an out-of-the-box DBMS alone. All techniques discussed here are application-agnostic and will work equally well for Graph Model (<i>e.g.,</i> RDF) and Relational Model (<i>e.g.,</i> SQL) workloads.</p>

<h3>
<a href="http://dbpedia.org/resource/Glossary" id="link-id0x24f4e378">Glossary</a>
</h3>

<ul>
<li>
  <b>Virtuoso Cluster (VC)</b> -- a collection of Virtuoso Cluster Nodes on one or more machines, working in parallel as part of a Virtuoso Cluster.</li>
<li>
  <b>Virtuoso Cluster Node (VCN)</b> -- a Virtuoso Server Instance (Non Fault-Tolerant Operations), or a Quorum of Server Instances (Fault Tolerant Operations), which is a member of a collection of Virtuoso Cluster Nodes working in parallel as part of a Virtuoso Cluster.</li>
<li>
  <b>Virtuoso Host Cluster (VHC)</b> -- a collection of machines, each hosting one or more Virtuoso Server Instances, making up a Virtuoso Cluster.</li>
<li>
  <b>Virtuoso Host Cluster Node (VHCN)</b> -- a machine hosting one or more Virtuoso Server Instances that are members of a Virtuoso Cluster.</li>
<li>
  <b>Virtuoso Server Instance (VSI)</b> -- a single Virtuoso process with exclusive access to its own permanent storage, consisting of database files and logs.  May comprise an entire Virtuoso Cluster Node (Non Fault-Tolerant Operations), or be one member of a quorum which comprises a Virtuoso Cluster Node (Fault Tolerant Operations).</li>
</ul>

<h3>Also see</h3>
<ul>
 <li>
  <a href="http://www.gbcacm.org/sites/www.gbcacm.org/files/slides/SpecialRelativity[1]_0.pdf" id="link-id0x16cb22d8">Special Relativity and the Problem of Database Scalability (PDF)</a>, by James Starkey of <a href="http://www.nimbusdb.com/" id="link-id0x18f30d58">NimbusDB, Inc.</a>
 </li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-04-07#1621">
  <rss:title>Fault Tolerance in Virtuoso Cluster Edition (Short Version)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2010-04-07T16:40:02Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We have for some time had the option of storing data in a cluster in multiple copies, in the Commercial Edition of Virtuoso. (This feature is not in and is not planned to be added to the Open Source Edition.) Based on some feedback from the field, we decided to make this feature more user friendly. The gist of the matter is that failure and recovery processes have been automated so that neither application developer nor operating personnel needs any knowledge of how things actually work. So I will here make a few high level statements about what we offer for fault tolerance. I will follow up with technical specifics in another post. Three types of individuals need to know about fault tolerance: Executives: What does it cost? Will it really eliminate downtime? System Administrators: Is it hard to configure? What do I do when I get an alert? Application Developers/Programmers: Will I need to write extra code? Can old applications get fault tolerance with no changes? I will explain the matter to each of these three groups: Executives The value gained is elimination of downtime. The cost is in purchasing twice (or thrice) the hardware and software licenses. In reality, the cost is less since you get the whole money&#39;s worth of read throughput and half the money&#39;s worth of write throughput. Since most applications are about reading, this is a good deal. You do not end up paying for unused capacity. Server instances are grouped in &quot;quorums&quot; of two or, for extra safety, three; as long as one member of each quorum is available, the system keeps running and nobody sees a difference, except maybe for slower response. This does not protect against widespread power outage or the building burning down; the scope is limited to hardware and software failures at one site. The most basic site-wide disaster recovery plan consists of constantly streaming updates off-site. 
Using an off-site backup plus update stream, one can reconstitute the failed data center on a cloud provider in a few hours. Details will vary; please contact us for specifics. Running multiple sites in parallel is also possible but specifics will depend on the application. Again, please contact us if you have a specific case in mind. System Administrators To configure, divide your server instances into quorums of 2 or 3, according to which will be mirrors of each other, with each quorum member on a different host from the others in its quorum. These things are declared in a configuration file. Table definitions do not have to be altered for fault tolerance. It is enough for tables and indices to specify partitioning. Use two switches, and two NICs per machine, and connect one of each server&#39;s network cables to each switch, to cover switch failures. When things break, as long as there is at least one server instance up from each quorum, things will continue to work. Reboots and the like are handled without operator intervention; if there is a broken host, then remove it and put a spare in its place. If the disks are OK, put the old disks in the replacement host and start. If the disks are gone, then copy the database files from the live copy. Finally start the replacement database, and the system will do the rest. The system is online in read-write mode during all this time, including during copying. Having mirrored disks in individual hosts is optional since data will anyhow be in two copies. Mirrored disks will shorten the vulnerability window of running a partition on a single server instance since this will for the most part eliminate the need to copy many (hundreds) of GB of database files when recovering a failed instance. Application Developers/Programmers An application can connect to any server instance in the cluster and have access to the same data, with full ACID properties. 
There are two types of errors that can occur in any database application: The database server instance may be offline or otherwise unreachable; and a transaction may be aborted due to a deadlock. For the missing server instance, the application should try to reconnect. An ODBC/JDBC connect string can specify a list of alternate server instances; thus as long as the application is written to try to reconnect as best practices dictate, there is no new code needed. For the deadlock, the application is supposed to retry the transaction. Sometimes when a server instance drops out or rejoins a running cluster, some transactions will have to be retried. To the application, these conditions look like a deadlock. If the application handles deadlocks (SQL State 40001) as best practices dictate, there is no change needed. Conclusion In summary... Limited extra cost for fault tolerance; no equipment sitting idle. Easy operation: Replace servers when they fail; the cluster does the rest. No changes needed to most applications. No proprietary SQL APIs or special fault tolerance logic needed in applications. Fully transactional programming model. All the above applies to both the Graph Model (RDF) and Relational (SQL) sides of Virtuoso. These features will be in the commercial release of Virtuoso to be publicly available in the next 2-3 weeks. Please contact OpenLink Software Sales for details of availability or for getting advance evaluation copies. Glossary Virtuoso Cluster (VC) -- a collection of Virtuoso Cluster Nodes on one or more machines, working in parallel as part of a Virtuoso Cluster. Virtuoso Cluster Node (VCN) -- a Virtuoso Server Instance (Non Fault-Tolerant Operations), or a Quorum of Server Instances (Fault Tolerant Operations), which is a member of a collection of Virtuoso Cluster Nodes working in parallel as part of a Virtuoso Cluster. 
Virtuoso Host Cluster (VHC) -- a collection of machines, each hosting one or more Virtuoso Server Instances, making up a Virtuoso Cluster. Virtuoso Host Cluster Node (VHCN) -- a machine hosting one or more Virtuoso Server Instances that are members of a Virtuoso Cluster. Virtuoso Server Instance (VSI) -- a single Virtuoso process with exclusive access to its own permanent storage, consisting of database files and logs. May comprise an entire Virtuoso Cluster Node (Non Fault-Tolerant Operations), or be one member of a quorum which comprises a Virtuoso Cluster Node (Fault Tolerant Operations). Also see Special Relativity and the Problem of Database Scalability (PDF), by James Starkey of NimbusDB, Inc.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>For some time, the Commercial Edition of <a href="http://virtuoso.openlinksw.com" id="link-id0x25178ed0">Virtuoso</a> has offered the option of storing <a href="http://dbpedia.org/resource/Data" id="link-id0x28eb2178">data</a> in a cluster in multiple copies. (This feature is not in the Open Source Edition, and there are no plans to add it.)</p>

<p>Based on some feedback from the field, we decided to make this feature more user friendly. The gist of the matter is that failure and recovery processes have been automated so that neither application developers nor operations personnel need any <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x21fea428">knowledge</a> of how things actually work.</p>

<p>Here I will make a few high-level statements about what we offer for fault tolerance. I will follow up with technical specifics in another post.</p>

<p>Three types of individuals need to know about fault tolerance:</p>

<ul>
<li>Executives: What does it cost? Will it really eliminate downtime?</li>
<li>System Administrators: Is it hard to configure? What do I do when I get an alert?</li>
<li>Application Developers/Programmers: Will I need to write extra code? Can old applications get fault tolerance with no changes?</li>
</ul>

<p>I will explain the matter to each of these three groups:</p>

<h2>Executives</h2>

<p>The value gained is elimination of downtime. The cost is in purchasing twice (or thrice) the hardware and software licenses. In practice, the effective cost is lower, since you get the full money&#39;s worth of read throughput and half the money&#39;s worth of write throughput. Since most applications are mostly about reading, this is a good deal. You do not end up paying for unused capacity.</p>
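
<p>The arithmetic behind this can be sketched as follows. This is a simplified model, not a measurement: it assumes that any copy can serve a read and that every copy must apply each write, and it extrapolates the two-copy statement above to three copies. The function name <code>capacity_per_unit_cost</code> is made up for the illustration.</p>

```python
from fractions import Fraction


def capacity_per_unit_cost(copies):
    """Return (read, write) capacity per unit of spend for a given number of copies.

    Capacity and spend are both measured relative to one unreplicated cluster.
    """
    if copies not in (1, 2, 3):
        raise ValueError("the post only discusses 1, 2, or 3 copies")
    spend = copies          # hardware and licenses bought
    read_capacity = copies  # assumption: any copy can serve a read
    write_capacity = 1      # assumption: every copy must apply each write
    return Fraction(read_capacity, spend), Fraction(write_capacity, spend)


for copies in (1, 2, 3):
    read_share, write_share = capacity_per_unit_cost(copies)
    print(f"{copies} copies: read {read_share}, write {write_share} per unit of spend")
```

<p>Under these assumptions, two copies give the full money&#39;s worth of reads and half the money&#39;s worth of writes, which matches the statement above.</p>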

<p>Server instances are grouped in &quot;quorums&quot; of two or, for extra safety, three; as long as one member of each quorum is available, the system keeps running and nobody sees a difference, except maybe for slower response. This does not protect against widespread power outage or the building burning down; the scope is limited to hardware and software failures at one site.</p>
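
<p>The availability rule can be written down as a small check. This is only an illustration of the rule as stated above; the instance names and the function <code>cluster_is_available</code> are hypothetical and do not correspond to any Virtuoso API.</p>

```python
def cluster_is_available(quorums, up_instances):
    """True if every quorum has at least one member among the running instances."""
    return all(
        any(member in up_instances for member in quorum)
        for quorum in quorums
    )


# Two quorums: one of two instances, one of three (hypothetical names).
quorums = [["a1", "a2"], ["b1", "b2", "b3"]]

print(cluster_is_available(quorums, {"a2", "b3"}))  # True: one survivor per quorum
print(cluster_is_available(quorums, {"a1", "a2"}))  # False: the second quorum is entirely down
```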

<p>The most basic site-wide disaster recovery plan consists of constantly streaming updates off-site. Using an off-site backup plus update stream, one can reconstitute the failed data center on a cloud provider in a few hours. Details will vary; please <a href="http://www.openlinksw.com/contact/" id="link-id0x2bdb0db8">contact us</a> for specifics.</p>

<p>Running multiple sites in parallel is also possible but specifics will depend on the application. Again, please contact us if you have a specific case in mind.</p>

<h2>System Administrators</h2>

<p>To configure, divide your server instances into quorums of 2 or 3, according to which will be mirrors of each other, with each quorum member on a different host from the others in its quorum. These groupings are declared in a configuration file. Table definitions do not have to be altered for fault tolerance; it is enough for tables and indices to specify partitioning. To cover switch failures, use two switches and two NICs per machine, and connect one of each server&#39;s network cables to each switch.</p>
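
<p>The two layout rules (quorums of 2 or 3, no two members of a quorum on the same host) can be checked mechanically before deployment. The sketch below is not the Virtuoso configuration file format, which this post does not show; the dictionary layout, the names, and the function <code>check_quorum_layout</code> are assumptions made for the illustration.</p>

```python
def check_quorum_layout(quorums):
    """Return a list of rule violations for a planned quorum layout.

    `quorums` maps a quorum name to a list of (instance, host) pairs.
    """
    problems = []
    for name, members in quorums.items():
        if len(members) not in (2, 3):
            problems.append(f"{name}: expected 2 or 3 members, found {len(members)}")
        hosts = [host for _instance, host in members]
        if len(set(hosts)) != len(hosts):
            problems.append(f"{name}: members share a host")
    return problems


# Hypothetical plan: quorum2 wrongly places both mirrors on hostA.
layout = {
    "quorum1": [("inst1", "hostA"), ("inst2", "hostB")],
    "quorum2": [("inst3", "hostA"), ("inst4", "hostA")],
}

for problem in check_quorum_layout(layout):
    print(problem)
```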

<p>When things break, as long as at least one server instance from each quorum is up, things will continue to work. Reboots and the like are handled without operator intervention. If a host is broken, remove it and put a spare in its place. If the disks are OK, put the old disks in the replacement host and start it. If the disks are gone, copy the database files from the live copy. Finally, start the replacement database, and the system will do the rest. The system is online in read-write mode during all this time, including during copying.</p>

<p>Having mirrored disks in individual hosts is optional, since the data will in any case exist in two copies. Mirrored disks will, however, shorten the vulnerability window of running a partition on a single server instance, since they for the most part eliminate the need to copy hundreds of GB of database files when recovering a failed instance.</p>

<h2>Application Developers/Programmers</h2>

<p>An application can connect to any server instance in the cluster and have access to the same data, with full <a href="http://dbpedia.org/resource/ACID" id="link-id0x6451870">ACID</a> properties.</p>

<p>There are two types of errors that can occur in any database application: The database server instance may be offline or otherwise unreachable; and a transaction may be aborted due to a deadlock.</p>

<p>For the missing server instance, the application should try to reconnect. An <a href="http://dbpedia.org/resource/Open_Database_Connectivity" id="link-id0x28e859b8">ODBC</a>/<a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id0x28e11940">JDBC</a> connect string can specify a list of alternate server instances; thus, as long as the application already tries to reconnect, as best practices dictate, no new code is needed.</p>
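
<p>The effect of listing alternate server instances can be pictured with the following sketch. It is not driver code: in practice the list goes into the connect string and the driver does the work. The function <code>connect_with_failover</code>, the stand-in <code>fake_connect</code>, and the addresses are all hypothetical.</p>

```python
def connect_with_failover(addresses, connect):
    """Try each address in turn and return the first connection that succeeds."""
    last_error = None
    for address in addresses:
        try:
            return connect(address)
        except ConnectionError as error:
            last_error = error  # remember the failure and try the next alternate
    raise ConnectionError(f"no server instance reachable among {addresses}") from last_error


def fake_connect(address):
    """Stand-in for a real driver call; pretends the first host is down."""
    if address == "host1:1111":
        raise ConnectionError("unreachable")
    return f"connection to {address}"


connection = connect_with_failover(["host1:1111", "host2:1111"], fake_connect)
print(connection)
```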

<p>For the deadlock, the application is supposed to retry the transaction. Sometimes, when a server instance drops out of or rejoins a running cluster, some transactions will have to be retried. To the application, these conditions look like a deadlock. If the application handles deadlocks (<a href="http://dbpedia.org/resource/SQL" id="link-id0x2bda4e40">SQL</a> State 40001) as best practices dictate, no change is needed.</p>
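
<p>A retry loop of the kind meant here might look like the sketch below. How a driver reports the SQL state varies, so the exception class <code>TransactionAborted</code>, its <code>sqlstate</code> attribute, and the attempt limit are assumptions made for the illustration, not part of any particular ODBC or JDBC binding.</p>

```python
class TransactionAborted(Exception):
    """Hypothetical driver error that carries the SQL state of the failure."""

    def __init__(self, sqlstate):
        super().__init__(f"transaction aborted, SQL state {sqlstate}")
        self.sqlstate = sqlstate


def run_with_retry(transaction, max_attempts=5):
    """Run `transaction`, retrying only when it aborts with SQL state 40001."""
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except TransactionAborted as error:
            # Anything other than a deadlock-style abort, or running out of
            # attempts, is passed on to the caller unchanged.
            if error.sqlstate != "40001" or attempt == max_attempts:
                raise


attempts = {"count": 0}


def flaky_transaction():
    """Aborts twice with 40001, as if an instance had just rejoined, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] in (1, 2):
        raise TransactionAborted("40001")
    return "committed"


result = run_with_retry(flaky_transaction)
print(result, "after", attempts["count"], "attempts")
```

<p>The same loop covers both a genuine deadlock and a cluster membership change, since both surface to the application as SQL State 40001.</p>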

<h2>Conclusion</h2>

<p>In summary...</p>

<ul>
<li>Limited extra cost for fault tolerance; no equipment sitting idle.</li>
<li>Easy operation: Replace servers when they fail; the cluster does the rest.</li>
<li>No changes needed to most applications.</li>
<li>No proprietary SQL APIs or special fault tolerance logic needed in applications.</li>
<li>Fully transactional programming model.</li>
</ul>

<p>All the above applies to both the Graph Model (<a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x22606f10">RDF</a>) and Relational (SQL) sides of Virtuoso. These features will be in the commercial release of Virtuoso to be publicly available in the next 2-3 weeks. Please <a href="http://www.openlinksw.com/contact/" id="link-id0x24f35648">contact OpenLink Software</a> Sales for details of availability or for getting advance evaluation copies.</p>

<h3>
<a href="http://dbpedia.org/resource/Glossary" id="link-id0x6648890">Glossary</a>
</h3>

<ul>
<li>
  <b>Virtuoso Cluster (VC)</b> -- a collection of Virtuoso Cluster Nodes on one or more machines, working in parallel.</li>
<li>
  <b>Virtuoso Cluster Node (VCN)</b> -- a Virtuoso Server Instance (Non Fault-Tolerant Operations), or a Quorum of Server Instances (Fault Tolerant Operations), which is a member of a collection of Virtuoso Cluster Nodes working in parallel as part of a Virtuoso Cluster.</li>
<li>
  <b>Virtuoso Host Cluster (VHC)</b> -- a collection of machines, each hosting one or more Virtuoso Server Instances, making up a Virtuoso Cluster.</li>
<li>
  <b>Virtuoso Host Cluster Node (VHCN)</b> -- a machine hosting one or more Virtuoso Server Instances that are members of a Virtuoso Cluster.</li>
<li>
  <b>Virtuoso Server Instance (VSI)</b> -- a single Virtuoso process with exclusive access to its own permanent storage, consisting of database files and logs.  May comprise an entire Virtuoso Cluster Node (Non Fault-Tolerant Operations), or be one member of a quorum which comprises a Virtuoso Cluster Node (Fault Tolerant Operations).</li>
</ul>

<h3>Also see</h3>
<ul>
 <li>
  <a href="http://www.gbcacm.org/sites/www.gbcacm.org/files/slides/SpecialRelativity[1]_0.pdf" id="link-id0x1320f1e8">Special Relativity and the Problem of Database Scalability (PDF)</a>, by James Starkey of <a href="http://www.nimbusdb.com/" id="link-id0x1320f2b0">NimbusDB, Inc.</a>
 </li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-04-05#1619">
  <rss:title>&quot;The Acquired, The Innate, and the Semantic&quot; or &quot;Teaching Sem Tech&quot;</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2010-04-05T15:21:19Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">I was recently asked to write a section for a policy document touching the intersection of database and semantics, as a follow up to the meeting in Sofia I blogged about earlier. I will write about technology, but this same document also touches the matter of education and computer science curricula. Since the matter came up, I will share a few thoughts on the latter topic. I have over the years trained a few truly excellent engineers and managed a heterogeneous lot of people. These days, since what we are doing is in fact quite difficult and the world is not totally without competition, I find that I must stick to core competence, which is hardcore tech and leave management to those who have time for it. When younger, I thought that I could, through sheer personal charisma, transfer either technical skills, sound judgment, or drive and ambition to people I was working with. Well, to the extent I believed this, my own judgment was not sound. Transferring anything at all is difficult and chancy. I must here think of a fantasy novel where a wizard said that, &quot;working such magic that makes things do what they already want to do is easy.&quot; There is a grain of truth in that. In order to build or manage organizations, we must work, as the wizard put it, with nature, not against it. There are also counter-examples, for example my wife&#39;s grandmother had decided to transform a regular willow into a weeping one by tying down the branches. Such &quot;magic,&quot; needless to say, takes constant maintenance; else the spell breaks. To operate efficiently, either in business or education, we need to steer away from such endeavors. This is a valuable lesson, but now consider teaching this to somebody. Those who would most benefit from this wisdom are the least receptive to it. 
So again, we are reminded to stay away from the fantasy of being able to transfer some understanding we think to have and to have this take root. It will if it will and if it does not, it will take constant follow up, like the would-be weeping willow. Now, in more specific terms, what can we realistically expect to teach about computer science? Complexity of algorithms would be the first thing. Understanding the relative throughputs and latencies of the memory hierarchy (i.e., cache, memory, local network, disk, wide area network) is the second. Understanding the difference of synchronous and asynchronous and the cost of synchronization (i.e., anything from waiting for a mutex to waiting for a network message) is the third. Understanding how a database works would be immensely helpful for almost any application development task but this is probably asking too much. Then there is the question of engineering. Where do we put interfaces and what should these interfaces expose? Well, they certainly should expose multiple instances of whatever it is they expose, since passing through an interface takes time. I tried once to tell the SPARQL committee that parameterized queries and array parameters are a self-evident truism on the database side. This is an example of an interface that exposes multiple instances of what it exposes. But the committee decided not to standardize these. There is something in the &quot;semanticist&quot; mind that is irrationally antagonistic to what is self-evident for databasers. This is further an example of ignoring precept 2 above, the point about the throughputs and latencies in the memory hierarchy. Nature is a better and more patient teacher than I; the point will become clear of itself in due time, no worry. Interfaces seem to be overvalued in education. This is tricky because we should not teach that interfaces are bad either. Nature has islands of tightly intertwined processes, separated by fairly narrow interfaces. 
People are taught to think in block diagrams, so they probably project this also where it does not apply, thereby missing some connections and porosity of interfaces. LarKC (EU FP7 Large Knowledge Collider project) is an exercise in interfaces. The lessons so far are that coupling needs to be tight, and that the roles of the components are not always as neatly separable as the block diagram suggests. Recognizing the points where interfaces are naturally narrow is very difficult. Teaching this in a curriculum is likely impossible. This is not to say that the matter should not be mentioned and examples of over-&quot;paradigmatism&quot; given. The geek mind likes to latch on to a paradigm (e.g., object orientation), and then they try to put it everywhere. It is safe to say that taking block diagrams too naively or too seriously makes for poor performance and needless code. In some cases, block diagrams can serve as tactical disinformation; i.e., you give lip service to the values of structure, information hiding, and reuse, which one is not allowed to challenge, ever, and at the same time you do not disclose the competitive edge, which is pretty much always a breach of these same principles. I was once at a data integration workshop in the US where some very qualified people talked about the process of science. They had this delightfully American metaphor for it: The edge is created in the &quot;Wild West&quot; — there are no standards or hard-and-fast rules, and paradigmatism for paradigmatism&#39;s sake is a laughing matter with the cowboys in the fringe where new ground is broken. Then there is the OK Corral, where the cowboys shoot it out to see who prevails. Then there is Dodge City, where the lawman already reigns, and compliance, standards, and paradigms are not to be trifled with, lest one get the tar-and-feather treatment and be &quot;driven out o&#39;Dodge.&quot; So, if reality is like this, what attitude should the curriculum have towards it? 
Do we make innovators or followers? Well, as said before, they are not made. Or if they are made, they are not at least made in the university but much before that. I never made any of either, in spite of trying, but did meet many of both kinds. The education system needs to recognize individual differences, even though this is against the trend of turning out a standardized product. Enforced mediocrity makes mediocrity. The world has an amazing tolerance for mediocrity, it is true. But the edge is not created with this, if edge is what we are after. But let us move to specifics of semantic technology. What are the core precepts, the equivalent of the complexity/memory/synchronization triangle of general purpose CS basics? Let us not forget that, especially in semantic technology, when we have complex operations, lots of data, and almost always multiple distributed data sources, forgetting the laws of physics carries an especially high penalty. Know when to ontologize, when to folksonomize. The history of standards has examples of &quot;stacks of Babel,&quot; sky-high and all-encompassing, which just result in non-communication and non-adoption. Lighter weight, community driven, tag folksonomy, VoCamp-style approaches can be better. But this is a judgment call, entirely contextual, having to do with the maturity of the domain of discourse, etc. Answer only questions that are actually asked. This precept is two-pronged. The literal interpretation is not to do inferential closure for its own sake, materializing all implied facts of the knowledge base. The broader interpretation is to take real-world problems. Expanding RDFS semantics with map-reduce and proving how many iterations this will take is a thing one can do but real-world problems will be more complex and less neat. Deal with ambiguity. Data on which semantic technologies will be applied will be dirty, with errors from machine processing of natural language to erroneous human annotations. 
The knowledge bases will not be contradiction free. Michael Witbrock of CYC said many good things about this in Sofia; he would have something to say about a curriculum, no doubt. Here we see that semantic technology is a younger discipline than computer science. We can outline some desirable skills and directions to follow but the idea of core precepts is not as well formed. So we can approach the question from the angle of needed skills more than of precepts of science. What should the certified semantician be able to do? Data integration. Given heterogenous relational schemas talking about the same entities, the semantician should find existing ontologies for the domain, possibly extend these, and then map the relational data to them. After the mapping is conceptually done, the semantician must know what combination of ETL and on-the-fly mapping fits the situation. This does mean that the semantician indeed must understand databases, which I above classified as an almost unreachable ideal. But there is no getting around this. Data is increasingly what makes the world go round. From this it follows that everybody must increasingly publish, consume, and refine, i.e., integrate. The anti-database attitude of the semantic web community simply has to go. Design and implement workflows for content extraction, e.g., NLP or information extraction from images. This also means familiarity with NLP, desirably to the point of being able to tune the extraction rule sets of various NLP frameworks. Design SOA workflows. The semantician should be able to extract and represent the semantics of business transactions and the data involved therein. Lightweight knowledge engineering. The experience of building expert systems from the early days of AI is not the best possible, but with semantics attached to data, some sort of rules seem about inevitable. The rule systems will merge into the DBMS in time. 
Some ability to work with these, short of making expert systems, will be desirable. Understand information quality in the sense of trust, provenance, errors in the information, etc. If the world is run based on data analytics, then one must know what the data in the warehouse means, what accidental and deliberate errors it contains, etc. Of course, most of these tasks take place at some sort of organizational crossroads or interface. This means that the semantician must have some project management skills; must be capable of effectively communicating with different publics and simply getting the job done, always in the face of organizational inertia and often in the face of active resistance from people who view the semantician as some kind of intruder on their turf. Now, this is a tall order. The semantician will have to be reasonably versatile technically, reasonably clever, and a self-starter on top. The self-starter aspect is the hardest. The semanticists I have met are more of the scholar than the IT consultant profile. I say semanticist for the semantic web research people and semantician for the practitioner we are trying to define. We could start by taking people who already do data integration projects and educating them in some semantic technology. We are here talking about a different breed than the one that by nature gravitates to description logics and AI. Projecting semanticist interests or attributes on this public is a source of bias and error. If we talk about a university curriculum, the part that cannot be taught is the leadership and self-starter aspect, or whatever makes a good IT consultant. Thus the semantic technology studies must be profiled so as to attract people with this profile. As quoted before, the dream job for each era is a scarce skill that makes value from something that is plentiful in the environment. 
At this moment and for a few moments to come, this is the data geek, or maybe even semantician profile, if we take data geek past statistics and traditional business intelligence skills. The semantic tech community, especially the academic branch of it, needs to reinvent itself in order to rise to this occasion. The flavor of the dream job curriculum will be away from the theoretical computer science towards the hands-on of database, large systems performance, and the practicalities of getting data intensive projects delivered. Related Linked Data Driven Data Virtualization for Web-scale Integration (presentation) Linked Data and Virtuoso in 2010 Getting The Linked Data Value Pyramid Layers Right Provenance and Reification in Virtuoso The Time for RDBMS Primacy Downgrade is Nigh! Aspects of RDF to RDF Mapping</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>I was recently asked to write a section for a policy document touching the intersection of database and semantics, as a follow up to the meeting in Sofia I <a href="http://www.openlinksw.com/weblog/oerling/?id=1614" id="link-id0x19c4f938">blogged about earlier</a>. I will write about technology, but this same document also touches the matter of education and computer science curricula. Since the matter came up, I will share a few thoughts on the latter topic.</p>

<p>I have over the years trained a few truly excellent engineers and managed a heterogeneous lot of people. These days, since what we are doing is in fact quite difficult and the world is not totally without competition, I find that I must stick to my core competence, which is hardcore tech, and leave management to those who have time for it.</p>

<p>When younger, I thought that I could, through sheer personal charisma, transfer either technical skills, sound judgment, or drive and ambition to people I was working with. Well, to the extent I believed this, my own judgment was not sound. Transferring anything at all is difficult and chancy. I must here think of a fantasy novel where a wizard said that, &quot;working such magic that makes things do what they already want to do is easy.&quot; There is a grain of truth in that.</p>

<p>In order to build or manage organizations, we must work, as the wizard put it, <i>with</i> nature, not against it. There are also counter-examples: my wife&#39;s grandmother once decided to transform a regular willow into a weeping one by tying down the branches. Such &quot;magic,&quot; needless to say, takes constant maintenance; else the spell breaks.</p>

<p>To operate efficiently, either in business or education, we need to steer away from such endeavors. This is a valuable lesson, but now consider teaching this to somebody. Those who would most benefit from this wisdom are the least receptive to it. So again, we are reminded to stay away from the fantasy of being able to transfer some understanding we think we have and have it take root. It will if it will, and if it does not, it will take constant follow-up, like the would-be weeping willow.</p>

<p>Now, in more specific terms, what can we realistically expect to teach about computer science?</p>

<p>Complexity of algorithms would be the first thing. Understanding the relative throughputs and latencies of the memory hierarchy (i.e., <a href="http://dbpedia.org/resource/Cache" id="link-id0x13fcc8b8">cache</a>, memory, local network, disk, wide area network) is the second. Understanding the difference between synchronous and asynchronous operation, and the cost of synchronization (i.e., anything from waiting for a mutex to waiting for a network message), is the third.</p>

<p>Understanding how a database works would be immensely helpful for almost any application development task but this is probably asking too much.</p>

<p>Then there is the question of engineering. Where do we put interfaces and what should these interfaces expose? Well, they certainly should expose multiple instances of whatever it is they expose, since passing through an interface takes time.</p>

<p>I tried once to tell the <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x72d7490">SPARQL</a> committee that parameterized queries and array parameters are a self-evident truism on the database side. This is an example of an interface that exposes multiple instances of what it exposes. But the committee decided not to standardize these. There is something in the &quot;semanticist&quot; mind that is irrationally antagonistic to what is self-evident for databasers. This is further an example of ignoring precept 2 above, the point about the throughputs and latencies in the memory hierarchy. Nature is a better and more patient teacher than I; the point will become clear of itself in due time, no worry.</p>
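
<p>The point about interfaces exposing multiple instances can be illustrated with a toy model that counts crossings of the interface. The class <code>CountingStore</code> and its methods are invented for this sketch and stand for no real SPARQL or SQL client; the only claim is that an array parameter turns many round trips into one.</p>

```python
class CountingStore:
    """Toy data store that counts how often its interface is crossed."""

    def __init__(self, rows):
        self.rows = rows
        self.round_trips = 0

    def lookup(self, key):
        """One crossing of the interface per key."""
        self.round_trips += 1
        return self.rows.get(key)

    def lookup_many(self, keys):
        """One crossing of the interface for a whole array of keys."""
        self.round_trips += 1
        return [self.rows.get(key) for key in keys]


rows = {"a": 1, "b": 2, "c": 3}
keys = ["a", "b", "c"]

one_by_one = CountingStore(rows)
single_results = [one_by_one.lookup(key) for key in keys]

batched = CountingStore(rows)
batch_results = batched.lookup_many(keys)

print(single_results == batch_results)               # True: same answers either way
print(one_by_one.round_trips, batched.round_trips)   # 3 versus 1
```

<p>When each crossing costs a network message, as in precept 2 above, the difference between the two counts is where the time goes.</p>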

<p>Interfaces seem to be overvalued in education. This is tricky because we should not teach that interfaces are bad either. Nature has islands of tightly intertwined processes, separated by fairly narrow interfaces. People are taught to think in block diagrams, so they probably project this also where it does not apply, thereby missing some connections and porosity of interfaces.</p>

<p>
<a href="http://www.larkc.eu/" id="link-id0x1c5591f0">LarKC</a> (EU FP7 Large <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x15fae798">Knowledge</a> Collider project) is an exercise in interfaces. The lessons so far are that coupling needs to be tight, and that the roles of the components are not always as neatly separable as the block diagram suggests.</p>

<p>Recognizing the points where interfaces are naturally narrow is very difficult. Teaching this in a curriculum is likely impossible. This is not to say that the matter should not be mentioned and examples of over-&quot;paradigmatism&quot; given. The geek mind likes to latch on to a paradigm (e.g., object orientation), and then tries to put it everywhere. It is safe to say that taking block diagrams too naively or too seriously makes for poor performance and needless code. In some cases, block diagrams can serve as tactical disinformation; i.e., you pay lip service to the values of structure, <a href="http://dbpedia.org/resource/Information" id="link-id0x6f03e90">information</a> hiding, and reuse, which one is not allowed to challenge, ever, and at the same time you do not disclose the competitive edge, which is pretty much always a breach of these same principles.</p>

<p>I was once at a <a href="http://dbpedia.org/resource/Data" id="link-id0x1d524ce0">data</a> integration workshop in the US where some very qualified people talked about the process of science. They had this delightfully American metaphor for it:</p>

<blockquote>
<i>The edge is created in the &quot;Wild West&quot; — there are no standards or hard-and-fast rules, and paradigmatism for paradigmatism&#39;s sake is a laughing matter with the cowboys in the fringe where new ground is broken. Then there is the OK Corral, where the cowboys shoot it out to see who prevails. Then there is Dodge City, where the lawman already reigns, and compliance, standards, and paradigms are not to be trifled with, lest one get the tar-and-feather treatment and be &quot;driven out o&#39;Dodge.&quot;</i>
</blockquote>

<p>So, if reality is like this, what attitude should the curriculum have towards it? Do we make innovators or followers? Well, as said before, they are not made. Or if they are made, they are at least not made in the university but much earlier. I never made any of either, in spite of trying, but did meet many of both kinds. The education system needs to recognize individual differences, even though this is against the trend of turning out a standardized product. Enforced mediocrity makes mediocrity. The world has an amazing tolerance for mediocrity, it is true. But the edge is not created with this, if edge is what we are after.</p>

<p>But let us move to specifics of semantic technology. What are the core precepts, the equivalent of the complexity/memory/synchronization triangle of general purpose CS basics? Let us not forget that, especially in semantic technology, when we have complex operations, lots of data, and almost always multiple distributed data sources, forgetting the laws of physics carries an especially high penalty.</p>

<ul>
 <li>
  <p>
    <b>Know when to ontologize, when to folksonomize.</b> The history of standards has examples of &quot;stacks of Babel,&quot; sky-high and all-encompassing, which just result in non-communication and non-adoption. Lighter weight, community driven, <a href="http://dbpedia.org/resource/Tag" id="link-id0x1dbd9018">tag</a> folksonomy, VoCamp-style approaches can be better. But this is a judgment call, entirely contextual, having to do with the maturity of the domain of discourse, etc.</p>
 </li>

<li>
  <p>
    <b>Answer only questions that are actually asked.</b> This precept is two-pronged. The literal interpretation is not to do inferential closure for its own sake, materializing all implied facts of the knowledge base.</p>

<p>The broader interpretation is to take real-world problems. Expanding RDFS semantics with map-reduce and proving how many iterations this will take is a thing one can do but real-world problems will be more complex and less neat.</p>
</li>

<li>
  <p>
    <b>Deal with ambiguity.</b> Data on which semantic technologies will be applied will be dirty, with errors from machine processing of natural language to erroneous human annotations. The knowledge bases will not be contradiction free. Michael Witbrock of CYC said many good things about this in Sofia; he would have something to say about a curriculum, no doubt.</p>
</li>
</ul>

<p>Here we see that semantic technology is a younger discipline than computer science. We can outline some desirable skills and directions to follow but the idea of core precepts is not as well formed.</p>

<p>So we can approach the question from the angle of needed skills more than of precepts of science. What should the certified semantician be able to do?</p>

<ul>
 <li>
  <p>
    <b>Data integration.</b> Given heterogeneous relational schemas talking about the same entities, the semantician should find existing ontologies for the domain, possibly extend these, and then map the relational data to them. After the mapping is conceptually done, the semantician must know what combination of ETL and on-the-fly mapping fits the situation. This does mean that the semantician indeed must understand databases, which I classified above as an almost unreachable ideal. But there is no getting around this. Data is increasingly what makes the world go round. From this it follows that everybody must increasingly publish, consume, and refine, i.e., integrate. The anti-database attitude of the <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x2038d520">semantic web</a> community simply has to go.</p>
 </li>

<li>
  <p>
    <b>Design and implement workflows for content extraction,</b> e.g., <a href="http://dbpedia.org/resource/Natural_language_processing" id="link-id0x713cdc0">NLP</a> or information extraction from images. This also means familiarity with NLP, desirably to the point of being able to tune the extraction rule sets of various NLP frameworks.</p>
</li>

<li>
  <p>
    <b>Design SOA workflows.</b> The semantician should be able to extract and represent the semantics of business transactions and the data involved therein.</p>
</li>

<li>
  <p>
    <b>Lightweight knowledge engineering.</b> The experience of building expert systems from the early days of AI is not the best possible, but with semantics attached to data, some sort of rules seems all but inevitable. The rule systems will merge into the DBMS in time. Some ability to work with these, short of making expert systems, will be desirable.</p>
</li>

<li>
  <p>
    <b>Understand information quality</b> in the sense of trust, provenance, errors in the information, etc. If the world is run based on data analytics, then one must know what the data in the warehouse means, what accidental and deliberate errors it contains, etc.</p>
</li>
</ul>
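<p>To make the data integration item concrete, here is a minimal sketch of mapping rows from two differently shaped schemas to one shared vocabulary. The table names, column names, and the <code>http://example.org/</code> base IRI are hypothetical; only the FOAF property IRIs are real. It also assumes the two schemas use the same key values for the same customer, which in practice is the hard part.</p>

```python
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
FOAF_MBOX = "http://xmlns.com/foaf/0.1/mbox"

# Hypothetical column-to-property mappings for two schemas describing customers.
MAPPINGS = {
    "crm_customer": {
        "key": "cust_id",
        "columns": {"full_name": FOAF_NAME, "email": FOAF_MBOX},
    },
    "billing_client": {
        "key": "client_no",
        "columns": {"client_nm": FOAF_NAME},
    },
}


def row_to_triples(table, row, base="http://example.org/customer/"):
    """Turn one relational row into (subject, predicate, object) triples."""
    spec = MAPPINGS[table]
    # Assumption: both schemas share key values, so the same customer
    # gets the same subject IRI regardless of the source table.
    subject = base + str(row[spec["key"]])
    return [
        (subject, predicate, row[column])
        for column, predicate in spec["columns"].items()
        if row.get(column) is not None  # skip NULL columns
    ]


crm = row_to_triples("crm_customer", {"cust_id": 7, "full_name": "Ada", "email": None})
billing = row_to_triples("billing_client", {"client_no": 7, "client_nm": "Ada"})
```

<p>The same function could run in a batch as part of ETL or be called per row at query time; which of the two fits is the judgment call described above.</p>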

<p>Of course, most of these tasks take place at some sort of organizational crossroads or interface. This means that the semantician must have some project management skills; must be capable of effectively communicating with different publics and simply getting the job done, always in the face of organizational inertia and often in the face of active resistance from people who view the semantician as some kind of intruder on their turf.</p>

<p>Now, this is a tall order. The semantician will have to be reasonably versatile technically, reasonably clever, and a self-starter on top. The self-starter aspect is the hardest.</p>

<p>The semanticists I have met are more of the scholar than the IT consultant profile. I say <i>semanticist</i> for the semantic web research people and <i>semantician</i> for the practitioner we are trying to define.</p>

<p>We could start by taking people who already do data integration projects and educating them in some semantic technology. We are here talking about a different breed than the one that by nature gravitates to description logics and AI. Projecting semanticist interests or attributes on this public is a source of bias and error.</p>

<p>If we talk about a university curriculum, the part that cannot be taught is the leadership and self-starter aspect, or whatever makes a good IT consultant. Thus the semantic technology studies must be profiled so as to attract people with this profile. As quoted before, the dream job for each era is a scarce skill that makes value from something that is plentiful in the environment. At this moment and for a few moments to come, this is the data geek, or maybe even semantician profile, if we take data geek past statistics and traditional business intelligence skills.</p>

<p>The semantic tech community, especially the academic branch of it, needs to reinvent itself in order to rise to this occasion. The flavor of the dream job curriculum will shift away from theoretical computer science and towards hands-on work with databases, large systems performance, and the practicalities of getting data intensive projects delivered.</p>

<p>
<b>Related</b>
</p>
<ul>
 <li>
  <a href="http://virtuoso.openlinksw.com/presentations/Linked_Data_Virtualization/Linked_Data_Virtualization.html" id="link-id0x199aca78">Linked Data Driven Data Virtualization for Web-scale Integration (presentation)</a>
 </li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1603" id="link-id0x13297a70">Linked Data and Virtuoso in 2010</a>
</li>
<li>
  <a href="http://www.openlinksw.com/blog/~kidehen/?id=1595" id="link-id0x1a3d0bd0">Getting The Linked Data Value Pyramid Layers Right</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1572" id="link-id0x1802b170">Provenance and Reification in Virtuoso</a>
</li>
<li>
  <a href="http://www.openlinksw.com/blog/~kidehen/?id=1519" id="link-id0x19af4220">The Time for RDBMS Primacy Downgrade is Nigh!</a>
</li>
<li>
  <a href="http://www.openlinksw.com/weblog/oerling/?id=1375" id="link-id0x1a07a378">Aspects of RDF to RDF Mapping</a>
</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-04-02#1617">
  <rss:title>Upcoming RDF Loader in Unclustered Virtuoso loads Uniprot at 279 Ktriples/s!</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2010-04-02T14:15:01Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We recently heard that Oracle 11G loaded RDF faster than we did. Now, we never thought the speed of loading a database was as important as the speed of query results, but since this is the sole area where they have reportedly been tested as faster, we decided it was time loading was addressed. Indeed, without Oracle to challenge us on query performance, we would not be half as good as we are. So, spurred on by the Oracular influence, we did something about our RDF loading. Performance, I have said before, is a matter of locality and parallelism. So we applied both to the otherwise quite boring exercise of loading RDF. The recipe is this: Take a large set of triples; resolve the IRIs and literals into their IDs; then insert each index of the triple table on its own thread. All the lookups and inserts are first sorted in key order to get the locality. Running the indices in parallel gets the parallelism. Then run the parser on its own thread, fetching chunks of consecutive triples and queueing them for a pool of loader threads. Then run several parsers concurrently on different files so as to make sure there is work enough at all times. Do not make many more process threads than available CPU threads, since they would just get in each other&#39;s way. The whole process is non-transactional, starting from a checkpoint and ending with a checkpoint. The test system was a dual-Xeon 5520 with 72G RAM. The Virtuoso was a single server; no cluster capability was used. We loaded English Dbpedia, 179M triples, in 15 minutes, for a rate of 198 Kt/s. Uniprot with 1.33 G triples loaded in 79 minutes, for 279 Kt/s. The source files were the Dbpedia 3.4 English files and the Bio2RDF copy of Uniprot, both in Turtle syntax. 
The uniref, uniparc and uniprot files from the Bio2RDF set were sliced into smaller chunks so as to have more files to load in parallel; the taxonomy file was as such; and no other Bio2RDF files were loaded. Both experiments ran with 8 load streams, 1 per core. The CPU utilization was mostly between 1400% and 1500%, 14-15 of 16 CPU threads busy. Top load speed for a measurement window of 2 minutes was 383 Kt/s. The index scheme for RDF quads was the default Virtuoso 6 configuration of 5 indices — GS, SP, OP, PSOG, and POGS. (We call this &quot;3+2&quot; indexing, because there are 3 partial and 2 full indices, delivering massive performance benefits over most other index schemes.) IRIs and literals reside in their own tables, each indexed from string to ID and vice versa. A full-text index on literals was not used. Compared to previous performance, we have more than tripled our best single server multi-stream load speed, and multiplied our single stream load speed by a factor of 8. Some further gains may be reached by adjusting thread counts and matching vector sizes to CPU cache. This will be available in a forthcoming release; this is not for download yet. Now that you know this, you may guess what we are doing with queries. More on this another time.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We recently heard that <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id0x21414c58">Oracle</a> 11G loaded <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x28281e50">RDF</a> faster than we did. Now, we never thought the speed of loading a database was as important as the speed of query results, but since this is the <b><i>sole</i></b> area where they have reportedly been tested as faster, we decided it was time loading was addressed. Indeed, without Oracle to challenge us on query performance, we would not be half as good as we are. So, spurred on by the Oracular influence, we did something about our RDF loading.</p>

<p>Performance, I have said before, is a matter of locality and parallelism.  So we applied both to the otherwise quite boring exercise of loading RDF.  The recipe is this: Take a large set of triples; resolve the IRIs and literals into their IDs; then insert each index of the triple table on its own thread.  All the lookups and inserts are first sorted in key order to get the locality.  Running the indices in parallel gets the parallelism.  Then run the parser on its own thread, fetching chunks of consecutive triples and queueing them for a pool of loader threads.  Then run several parsers concurrently on different files so as to make sure there is work enough at all times.  Do not make many more process threads than available <a href="http://dbpedia.org/resource/Central_processing_unit" id="link-id0x30f3b20">CPU</a> threads, since they would just get in each other&#39;s way.</p>
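<p>The shape of this recipe can be sketched in a few lines. This is an illustration only, not the Virtuoso loader: the function names, the in-memory dictionary, and the choice of the two full indices are assumptions made for the example, and Python threads will not give real CPU parallelism because of the interpreter lock. What it does show is the order of operations: resolve terms to IDs in sorted order, then build each index in its own key order on its own thread.</p>

```python
from concurrent.futures import ThreadPoolExecutor

# Each index is a permutation of the quad columns, with a quad held as (S, P, O, G).
# Only the two full indices are modelled here, as an assumption for brevity.
INDEX_KEYS = {
    "PSOG": (1, 0, 2, 3),
    "POGS": (1, 2, 3, 0),
}


def resolve_ids(quads, dictionary):
    """Map every IRI or literal string to an integer ID, assigning new IDs as needed."""
    # Sorting the distinct terms first makes the dictionary lookups arrive in key order.
    for term in sorted({t for q in quads for t in q}):
        dictionary.setdefault(term, len(dictionary) + 1)
    return [tuple(dictionary[t] for t in q) for q in quads]


def build_index(order, id_quads):
    """Produce the rows of one index, sorted in that index's key order."""
    return sorted(tuple(q[i] for i in order) for q in id_quads)


def load_batch(quads, dictionary):
    """Resolve one batch of quads and build every index on its own thread."""
    id_quads = resolve_ids(quads, dictionary)
    with ThreadPoolExecutor(max_workers=len(INDEX_KEYS)) as pool:
        futures = {
            name: pool.submit(build_index, order, id_quads)
            for name, order in INDEX_KEYS.items()
        }
        return {name: f.result() for name, f in futures.items()}


terms = {}
indices = load_batch(
    [("s2", "p1", "o1", "g1"), ("s1", "p1", "o2", "g1")],
    terms,
)
```

<p>A parser thread would feed <code>load_batch</code> with chunks of consecutive triples, and several such pipelines would run on different files at once.</p>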

<p>The whole process is non-transactional, starting from a checkpoint and ending with a checkpoint.</p>

<p>The test system was a dual-Xeon 5520 with 72G RAM. <a href="http://virtuoso.openlinksw.com" id="link-id0x3256138">Virtuoso</a> ran as a single server; no cluster capability was used.</p>

<p>We loaded English <a href="http://dbpedia.org/resource/DBpedia" id="link-id0x33b3e58">Dbpedia</a>, 179M triples,  in 15 minutes, for a rate of 198 Kt/s. Uniprot with 1.33 G triples loaded in 79 minutes, for 279 Kt/s.</p>
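<p>The rates are simply the triple count divided by the elapsed seconds. A small check, assuming the rounded counts and times quoted above; because those inputs are rounded, the result lands within about one percent of the quoted figures rather than matching them exactly.</p>

```python
def load_rate_kt_per_s(triples, minutes):
    """Average load rate in thousands of triples per second."""
    return triples / (minutes * 60) / 1000


# Inputs are the rounded figures from the text, not exact measurements.
dbpedia_rate = load_rate_kt_per_s(179e6, 15)
uniprot_rate = load_rate_kt_per_s(1.33e9, 79)
print(round(dbpedia_rate, 1), round(uniprot_rate, 1))
```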

<p>The source files were the Dbpedia 3.4 English files and the <a href="http://www.bio2rdf.org/" id="link-id0x28266c20">Bio2RDF</a> copy of Uniprot, both in Turtle syntax. The uniref, uniparc, and uniprot files from the Bio2RDF set were sliced into smaller chunks so as to have more files to load in parallel; the taxonomy file was loaded as is; and no other Bio2RDF files were loaded. Both experiments ran with 8 load streams, 1 per core. The CPU utilization was mostly between 1400% and 1500%, i.e., 14-15 of 16 CPU threads busy. Top load speed for a measurement window of 2 minutes was 383 Kt/s.</p>

<p>The index scheme for RDF quads was the default Virtuoso 6 configuration of 5 indices — GS, SP, OP, PSOG, and POGS. (We call this &quot;3+2&quot; indexing, because there are 3 partial and 2 full indices, delivering massive performance benefits over most other index schemes.) IRIs and literals reside in their own tables, each indexed from string to ID and vice versa. A full-text index on literals was not used.</p>

<p>Compared to previous performance, we have more than tripled our best single server multi-stream load speed, and multiplied our single stream load speed by a factor of 8. Some further gains may be reached by adjusting thread counts and matching vector sizes to CPU <a href="http://dbpedia.org/resource/Cache" id="link-id0x20403130">cache</a>.</p>

<p>This will be available in a forthcoming release; this is not for download yet.  Now that you know this, you may guess what we are doing with queries.  More on this another time.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-03-15#1615">
  <rss:title>SemData@Sofia Roundtable write-up</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2010-03-15T14:46:57Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">There was last week an invitation-based roundtable about semantic data management in Sofia, Bulgaria. Lots of smart people together. The meeting was hosted by Ontotext and chaired by Dieter Fensel. On the database side we had Ontotext, SYSTAP (Bigdata), CWI (MonetDB), Karlsruhe Institute of Technology (YARS2/SWSE). LarKC was well represented, being our hosts, with STI, Ontotext, CYC, and VU Amsterdam. Notable absences were Oracle, Garlik, Franz, and Talis. Now of semantic data management... What is the difference between a relational database and a semantic repository, a triple/quad store, a whatever-you-call-them? I had last fall a meeting at CWI with Martin Kersten, Peter Boncz and Lefteris Sidirourgos from CWI, and Frank van Harmelen and Spiros Kotoulas of VU Amsterdam, to start a dialogue between semanticists and databasers. Here we were with many more people trying to discover what the case might be. What are the differences? Michael Stonebraker and Martin Kersten have basically said that what is sauce for the goose is sauce for the gander, and that there is no real difference between relational DB and RDF storage, except maybe for a little tuning in some data structures or parameters. Semantic repository implementors on the other hand say that when they tried putting triples inside an RDB it worked so poorly that they did everything from scratch. (It is a geekly penchant to do things from scratch, but then this is not always unjustified.) OpenLink Software and Virtuoso are in agreement with both sides, contradictory as this might sound. We took our RDBMS and added data types and structures and cost model alterations to an existing platform. Oracle did the same. MonetDB considers doing this and time will tell the extent of their RDF-oriented alterations. Right now the estimate is that this will be small and not in the kernel. 
I would say with confidence that without source code access to the RDB, RDF will not be particularly convenient or efficient to accommodate. With source access, we found that what serves RDB also serves RDF. For example, execution engine and data compression considerations are the same, with minimal tweaks for RDF&#39;s run time typing needs. So now we are founding a platform for continuing this discussion. There will be workshops and calls for papers and the beginnings of a research community. After the initial meeting at CWI, I tried to figure what the difference was between the databaser and semanticist minds. Really, the things are close but there is still a disconnect. Database is about big sets and semantics is about individuals, maybe. The databaser discovers that the operation on each member of the set is not always the same, and the semanticist discovers that the operation on each member of the set is often the same. So the semanticist says that big joins take time. The databaser tells the semanticist not to repeat what&#39;s been obvious for 40 years and for which there is anything from partitioned hashes to merges to various vectored execution models. Not to mention columns. Spiros of VU Amsterdam/LarKC says that map-reduce materializes inferential closure really fast. Lefteris of CWI says that while he is not a semantic person, he does not understand what the point of all this materializing is, nobody is asking the question, right? So why answer? I say that computing inferential closure is a semanticist tradition; this is just what they do. Atanas Kiryakov of Ontotext says that this is not just a tradition whose start and justification is in the forgotten mists of history, but actually a clear and present need; just look at all the joining you would need. Michael Witbrock of CYC says that it is not about forward or backward inference on toy rule sets, but that both will be needed and on massively bigger rule sets at that. 
Further, there can be machine learning to direct the inference, doing the meta-reasoning merged with the reasoning itself. I say that there is nothing wrong with materialization if it is guided by need, in the vein of memo-ization or cracking or recycling as is done in MonetDB. Do the work when it is needed, and do not do it again. Brian Thompson of Systap/Bigdata asks whether it is not a contradiction in terms to both want pluggability and merging inference into the data, like LarKC would be doing. I say that this is difficult but not impossible and that when you run joins in a cluster database, as you decide based on the data where the next join step will be, so it will be with inference. Right there, between join steps, integrated with whatever data partitioning logic you have, for partitioning you will have, data being bigger and bigger. And if you have reuse of intermediates and demand driven indexing à la MonetDB, this too integrates and applies to inference results. So then, LarKC and CYC, can you picture a pluggable inference interface at this level of granularity? So far, I have received some more detail as to the needs of inference and database integration, essentially validating our previous intuitions and plans. Aside talking of inference, we have the more immediate issue of creating an industry out of the semantic data management offerings of today. What do we need for this? We need close-to-parity with relational — doing your warehouse in RDF with the attendant agility thereof can&#39;t cost 10x more to deploy than the equivalent relational solution. We also want to tell the key-value, anti-SQL people, who throw away transactions and queries, that there is a better way. And for this, we need to improve our gig just a little bit. Then you have the union of some level of ACID, at least consistent read, availability, complex query, large scale. And to do this, we need a benchmark. 
It needs a differentiation of online queries and browsing and analytics, graph algorithms and such. We are getting there. We will soon propose a social web benchmark for RDF which has both online and analytical aspects, a data generator, a test driver, and so on, with a TPC-style set of rules. If there is agreement on this, we will all get a few times faster. At this point, RDF will be a lot more competitive with mainstream and we will cross another qualitative threshold.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>There was last week an <a href="http://www.semdata.org/" id="link-id11a83cf98">invitation-based roundtable</a> about semantic <a href="http://dbpedia.org/resource/Data" id="link-id0x1d37f598">data</a> management in <a href="http://www.dbpedia.org/resource/Sofia" id="link-id0x1ba4a208">Sofia, Bulgaria</a>.</p>

<p>Lots of smart people together. The meeting was hosted by <a href="http://dbpedia.org/resource/Ontotext" id="link-id0x1cfc83f8">Ontotext</a> and chaired by <a href="http://www.dbpedia.org/resource/Dieter_Fensel" id="link-id0x1dc6e0d0">Dieter Fensel</a>. On the database side we had Ontotext, <a href="http://www.systap.com/" id="link-id0x1cda77f0">SYSTAP</a> (<a href="http://www.systap.com/bigdata.htm" id="link-id0x1dba6a30">Bigdata</a>), <a href="http://dbpedia.org/resource/National_Research_Institute_for_Mathematics_and_Computer_Science" id="link-id0x1d8e1d88">CWI</a> (<a href="http://dbpedia.org/resource/MonetDB" id="link-id0x1d8cbcf0">MonetDB</a>), <a href="http://www.dbpedia.org/resource/Karlsruhe_Institute_of_Technology" id="link-id0x1e204cb0">Karlsruhe Institute of Technology</a> (YARS2/<a href="http://swse.deri.ie/" id="link-id0x1e653bf0">SWSE</a>). <a href="http://www.larkc.eu/" id="link-id0x1e6a4408">LarKC</a> was well represented, being our hosts, with STI, Ontotext, CYC, and <a href="http://www.vu.nl/" id="link-id0x1c8a6090">VU Amsterdam</a>. Notable absences were <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id0x1e5ab690">Oracle</a>, <a href="http://freebase.com/guid/9202a8c04000641f8000000005c908d6" id="link-id0x1f5e5ff0">Garlik</a>, <a href="http://semanticweb.org/id/Franz_Inc" id="link-id0x1d9c08f0">Franz</a>, and <a href="http://www.talis.com/" id="link-id0x1d338b30">Talis</a>.</p>

<p>Now of semantic data management... What is the difference between a relational database and a semantic repository, a triple/quad store, a whatever-you-call-them?</p>

<p>Last fall I had a meeting at CWI with Martin Kersten, Peter Boncz, and Lefteris Sidirourgos from CWI, and Frank van Harmelen and Spiros Kotoulas of VU Amsterdam, to start a dialogue between semanticists and databasers. Here we were with many more people trying to discover what the case might be. What are the differences?</p>

<p>Michael <a href="http://dbpedia.org/resource/Michael_Stonebraker" id="link-id0x1da55730">Stonebraker</a> and Martin Kersten have basically said that what is sauce for the goose is sauce for the gander, and that there is no real difference between relational DB and <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1d828310">RDF</a> storage, except maybe for a little tuning in some data structures or parameters. Semantic repository implementors on the other hand say that when they tried putting triples inside an RDB it worked so poorly that they did everything from scratch. (It is a geekly penchant to do things from scratch, but then this is not always unjustified.)</p>

<p>
<a href="http://www.openlinksw.com/dataspace/organization/openlink#this" id="link-id0x1cf1e620">OpenLink Software</a> and <a href="http://virtuoso.openlinksw.com" id="link-id0x1cfbc1d8">Virtuoso</a> are in agreement with both sides, contradictory as this might sound. We took our <a href="http://dbpedia.org/resource/Relational_database_management_system" id="link-id0x1e1f6a20">RDBMS</a> and added data types and structures and cost model alterations to an existing platform. Oracle did the same. MonetDB considers doing this and time will tell the extent of their RDF-oriented alterations. Right now the estimate is that this will be small and not in the kernel.</p>

<p>I would say with confidence that without source code access to the RDB, RDF will not be particularly convenient or efficient to accommodate. With source access, we found that what serves RDB also serves RDF. For example, execution engine and data compression considerations are the same, with minimal tweaks for RDF&#39;s run time typing needs.</p>

<p>So now we are founding a platform for continuing this discussion. There will be workshops and calls for papers and the beginnings of a research community.</p>

<p>After the initial meeting at CWI, I tried to figure out what the difference was between the databaser and semanticist minds. Really, the two are close but there is still a disconnect. Database is about big sets and semantics is about individuals, maybe. The databaser discovers that the operation on each member of the set is not always the same, and the semanticist discovers that the operation on each member of the set is often the same.</p>

<p>So the semanticist says that big joins take time. The databaser tells the semanticist not to repeat what&#39;s been obvious for 40 years and for which there is anything from partitioned hashes to merges to various vectored execution models. Not to mention columns.</p>

<p>Spiros of VU Amsterdam/LarKC says that map-reduce materializes inferential closure really fast. Lefteris of CWI says that while he is not a semantic person, he does not understand what the point of all this materializing is; nobody is asking the question, right? So why answer? I say that computing inferential closure is a semanticist tradition; this is just what they do. Atanas Kiryakov of Ontotext says that this is not just a tradition whose start and justification is in the forgotten mists of history, but actually a clear and present need; just look at all the joining you would need.</p>

<p>Michael Witbrock of CYC says that it is not about forward or backward inference on toy rule sets, but that both will be needed and on massively bigger rule sets at that. Further, there can be machine learning to direct the inference, doing the meta-reasoning merged with the reasoning itself.</p>

<p>I say that there is nothing wrong with materialization if it is guided by need, in the vein of memo-ization or cracking or recycling as is done in MonetDB. Do the work when it is needed, and do not do it again.</p>
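<p>As a toy illustration of need-guided materialization, here is a sketch that computes the superclass closure of a class only when somebody asks for it, and then keeps the answer. The class hierarchy is hypothetical and assumed to be acyclic, and the standard library cache stands in for whatever recycling of intermediates a real engine would do; this is not how MonetDB or Virtuoso implement it.</p>

```python
from functools import lru_cache

# Hypothetical toy hierarchy: class -> direct superclasses. Assumed acyclic.
SUBCLASS_OF = {
    "Enzyme": ("Protein",),
    "Protein": ("Molecule",),
    "Molecule": ("Thing",),
}


@lru_cache(maxsize=None)
def superclasses(cls):
    """Return all direct and inferred superclasses of cls, computed on first request."""
    result = set()
    for parent in SUBCLASS_OF.get(cls, ()):
        result.add(parent)
        # Reuses the cached closure of the parent if it was already asked for.
        result |= superclasses(parent)
    return frozenset(result)


first = superclasses("Enzyme")   # does the work
second = superclasses("Enzyme")  # answered from the cache
```

<p>Nothing is inferred for classes nobody queries, and nothing is inferred twice.</p>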

<p>Brian Thompson of Systap/Bigdata asks whether it is not a contradiction in terms to both want pluggability and merging inference into the data, like LarKC would be doing. I say that this is difficult but not impossible and that when you run joins in a cluster database, as you decide based on the data where the next join step will be, so it will be with inference. Right there, between join steps, integrated with whatever data partitioning logic you have, for partitioning you <i>will</i> have, data being bigger and bigger. And if you have reuse of intermediates and demand driven indexing <i>à la</i> MonetDB, this too integrates and applies to inference results.</p>


<p>So then, LarKC and CYC, can you picture a pluggable inference interface at this level of granularity? So far, I have received some more detail as to the needs of inference and database integration, essentially validating our previous intuitions and plans.</p>


<p>Aside from talk of inference, we have the more immediate issue of creating an industry out of the semantic data management offerings of today.</p>

<p>What do we need for this? We need close-to-parity with relational — doing your warehouse in RDF with the attendant agility thereof can&#39;t cost 10x more to deploy than the equivalent relational solution.</p>

<p>We also want to tell the key-value, anti-<a href="http://dbpedia.org/resource/SQL" id="link-id0x172e8c80">SQL</a> people, who throw away transactions and queries, that there is a better way. And for this, we need to improve our gig just a little bit. Then you have the union of some level of <a href="http://dbpedia.org/resource/ACID" id="link-id0x1e0de2e8">ACID</a>, at least consistent read, availability, complex query, large scale.</p>

<p>And to do this, we need a benchmark. It needs a differentiation of online queries and browsing and analytics, graph algorithms and such. We are getting there. We will soon propose a social web benchmark for RDF which has both online and analytical aspects, a data generator, a test driver, and so on, with a <a href="http://www.tpc.org/" id="link-id0x1e3cb130">TPC</a>-style set of rules. If there is agreement on this, we will all get a few times faster. At this point, RDF will be a lot more competitive with mainstream and we will cross another qualitative threshold. </p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-02-12#1607">
  <rss:title>Compare &amp; Contrast: SQL Server&#39;s Linked Server vs Virtuoso&#39;s Virtual Database Layer</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2010-02-12T21:44:10Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Microsoft SQL Server&#39;s Linked Server Promise The ability to use distributed queries -- i.e., to issue SQL queries against any OLE-DB-accessible back end -- via Linked Servers. The promise fails to materialize, primarily because while there are several ways of issuing such distributed queries, none of them work with all data access providers, and even for those that do, results received via different methods may differ. Compounding the issue, there are specific configuration options which must be set correctly, often differing from defaults, to permit such things as &quot;ad-hoc distributed queries&quot;. Common tools that are typically used with such Linked Servers include SSIS and DTS. Such generic tools typically rely on four-part naming for their queries, expecting SQL Server to properly rewrite remotely executed queries for the DBMS engine which ultimately executes them. The most common cause of failure is that when SQL Server rewrites a query, it typically does so using SQL-92 syntax, regardless of the back-end&#39;s abilities, and using the Transact-SQL dialect for implementation-specific query syntaxes, regardless of the back-end&#39;s dialect. This leads to problems especially when the Linked Server is an older variant which doesn&#39;t support SQL-92 (e.g., Progress 8.x or earlier, Informix 7 or earlier), or which SQL dialect differs substantially from Transact-SQL (e.g., Informix, Progress, MySQL, etc.). Basic Four-Part Naming SELECT *   FROM linked_server.[catalog].[schema].object Four-part naming presumes that you have pre-defined a Linked Server, and executes the query on SQL Server. SQL Server decides what if any sub- or partial-queries to execute on the linked server, tends not to use appropriate syntax for these, and usually does not take advantage of linked server or provider features. 
OpenQuery SELECT *   FROM OPENQUERY ( linked_server , &#39;query&#39; ) OpenQuery also presumes that you have pre-defined a Linked Server, but executes the query as a &quot;pass-through&quot;, handing it directly to the remote provider. Features of the remote server and the data access provider may be taken advantage of, but only if the query author knows about them. From the product docs: SQL Server&#39;s Linked Server extension executes the specified pass-through query on the specified linked server. This server is an OLE DB data source. OPENQUERY can be referenced in the FROM clause of a query as if it were a table name. OPENQUERY can also be referenced as the target table of an INSERT, UPDATE, or DELETE statement. This is subject to the capabilities of the OLE DB provider. Although the query may return multiple result sets, OPENQUERY returns only the first one. ... OPENQUERY does not accept variables for its arguments. OPENQUERY cannot be used to execute extended stored procedures on a linked server. However, an extended stored procedure can be executed on a linked server by using a four-part name. OpenRowset SELECT *   FROM OPENROWSET     ( &#39;provider_name&#39; ,       &#39;datasource&#39; ; &#39;user_id&#39; ; &#39;password&#39;,       { [ catalog. ] [ schema. ] object | &#39;query&#39; }    ) OpenRowset does not require a pre-defined Linked Server, but does require the user to know what data access providers are available on the SQL Server host, and how to manually construct a valid connection string for the chosen provider. It does permit both &quot;pass-through&quot; and &quot;local execution&quot; queries, which can lead to confusion when the results differ (as they regularly will). More from product docs: Includes all connection information that is required to access remote data from an OLE DB data source. 
This method is an alternative to accessing tables in a linked server and is a one-time, ad hoc method of connecting and accessing remote data by using OLE DB. For more frequent references to OLE DB data sources, use linked servers instead. For more information, see Linking Servers. The OPENROWSET function can be referenced in the FROM clause of a query as if it were a table name. The OPENROWSET function can also be referenced as the target table of an INSERT, UPDATE, or DELETE statement, subject to the capabilities of the OLE DB provider. Although the query might return multiple result sets, OPENROWSET returns only the first one. OPENROWSET also supports bulk operations through a built-in BULK provider that enables data from a file to be read and returned as a rowset. ... OPENROWSET can be used to access remote data from OLE DB data sources only when the DisallowAdhocAccess registry option is explicitly set to 0 for the specified provider, and the Ad Hoc Distributed Queries advanced configuration option is enabled. When these options are not set, the default behavior does not allow for ad hoc access. When accessing remote OLE DB data sources, the login identity of trusted connections is not automatically delegated from the server on which the client is connected to the server that is being queried. Authentication delegation must be configured. For more information, see Configuring Linked Servers for Delegation. Catalog and schema names are required if the OLE DB provider supports multiple catalogs and schemas in the specified data source. Values for catalog and schema can be omitted when the OLE DB provider does not support them. If the provider supports only schema names, a two-part name of the form schema.object must be specified. If the provider supports only catalog names, a three-part name of the form catalog.schema.object must be specified. Three-part names must be specified for pass-through queries that use the SQL Server Native Client OLE DB provider. 
For more information, see Transact-SQL Syntax Conventions (Transact-SQL). OPENROWSET does not accept variables for its arguments. OpenDataSource SELECT *   FROM OPENDATASOURCE    ( &#39;provider_name&#39;,      &#39;provider_specific_datasource_specification&#39;    ).[catalog].[schema].object As with basic four-part naming, OpenDataSource executes the query on SQL Server. SQL Server decides which sub-queries, if any, to execute on the remote data source, tends not to use appropriate syntax for these, and usually does not take advantage of remote server or provider features. Additional doc excerpts Provides ad hoc connection information as part of a four-part object name without using a linked server name. ... OPENDATASOURCE can be used to access remote data from OLE DB data sources only when the DisallowAdhocAccess registry option is explicitly set to 0 for the specified provider, and the Ad Hoc Distributed Queries advanced configuration option is enabled. When these options are not set, the default behavior does not allow for ad hoc access. The OPENDATASOURCE function can be used in the same Transact-SQL syntax locations as a linked-server name. Therefore, OPENDATASOURCE can be used as the first part of a four-part name that refers to a table or view name in a SELECT, INSERT, UPDATE, or DELETE statement, or to a remote stored procedure in an EXECUTE statement. When executing remote stored procedures, OPENDATASOURCE should refer to another instance of SQL Server. OPENDATASOURCE does not accept variables for its arguments. Like the OPENROWSET function, OPENDATASOURCE should only reference OLE DB data sources that are accessed infrequently. Define a linked server for any data sources accessed more than several times. Neither OPENDATASOURCE nor OPENROWSET provide all the functionality of linked-server definitions, such as security management and the ability to query catalog information. 
All connection information, including passwords, must be provided every time that OPENDATASOURCE is called. Virtuoso&#39;s Virtual Database Promise &amp; Deliverables The ability to link objects (tables, views, stored procedures) from any ODBC-accessible data source. This includes any JDBC-accessible data source, through the OpenLink ODBC Driver for JDBC Data Sources. There are no limitations on the data types which can be queried or read, nor must the target DBMS have primary keys set on linked tables or views. All linked objects may be used in single-site or distributed queries, and the user need not know anything about the actual data structure, including whether the objects being queried are remote or local to Virtuoso -- all objects are made to appear as part of a Virtuoso-local schema.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<h2>
<a href="http://dbpedia.org/resource/Microsoft" id="link-id166785f0">Microsoft</a> <a href="http://dbpedia.org/resource/SQL" id="link-id169b6bb8">SQL</a> <a href="http://dbpedia.org/resource/Microsoft_SQL_Server" id="link-id163b8350">Server</a>&#39;s Linked Server Promise</h2>
<p>The ability to use distributed queries -- i.e., to issue SQL queries against any OLE-DB-accessible back end -- via Linked Servers.</p>
<p>The promise fails to materialize, primarily because while there are several ways of issuing such distributed queries, none of them work with all <a href="http://dbpedia.org/resource/Data" id="link-id1675e128">data</a> access providers, and even for those that do, results received via different methods may differ.</p>
<p>Compounding the issue, there are specific configuration options which must be set correctly, often differing from defaults, to permit such things as &quot;ad-hoc distributed queries&quot;.</p>
<p>Tools commonly used with such Linked Servers include SSIS and DTS. Such generic tools typically rely on four-part naming for their queries, expecting SQL Server to properly rewrite remotely executed queries for the DBMS engine which ultimately executes them.</p>
<p>The most common cause of failure is that when SQL Server rewrites a query, it typically does so using SQL-92 syntax, regardless of the back-end&#39;s abilities, and using the Transact-SQL dialect for implementation-specific query syntaxes, regardless of the back-end&#39;s dialect. This leads to problems especially when the Linked Server is an older DBMS version which doesn&#39;t support SQL-92 (e.g., Progress 8.x or earlier, <a href="http://dbpedia.org/resource/IBM_Informix" id="link-id167f6fa0">Informix</a> 7 or earlier), or one whose SQL dialect differs substantially from Transact-SQL (e.g., Informix, Progress, <a href="http://dbpedia.org/resource/MySQL" id="link-id166c7848">MySQL</a>).</p>
<h3>Basic Four-Part Naming</h3>
<blockquote>
<code>SELECT * <br />  FROM linked_server.[catalog].[<a href="http://dbpedia.org/resource/Database_schema" id="link-id163c3f78">schema</a>].object</code>
</blockquote>
<p>Four-part naming presumes that you have pre-defined a Linked Server, and executes the query on SQL Server. SQL Server decides which sub-queries or partial queries, if any, to execute on the linked server, tends not to use appropriate syntax for these, and usually does not take advantage of linked server or provider features.</p>
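<p>As a minimal sketch of the setup this form presumes, a Linked Server over an ODBC DSN could be registered and then queried as follows. The names <code>REMOTE_SRV</code>, <code>remote_dsn</code>, and <code>sales.dbo.customers</code> are hypothetical; <code>MSDASQL</code> is the Microsoft OLE DB Provider for ODBC, and the right catalog and schema parts depend on what the provider exposes.</p>
<blockquote>
<pre><code>-- Register a Linked Server on top of an existing ODBC DSN (names are placeholders).
EXEC sp_addlinkedserver
     @server     = N'REMOTE_SRV',
     @srvproduct = N'',
     @provider   = N'MSDASQL',
     @datasrc    = N'remote_dsn';

-- Four-part name: SQL Server plans the query and decides what, if anything,
-- is sent to the remote DBMS. The WHERE clause may or may not be pushed down.
SELECT *
  FROM REMOTE_SRV.[sales].[dbo].customers
 WHERE country = 'FI';</code></pre>
</blockquote>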
<h3>OpenQuery</h3>
<blockquote>
<code>SELECT * <br />  FROM OPENQUERY ( linked_server , &#39;query&#39; )</code>
</blockquote>
<p>OpenQuery also presumes that you have pre-defined a Linked Server, but executes the query as a &quot;pass-through&quot;, handing it directly to the remote provider. Features of the remote server and the data access provider may be taken advantage of, but only if the query author knows about them.</p>
<h4>From the product docs:</h4>
<blockquote>
<p>
<i>SQL Server&#39;s Linked Server extension executes the specified pass-through query on the specified linked server. This server is an OLE DB data source. <code>OPENQUERY</code> can be referenced in the <code>FROM</code> clause of a query as if it were a table name. <code>OPENQUERY</code> can also be referenced as the target table of an <code>INSERT</code>, <code>UPDATE</code>, or <code>DELETE</code> statement. This is subject to the capabilities of the OLE DB provider. Although the query may return multiple result sets, <code>OPENQUERY</code> returns only the first one.</i>
</p>
<p>
<i>...</i>
</p>
<p>
<i><code>OPENQUERY</code> does not accept variables for its arguments. <code>OPENQUERY</code> cannot be used to execute extended stored procedures on a linked server. However, an extended stored procedure can be executed on a linked server by using a four-part name. </i>
</p>
</blockquote>
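<p>A common workaround for the restriction on variables is to assemble the whole statement as a string and run it dynamically. The sketch below assumes a hypothetical Linked Server <code>REMOTE_SRV</code> and table <code>customers</code>; it is one possible pattern, not something prescribed by the product docs quoted above.</p>
<blockquote>
<pre><code>DECLARE @country NCHAR(2);
DECLARE @remote  NVARCHAR(4000);
DECLARE @sql     NVARCHAR(4000);

SET @country = N'FI';

-- The query in the remote dialect, with the value spliced in.
-- Only splice trusted values: nothing here escapes quotes inside @country.
SET @remote = N'SELECT id, name FROM customers WHERE country = '''
            + @country + N'''';

-- Double the embedded quotes, then wrap the text in an OPENQUERY call.
SET @sql = N'SELECT * FROM OPENQUERY ( REMOTE_SRV, '''
         + REPLACE ( @remote, N'''', N'''''' )
         + N''' )';

EXEC (@sql);</code></pre>
</blockquote>
<p>The text inside <code>@remote</code> is passed through unchanged, so it must be written in the syntax of the remote DBMS, not in Transact-SQL.</p>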
<h3>OpenRowset</h3>
<blockquote>
<code>SELECT * 
<br />  FROM OPENROWSET
<br />    ( &#39;provider_name&#39; , <br />      &#39;datasource&#39; ; &#39;user_id&#39; ; &#39;password&#39;, <br />      { [ catalog. ] [ schema. ] object | &#39;query&#39; }<br />    )</code>
</blockquote>
<p>
<code>OpenRowset</code> does not require a pre-defined Linked Server, but does require the user to know what data access providers are available on the SQL Server host, and how to manually construct a valid connection string for the chosen provider. It does permit both &quot;pass-through&quot; and &quot;local execution&quot; queries, which can lead to confusion when the results differ (as they regularly will).</p>
<h4>More from product docs:</h4>
<blockquote>
<p>
<i>Includes all connection <a href="http://dbpedia.org/resource/Information" id="link-id163ab840">information</a> that is required to access remote data from an OLE DB data source. This method is an alternative to accessing tables in a linked server and is a one-time, ad hoc method of connecting and accessing remote data by using OLE DB. For more frequent references to OLE DB data sources, use linked servers instead. For more information, see Linking Servers. The <code>OPENROWSET</code> function can be referenced in the <code>FROM</code> clause of a query as if it were a table name. The <code>OPENROWSET</code> function can also be referenced as the target table of an <code>INSERT</code>, <code>UPDATE</code>, or <code>DELETE</code> statement, subject to the capabilities of the OLE DB provider. Although the query might return multiple result sets, <code>OPENROWSET</code> returns only the first one.</i>
</p>
<p>
<i>OPENROWSET also supports bulk operations through a built-in <code>BULK</code> provider that enables data from a file to be read and returned as a rowset.</i>
</p>
<p>
<i>...</i>
</p>
<p>
<i><code>OPENROWSET</code> can be used to access remote data from OLE DB data sources only when the <code>DisallowAdhocAccess</code> registry option is explicitly set to <code>0</code> for the specified provider, and the <code>Ad Hoc Distributed Queries</code> advanced configuration option is enabled. When these options are not set, the default behavior does not allow for ad hoc access. When accessing remote OLE DB data sources, the login identity of trusted connections is not automatically delegated from the server on which the client is connected to the server that is being queried. Authentication delegation must be configured. For more information, see Configuring Linked Servers for Delegation.</i>
</p>
<p>
<i>Catalog and schema names are required if the OLE DB provider supports multiple catalogs and schemas in the specified data source. Values for catalog and schema can be omitted when the OLE DB provider does not support them. If the provider supports only schema names, a two-part name of the form <code>schema.object</code> must be specified. If the provider supports only catalog names, a three-part name of the form <code>catalog.schema.object</code> must be specified. Three-part names must be specified for pass-through queries that use the SQL Server Native Client OLE DB provider. For more information, see Transact-SQL Syntax Conventions (Transact-SQL). <code>OPENROWSET</code> does not accept variables for its arguments.</i>
</p>
</blockquote>
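<p>To make the two modes concrete, here is a hedged sketch that first enables the server-wide option named in the excerpt and then issues the same request both ways. The DSN, credentials, and object names are hypothetical, and the per-provider <code>DisallowAdhocAccess</code> setting is not covered here.</p>
<blockquote>
<pre><code>-- Enable ad hoc distributed queries (server-wide; disabled by default).
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'Ad Hoc Distributed Queries', 1;
RECONFIGURE;

-- Pass-through form: the quoted query is executed by the remote data source.
SELECT r.*
  FROM OPENROWSET
    ( 'MSDASQL',
      'remote_dsn' ; 'demo_user' ; 'demo_password',
      'SELECT id, name FROM customers'
    ) AS r;

-- Object form: SQL Server constructs the remote query itself.
SELECT r.*
  FROM OPENROWSET
    ( 'MSDASQL',
      'remote_dsn' ; 'demo_user' ; 'demo_password',
      sales.dbo.customers
    ) AS r;</code></pre>
</blockquote>
<p>Running both forms against the same object is a quick way to spot the differences in results mentioned above.</p>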
<h3>OpenDataSource</h3>
<blockquote>
<code>SELECT * <br />  FROM OPENDATASOURCE<br />    ( &#39;provider_name&#39;,<br />      &#39;provider_specific_datasource_specification&#39;<br />    ).[catalog].[schema].object</code>
</blockquote>
<p>As with basic four-part naming, <code>OpenDataSource</code> executes the query on SQL Server. SQL Server decides which sub-queries, if any, to execute on the remote data source, tends not to use appropriate syntax for these, and usually does not take advantage of remote server or provider features.</p>
<h4>Additional doc excerpts</h4>
<blockquote>
<p>
<i>Provides ad hoc connection information as part of a four-part object name without using a linked server name.</i>
</p>
<p>
<i>...</i>
</p>
<p>
<i><code>OPENDATASOURCE</code> can be used to access remote data from OLE DB data sources only when the <code>DisallowAdhocAccess</code> registry option is explicitly set to <code>0</code> for the specified provider, and the <code>Ad Hoc Distributed Queries</code> advanced configuration option is enabled. When these options are not set, the default behavior does not allow for ad hoc access.</i>
</p>
<p>
<i>The <code>OPENDATASOURCE</code> function can be used in the same Transact-SQL syntax locations as a linked-server name. Therefore, <code>OPENDATASOURCE</code> can be used as the first part of a four-part name that refers to a table or view name in a <code>SELECT</code>, <code>INSERT</code>, <code>UPDATE</code>, or <code>DELETE</code> statement, or to a remote stored procedure in an <code>EXECUTE</code> statement. When executing remote stored procedures, <code>OPENDATASOURCE</code> should refer to another instance of SQL Server. <code>OPENDATASOURCE</code> does not accept variables for its arguments.</i>
</p>
<p>
<i>Like the <code>OPENROWSET</code> function, <code>OPENDATASOURCE</code> should only reference OLE DB data sources that are accessed infrequently. Define a linked server for any data sources accessed more than several times. Neither <code>OPENDATASOURCE</code> nor <code>OPENROWSET</code> provide all the functionality of linked-server definitions, such as security management and the ability to query catalog information. All connection information, including passwords, must be provided every time that <code>OPENDATASOURCE</code> is called.</i>
</p>
</blockquote>
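<p>A filled-in instance of the template might look like the sketch below. The provider name <code>SQLNCLI</code> (SQL Server Native Client), the host <code>remote_host</code>, and the <code>sales.dbo.customers</code> object are assumptions chosen for illustration; the connection string format is specific to each provider.</p>
<blockquote>
<pre><code>-- Requires the same Ad Hoc Distributed Queries option as OPENROWSET.
-- With SQL authentication, the user ID and password would have to appear
-- in the connection string on every call.
SELECT *
  FROM OPENDATASOURCE
    ( 'SQLNCLI',
      'Data Source=remote_host;Integrated Security=SSPI'
    ).[sales].[dbo].customers;</code></pre>
</blockquote>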
<h2>
<a href="http://virtuoso.openlinksw.com" id="link-id122c66b8">Virtuoso</a>&#39;s <a href="http://dbpedia.org/resource/Virtual_Database" id="link-id167af7d8">Virtual Database</a> Promise &amp; Deliverables</h2> 
<p>The ability to link objects (tables, views, stored procedures) from any <a href="http://dbpedia.org/resource/Open_Database_Connectivity" id="link-id1394ab90">ODBC</a>-accessible data source. This includes any <a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id11c38748">JDBC</a>-accessible data source, through the OpenLink ODBC Driver for JDBC Data Sources.</p>
<p>There are no limitations on the data types which can be queried or read, nor must the target DBMS have primary keys set on linked tables or views.</p>
<p>All linked objects may be used in single-site or distributed queries, and the user need not know anything about the actual data structure, including whether the objects being queried are remote or local to Virtuoso -- all objects are made to appear as part of a Virtuoso-local schema.</p>
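<p>As a rough sketch of what linking looks like in practice, Virtuoso&#39;s <code>ATTACH TABLE</code> statement maps a remote table to a local name. The DSN <code>remote_dsn</code>, the credentials, and the table names are hypothetical, and identifier case may need adjusting for the remote DBMS.</p>
<blockquote>
<pre><code>-- Link a remote table into a Virtuoso-local schema.
ATTACH TABLE "sales"."customers"
          AS "DB"."DBA"."customers"
        FROM 'remote_dsn'
        USER 'demo_user' PASSWORD 'demo_password';

-- From here on it is queried like any local table.
SELECT COUNT (*)
  FROM "DB"."DBA"."customers";</code></pre>
</blockquote>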

]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2010-02-12#1606">
  <rss:title>Compare &amp; Contrast: Oracle Heterogeneous Services (HSODBC, DG4ODBC) vs Virtuoso&#39;s Virtual Database Layer</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2010-02-12T21:43:51Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Oracle Gateway Promise Ability to use distributed queries over a generic connectivity gateway (HSODBC, DG4ODBC) -- i.e., to issue SQL queries against any ODBC- or OLE-DB-accessible linked back end. Reality Promise fails to materialize for several reasons. Immediate limitations include: All tables locked by a FOR UPDATE clause and all tables with LONG columns selected by the query must be located in the same external database. Distributed queries cannot select user-defined types or object REF datatypes on remote tables. In addition to the above, which apply to database-specific heterogeneous environments, the database-agnostic generic connectivity components have the following limitations: A table including a BLOB column must have a separate column that serves as a primary key. BLOB and CLOB data cannot be read by passthrough queries. Updates or deletes that include unsupported functions within a WHERE clause are not allowed. Generic Connectivity does not support stored procedures. Generic Connectivity agents cannot participate in distributed transactions; they support single-site transactions only. Generic Connectivity does not support multithreaded agents. Updating LONG columns with bind variables is not supported. Generic Connectivity does not support ROWIDs. Compounding the issue, the HSODBC and DG4ODBC generic connectivity agents perform many of their functions by brute-force methods. Rather than interrogating the data access provider (whether ODBC or OLE DB) or DBMS to which they are connected to learn its capabilities, they do many things using only the most basic functionality. For instance, when a SELECT COUNT (*) FROM table@link is issued through Oracle SQL, the agent doesn&#39;t simply have the target DBMS perform a SELECT COUNT (*) FROM table. 
Rather, it issues a SELECT * FROM table, which is used to inventory all columns in the table, and then issues and fully retrieves SELECT field FROM table into an internal temporary table, where Oracle does the COUNT (*) itself, locally. Testing has confirmed this behavior, despite Oracle documentation stating that target data sources must support COUNT (*) (among other functions). Virtuoso&#39;s Virtual Database Comparison The Virtuoso Universal Server will link/attach objects (tables, views, stored procedures) from any ODBC-accessible data source. This includes any JDBC-accessible data source, through the OpenLink ODBC Driver for JDBC Data Sources. There are no limitations on the data types which can be queried or read, nor must the target DBMS have primary keys set on linked tables or views. All linked objects may be used in single-site or distributed queries, and the user need not know anything about the actual data structure, including whether the objects being queried are remote or local to Virtuoso -- all objects are made to appear as part of a Virtuoso-local schema.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<h3>
<a href="http://dbpedia.org/resource/Oracle_Database" id="link-id12349be8">Oracle</a> Gateway Promise</h3>
<p>Ability to use distributed queries over a generic connectivity gateway (HSODBC, DG4ODBC) -- i.e., to issue <a href="http://dbpedia.org/resource/SQL" id="link-id167e5760">SQL</a> queries against any <a href="http://dbpedia.org/resource/Open_Database_Connectivity" id="link-id13c6bfa0">ODBC</a>- or OLE-DB-accessible linked back end.</p>
<h3>Reality</h3>
<p>Promise fails to materialize for several reasons. Immediate limitations include:</p>
<ul>
<li>All tables locked by a <code>FOR UPDATE</code> clause and all tables with <code>LONG</code> columns selected by the query must be located in the same external database.</li>
<li>Distributed queries cannot select user-defined types or object <code>REF</code> datatypes on remote tables.</li>
</ul>
<p>In addition to the above, which apply to database-specific heterogeneous environments, the database-agnostic generic connectivity components have the following limitations:</p>
<ul>
<li>A table including a <code>BLOB</code> column must have a separate column that serves as a primary key.</li>
<li>
  <code>BLOB</code> and <code>CLOB</code> <a href="http://dbpedia.org/resource/Data" id="link-id163e07f0">data</a> cannot be read by passthrough queries.</li>
<li>Updates or deletes that include unsupported functions within a <code>WHERE</code> clause are not allowed.</li>
<li>Generic Connectivity does not support stored procedures.</li>
<li>Generic Connectivity agents cannot participate in distributed transactions; they support single-site transactions only.</li>
<li>Generic Connectivity does not support multithreaded agents.</li>
<li>Updating <code>LONG</code> columns with bind variables is not supported.</li>
<li>Generic Connectivity does not support <code>ROWID</code>s.</li>
</ul>
<p>Compounding the issue, the HSODBC and DG4ODBC generic connectivity agents perform many of their functions by brute-force methods. Rather than interrogating the data access provider (whether ODBC or OLE DB) or DBMS to which they are connected to learn its capabilities, they do many things using only the most basic functionality.</p>
<p>For instance, when a <code>SELECT COUNT (*) FROM table@link</code> is issued through Oracle SQL, the agent doesn&#39;t simply have the target DBMS perform a <code>SELECT COUNT (*) FROM table</code>.  Rather, it issues a <code>SELECT * FROM table</code>, which is used to inventory all columns in the table, and then issues and fully retrieves <code>SELECT field FROM table</code> into an internal temporary table, where Oracle does the <code>COUNT (*)</code> itself, locally. Testing has confirmed this behavior, despite Oracle documentation stating that target data sources must support <code>COUNT (*)</code> (among other functions).</p>
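<p>The sequence can be sketched as follows. The link name <code>remote_link</code>, the TNS alias <code>hsodbc_alias</code>, the credentials, and the <code>customers</code> table with its <code>id</code> column are all hypothetical; the commented statements restate the behavior described above and are not captured trace output.</p>
<blockquote>
<pre><code>-- Database link that routes through the generic connectivity agent.
CREATE DATABASE LINK remote_link
  CONNECT TO "demo_user" IDENTIFIED BY "demo_password"
  USING 'hsodbc_alias';

-- What the Oracle user writes:
SELECT COUNT (*) FROM customers@remote_link;

-- What the target DBMS is reported to receive instead:
--   SELECT * FROM customers     (to inventory the columns)
--   SELECT id FROM customers    (every row is fetched; Oracle counts locally)</code></pre>
</blockquote>
<p>On a large remote table, the cost therefore grows with the row count rather than staying close to that of a single aggregate.</p>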
<h3>
<a href="http://virtuoso.openlinksw.com" id="link-id16814bd8">Virtuoso</a>&#39;s <a href="http://dbpedia.org/resource/Virtual_Database" id="link-id1185b9d0">Virtual Database</a> Comparison</h3>
<p>The Virtuoso <a href="http://dbpedia.org/resource/Virtuoso_Universal_Server" id="link-id1666f658">Universal Server</a> will link/attach objects (tables, views, stored procedures) from any ODBC-accessible data source. This includes any <a href="http://dbpedia.org/resource/Java_Database_Connectivity" id="link-id1668aec8">JDBC</a>-accessible data source, through the OpenLink ODBC Driver for JDBC Data Sources.</p>
<p>There are no limitations on the data types which can be queried or read, nor must the target DBMS have primary keys set on linked tables or views.</p>
<p>All linked objects may be used in single-site or distributed queries, and the user need not know anything about the actual data structure, including whether the objects being queried are remote or local to Virtuoso -- all objects are made to appear as part of a Virtuoso-local <a href="http://dbpedia.org/resource/Database_schema" id="link-id1628c438">schema</a>.</p>
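<p>A distributed query over two linked sources could then be written as in this sketch. The DSNs <code>crm_dsn</code> and <code>erp_dsn</code>, the credentials, and the table and column names are hypothetical.</p>
<blockquote>
<pre><code>-- Attach one table from each of two different remote data sources.
ATTACH TABLE "customers" AS "DB"."DBA"."customers"
  FROM 'crm_dsn' USER 'demo_user' PASSWORD 'demo_password';

ATTACH TABLE "orders" AS "DB"."DBA"."orders"
  FROM 'erp_dsn' USER 'demo_user' PASSWORD 'demo_password';

-- The join is written against local names only; nothing in the query
-- text reveals that the two tables live in different databases.
SELECT c.name, COUNT (*) AS order_count
  FROM "DB"."DBA"."customers" c
  JOIN "DB"."DBA"."orders" o
    ON o.customer_id = c.id
 GROUP BY c.name;</code></pre>
</blockquote>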
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-11-11#1588">
  <rss:title>RDF Geography With Virtuoso</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-11-11T17:17:27Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We have just added a geometry data type and corresponding R-tree index to Virtuoso. This follows the general scheme of SQL/MM, as is implemented by PostGIS and many others. We have all the engine-side stuff, including optimizer support for geometry cardinality sampling and good execution plans for combinations of spatial and other joins. We have however not yet implemented all the different geometry types and library function support for them, like shortest distance between two arbitrary shapes. The geometry support is for both SQL and SPARQL. On the SQL side, it works with the ISO/IEC 13249 SQL/MM API; with RDF, a geometry can occur as the object of a quad. If the object is a typed-literal of the virtrdf:Geometry type, it gets indexed in a geometry index over all geometries in quads; no special declarations are needed. After this, SQL/MM predicates and functions can be used with SPARQL, like this: PREFIX geo: &lt;http://www.w3.org/2003/01/geo/wgs84_pos#&gt; SELECT ?class COUNT (*) WHERE { ?m geo:geometry ?geo . ?m a ?class . FILTER ( &lt;bif:st_intersects&gt; ( ?geo, &lt;bif:st_point&gt; (0, 52), 100 ) ) } GROUP BY ?class ORDER BY DESC 2 This returns the counts of objects of each class occurring within 100 km of (0, 52), a point near London. For any data set with WGS 84 geo:long and geo:lat values, a simple SQL function makes a point geometry for each such coordinate pair and adds it as the geo:geometry property of the subject with the long/lat. This then enables fast spatial access to arbitrary location data in RDF. Right now, we hardly see any geometries other than points in RDF data, even though there are some efforts for vocabularies for more complex entities. As these get adopted we will support them. For scalability, we tried the implementation with OpenStreetMap&#39;s 350 million or so points. 
The geometry implementation partitions well over a cluster, similarly to a full text index, i.e., every server has its slice of the geometries, partitioned by the geometry object&#39;s key, thus not by range of coordinates or such. Like this, the items are evenly spread even though the coordinate distribution is highly uneven. We can do spatial joins like — SELECT ?s ( &lt;sql:num_or_null&gt; (?p) ) COUNT (*) WHERE { ?s &lt;http://dbpedia.org/ontology/populationTotal&gt; ?p . FILTER ( &lt;sql:num_or_null&gt; (?p) &gt; 1000000 ) . ?s geo:geometry ?geo . FILTER ( &lt;bif:st_intersects&gt; ( ?pt, ?geo, 5 ) ) . ?xx geo:geometry ?pt } GROUP BY ?s ( &lt;sql:num_or_null&gt; (?p) ) ORDER BY DESC 3 LIMIT 20 This takes the DBpedia subjects that have a population over 1 million and a geometry. We then count all the geometries within 5 km of the point location of the first geometry. With DBpedia (about 5 million points), GeoNames (7 million points), and OpenStreetMap (350 million points), we get the result: http://dbpedia.org/resource/Munich 1356594 117280 http://dbpedia.org/resource/London 7355400 81486 http://dbpedia.org/resource/Davao_City 1363337 58640 http://dbpedia.org/resource/Belo_Horizonte 2412937 58640 http://dbpedia.org/resource/Chengde 3610000 58640 http://dbpedia.org/resource/Hamburg 1769117 51664 http://dbpedia.org/resource/San_Diego%2C_California 1266731 47685 http://dbpedia.org/resource/Bursa 1562828 47685 http://dbpedia.org/resource/Port-au-Prince 1082800 47685 http://dbpedia.org/resource/Oakland_County%2C_Michigan 1194156 45636 http://dbpedia.org/resource/Sana%27a 1747627 40923 http://dbpedia.org/resource/Milan 1303437 40923 http://dbpedia.org/resource/Campinas 1059420 40923 http://dbpedia.org/resource/Hohhot 2580000 40923 http://dbpedia.org/resource/Brussels 1031215 40923 http://dbpedia.org/resource/Bogra_District 2988567 40923 http://dbpedia.org/resource/Cort%C3%A9s_Department 1202510 40923 http://dbpedia.org/resource/Berlin 3416300 35668 
http://dbpedia.org/resource/New_York_City 8274527 30810 http://dbpedia.org/resource/Los_Angeles%2C_California 3849378 25614 20 Rows. -- 1733 msec. Cluster 8 nodes, 1 s. 358 m/s 1596 KB/s 664% cpu 2% read 16% clw threads 1r 0w 0i buffers 1124351 0 d 0 w 0 pfs This takes 1.7 seconds on a Virtuoso Cluster configured with 8 processes on a single dual-Xeon 5520 box, running at about 664% CPU with warm cache. Fair enough for a first crack, this can obviously be optimized further. Still, the geo part of the processing is already as good as instantaneous. We will shortly have the geography features installed on DBpedia and the other data sets we host. As these come online we will show more demo queries. For more about SQL/MM, you can look to a couple of PDFs: SQL/MM Spatial: The Standard to Manage Spatial Data in Relational Database Systems by Knut Stolze SQL Multimedia and Application Packages (SQL/MM) by Jim Melton and Andrew Eisenberg</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We have just added a geometry <a href="http://dbpedia.org/resource/Data" id="link-id0x1c0e02b0">data</a> type and corresponding <a href="http://dbpedia.org/resource/R-tree" id="link-id0x1e093220">R</a>-tree index to <a href="http://virtuoso.openlinksw.com" id="link-id0x1ddccfe8">Virtuoso</a>.  This follows the general scheme of <a href="http://dbpedia.org/resource/SQL" id="link-id0x1b88a580">SQL</a>/MM, as is implemented by <a href="http://dbpedia.org/resource/PostGIS" id="link-id0x1d271a90">PostGIS</a> and many others.  We have all the engine-side stuff, including optimizer support for geometry cardinality sampling and good execution plans for combinations of spatial and other joins.  We have however not yet implemented all the different geometry types and library function support for them, like shortest distance between two arbitrary shapes.</p>

<p>The geometry support is for both SQL and <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1b8d4ca8">SPARQL</a>.  On the SQL side, it works with the ISO/IEC 13249 SQL/MM API; with <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1ed69318">RDF</a>, a geometry can occur as the object of a quad.  If the object is a typed-literal of the <code>virtrdf:Geometry</code> type, it gets indexed in a geometry index over all geometries in quads; no special declarations are needed.  After this, SQL/MM predicates and functions can be used with SPARQL, like this:</p>

<blockquote>
 <pre><code>  PREFIX  geo:  &lt;<a href="http://dbpedia.org/resource/Hypertext_Transfer_Protocol" id="link-id0x1d2d0ae0">http</a>://www.w3.org/2003/01/geo/wgs84_pos#&gt;  
  SELECT  ?class
          COUNT (*) 
   WHERE  { ?m  geo:geometry  ?geo    . 
            ?m  a             ?class  . 
                FILTER ( &lt;bif:st_intersects&gt; 
                          ( ?geo, 
                            &lt;bif:st_point&gt; (0, 52), 
                            100
                          )
                       )
          } 
GROUP BY  ?class 
ORDER BY  DESC 2 </code>
 </pre></blockquote>


<p>This returns the counts of objects of each class occurring within 100 km of (0, 52), a point near London.</p>
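<p>On the SQL side, the same two functions can be called directly. This is a small sketch rather than output from the system described here; the second point (roughly central London) is an illustrative value, and <code>st_point</code> is assumed to take longitude first and latitude second, as in the query above.</p>
<blockquote>
<pre><code>-- Tests whether the two points lie within 100 km of each other.
SELECT st_intersects ( st_point (0, 52), st_point (-0.12, 51.5), 100 );</code></pre>
</blockquote>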

<p>For any data set with <a href="http://dbpedia.org/resource/World_Geodetic_System" id="link-id0x1ec00578">WGS 84</a> <code>geo:long</code> and <code>geo:lat</code> values, a simple SQL function makes a point geometry for each such coordinate pair and adds it as the <code>geo:geometry</code> property of the subject with the long/lat.  This then enables fast spatial access to arbitrary location data in RDF.</p>
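<p>In the Virtuoso documentation, the entry point for this conversion is <code>DB.DBA.RDF_GEO_FILL</code>. The post itself does not name the function, so treat the name as an assumption to check against the documentation for your version.</p>
<blockquote>
<pre><code>-- Assumed function name. Adds a geo:geometry point for each subject
-- that has both a geo:long and a geo:lat value.
DB.DBA.RDF_GEO_FILL ();</code></pre>
</blockquote>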

<p>Right now, we hardly see any geometries other than points in RDF data, even though there are some efforts for vocabularies for more complex entities.  As these get adopted we will support them.</p>

<p>For scalability, we tried the implementation with <a href="http://www.openstreetmap.org/" id="link-id0x1c781e68">OpenStreetMap</a>&#39;s 350 million or so points.  The geometry implementation partitions well over a cluster, similarly to a full text index, i.e., every server has its slice of the geometries, partitioned by the geometry object&#39;s key rather than by coordinate range.  This way, the items are evenly spread even though the coordinate distribution is highly uneven.</p>

<p>We can do spatial joins like this:</p>

<blockquote>
 <pre><code>   SELECT  ?s 
           ( &lt;sql:num_or_null&gt; (?p) )  
           COUNT (*) 
    WHERE  { ?s   &lt;http://<a href="http://dbpedia.org/resource/DBpedia" id="link-id0x1f885868">dbpedia</a>.org/ontology/populationTotal&gt;  ?p    . 
             FILTER 
               ( &lt;sql:num_or_null&gt; (?p) &gt; 1000000 )                      . 
             ?s   geo:geometry                                   ?geo  .
             FILTER 
               ( &lt;bif:st_intersects&gt; ( ?pt, ?geo, 5 ) )                  . 
             ?xx  geo:geometry                                   ?pt 
           } 
 GROUP BY  ?s 
           ( &lt;sql:num_or_null&gt; (?p) )
 ORDER BY  DESC 3 
    LIMIT  20 </code> </pre></blockquote>

<p>This takes the DBpedia subjects that have a population over 1 million and a geometry.  For each such subject, we then count all the geometries within 5 km of its point location.  With DBpedia (about 5 million points), <a href="http://www.geonames.org/" id="link-id0x1d4279b0">GeoNames</a> (7 million points), and OpenStreetMap (350 million points), we get the result:</p>

<blockquote>
 <pre><code>http://dbpedia.org/resource/Munich                        1356594    117280
http://dbpedia.org/resource/London                        7355400     81486
http://dbpedia.org/resource/Davao_City                    1363337     58640
http://dbpedia.org/resource/Belo_Horizonte                2412937     58640
http://dbpedia.org/resource/Chengde                       3610000     58640
http://dbpedia.org/resource/Hamburg                       1769117     51664
http://dbpedia.org/resource/San_Diego%2C_California       1266731     47685
http://dbpedia.org/resource/Bursa                         1562828     47685
http://dbpedia.org/resource/Port-au-Prince                1082800     47685
http://dbpedia.org/resource/Oakland_County%2C_Michigan    1194156     45636
http://dbpedia.org/resource/Sana%27a                      1747627     40923
http://dbpedia.org/resource/Milan                         1303437     40923
http://dbpedia.org/resource/Campinas                      1059420     40923
http://dbpedia.org/resource/Hohhot                        2580000     40923
http://dbpedia.org/resource/Brussels                      1031215     40923
http://dbpedia.org/resource/Bogra_District                2988567     40923
http://dbpedia.org/resource/Cort%C3%A9s_Department        1202510     40923
http://dbpedia.org/resource/Berlin                        3416300     35668
http://dbpedia.org/resource/New_York_City                 8274527     30810
http://dbpedia.org/resource/Los_Angeles%2C_California     3849378     25614<br />
20 Rows. -- 1733 msec.<br />
Cluster 8 nodes, 1 s. 358 m/s 1596 KB/s  664% <a href="http://dbpedia.org/resource/Central_processing_unit" id="link-id0x1e6403b0">cpu</a> 2%  read 16% clw threads 1r 0w 0i buffers 1124351 0 d 0 w 0 pfs
</code></pre></blockquote>

<p>This takes 1.7 seconds on a Virtuoso Cluster configured with 8 processes on a single dual-Xeon 5520 box, running at about 664% CPU with warm <a href="http://dbpedia.org/resource/Cache" id="link-id0x1e81f610">cache</a>.  That is fair enough for a first crack, and it can obviously be optimized further.  Still, the geo part of the processing is already as good as instantaneous.</p>
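
<p>To look at one city instead of the top 20, the same join can be narrowed to a single subject.  This is only a sketch in the same Virtuoso dialect as the query above; the choice of London is arbitrary, and no count or timing is claimed for it.</p>

<blockquote>
 <pre><code>   # Count every geometry within 5 km of one subject's point.
   SELECT  COUNT (*) 
    WHERE  { &lt;http://dbpedia.org/resource/London&gt;  geo:geometry  ?geo  . 
             ?xx                                   geo:geometry  ?pt   . 
             FILTER 
               ( &lt;bif:st_intersects&gt; ( ?pt, ?geo, 5 ) ) 
           } </code>
 </pre></blockquote>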

<p>We will shortly have the geography features installed on DBpedia and the other data sets we host.  As these come online we will show more demo queries.</p>

<p>For more about SQL/MM, you can look to a couple of PDFs:</p>
<ul>
<li>
<a href="http://www.fer.hr/_download/repository/SQLMM_Spatial-_The_Standard_to_Manage_Spatial_Data_in_Relational_Database_Systems.pdf" id="link-id133775f0">SQL/MM Spatial: The Standard to Manage Spatial Data in
Relational Database Systems</a> by Knut Stolze</li>
<li>
  <a href="http://www.sigmod.org/record/issues/0112/standards.pdf" id="link-id1433c5e0">SQL Multimedia and Application Packages (SQL/MM)</a> by Jim Melton and Andrew Eisenberg</li>
</ul>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-10-27#1586">
  <rss:title>European Commission and the Data Overflow</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-10-27T18:29:51Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">The European Commission recently circulated a questionnaire to selected experts on what could be done for the future of big data. Since the questionnaire is public, I am publishing my answers below. Data and data types What volumes of data are we dealing with today? What is the growth rate? Where can we expect to be in 2015? Private data warehouses of corporations have more than doubled yearly for the past years; hundreds of TB is not exceptional. This will continue. The real shift is in structured data being published in increasing quantities with a minimum level of integrate-ability through use of RDF and linked data principles. There are rewards for use of standard vocabularies and identifiers through search engines recognizing such data. There is convergence around DBpedia identifiers for real-world entities, e.g., most things that would be in the news. This also means that internal data processes and silos may be enriched with this content. There is consequent pressure for accommodating more diversity of data, with more flexible schema. Ultimately, all content presently stored in RDBs and presented in public accessible dynamic web pages will end up on the web of linked data. Examples are product catalogs, price lists, event schedules and the like. The volume of the well known linked data sets is around 10 billion statements. With the above mentioned trends, growth by two or three orders of magnitude by 2015 seems reasonable, This is so especially if explicit semantics are extracted from the document web and if there is some further progress in the precision/recall of such extraction. Relevant sections of this mass of data are a potential addition to any present or future analytics application. 
Since arbitrary analytics over the database which is the web cannot be economically provided by a centralized search engine, a cloud model may be used for on-demand selection of relevant data and mixing it with private data. This will drive database innovation for the next years even more than the continued classical warehouse growth. Science data is another driver of the data overflow. For example, faster gene sequencing, more accurate measurements in high energy physics, better imaging, and remote sensing will produce large volumes of data. This data has highly regular structure but labeling this data with source and lineage calls for a flexible, schema-last, self-describing model, such as RDF and linked data. Data and metadata should travel together but may have different data models. By and large, the metadata of science data will be another stream to the web of linked data, at least to the degree it is publicly accessible. Restricted circles can and likely will implement similar ideas. What types of data can we deal with intelligently due to their inherent structure (geospatial, temporal, social or knowledge graphs, 3D, sensor streams...)? All the above types should be supported inside one DBMS so as to allow efficient querying combining conditions on all these types of data, e.g., photos of sunsets taken last summer in Ibiza, with over 20 megapixels, by people I know. Note that the test for being a sunset is an operation on the image blob that should be taken to the data; the images cannot be economically transferred. Interleaving of all database functions and types becomes increasingly important. Industries, communities Who is producing these data and why? Could they do it better? How? Right now, projects such as Bio2RDF, Neurocommons, and DBPedia produce this data. The processes are in place and are reasonable. Incremental improvement is to be expected. 
These processes, along with the linked data meme generally taking off, drive demand for better NLP (Natural Language Processing), e.g., entity and relationship extraction, especially extraction that can produce instance data in given ontologies (e.g., events) using common identifiers (e.g., DBPedia URIs). Mapping of RDBs to RDF is possible, and a W3C working group is developing standards for this. The required baseline level has been reached; the rest is a matter of automating deployment. Within the enterprise, there are advantages to be gained for information integration; e.g., all entities in the CRM space can be integrated with all email and support tickets through giving everything a URI. Some of this information may even be published on an extranet for self-service and web-service interfaces. This has been done at small scales and the rest is a matter of spreading adoption and lowering the entry barrier. Incremental progress will take place, eventually resulting in qualitatively better integration along the value chain when adoption is sufficiently widespread. Who is consuming these data and why? Could they do it better? How? Consumers are various. The greatest need is for tools that summarize complex data and allow getting a bird&#39;s eye view of what data is in the first instance available. Consuming the data is hindered by the user not even necessarily knowing what data there is. This is somewhat new, as traditionally the business analyst did know the schema of the warehouse and was proficient with SQL report generators and statistics packages. Where Web 2.0 made the citizen journalist, the web of linked data will make the citizen analyst. For this to happen, with benefits for individuals, enterprises, and governments alike, more work in user interfaces, knowledge discovery, and query composition will be useful. 
We may envision a &quot;meshup economy&quot; where data is plentiful, but the unit of value and exchange is the smart report that crystallizes actionable value from this ocean. What industrial sectors in Europe could become more competitive if they became much better at managing data? Any sector could benefit. Early adopters are seen in the biomedical field and to an extent in media. Is the regulation landscape imposing constraints (privacy, compliance ...) that don&#39;t have today good tool support? The regulation landscape drives database demand through data retention requirements and the like. With data integration, especially with privacy-sensitive data (as in medicine), there are issues of whether one dares put otherwise-shareable information online. Regulation is needed to protect individuals, but integration should still be possible for science. For this, we see a need for progress in applying policy-based approaches (e.g., row level security) to relatively schema-last data such as RDF. This is possible but needs some more work. Also, creating on-the-fly-anonymizing views on data might help. More research is needed for reconciling the need for security with the advantages of broad-based ad hoc integration. Ideally, data should be intelligent, aware of its origins and classification and cautious of whom it interacts with, all of this supported under the covers so that the user could ask anything but the data might refuse to answer or might restrict answers according to the user&#39;s profile. This is a tall order and implementing something of the sort is an open question. What are the main practical problem identified for individuals and organizations? Please give examples and tell us about the main obstacles and barriers. We have come across the following: Knowing that the data exists in the first place. If the data is found, figuring out the provenance, units and precision of measurement, identifiers, and the like. 
Compatible subject matter but incompatible representation: For example, one has numbers on a map with different maps for different points in time; another has time series of instrument data with geo-location for the instrument. It is only to be expected that the time interval between measurements is not the same. So there is need for a lot of one-off programming to align data. Other problems have to do with sheer volume, i.e., transfer of data even in a local area network is too slow, let alone over a wide area network. Computation needs to go to the data, and databases need to support this. Services, software stacks, protocols, standards, benchmarks What combinations of components are needed to deal with these problems? Recent times have seen a proliferation of special purpose databases. Since the data needs of the future are about combining data with maximum agility and minimum performance hit, there is need to gather the currently-separate functionality into an integrated system with sufficient flexibility. We see some of this in integration of map-reduce and scale-out databases. The former antagonists have become partners. Vertica, Greenplum, and OpenLink Virtuoso are example of DBMS featuring work in this direction. Interoperability and at least de facto standards in ways of doing this will emerge. What data exchange and processing mechanisms will be needed to work across platforms and programming languages? HTTP, XML, and RDF are in fact very verbose, yet these are the formats and models that have uptake. Thus, these will continue to be used even though one might think binary formats to be more efficient. There are of course science data set standards that are more compressed and these will continue, hopefully adding a practice of rich metadata in RDF. For internals of systems, MPI and TCP/IP with proprietary optimized wire formats will continue. Inter-system communication will likely continue to be HTTP, XML, and RDF as appropriate. 
What data environments are today so wastefully messy that they would benefit from the development of standards? RDF and OWL are not messy but they could use some more performance; we are working on this. SPARQL is finally acquiring the capabilities of a serious query language, so things are slowly coming together. Community process for developing application domain specific vocabularies works quite well, even though one could argue it is ad hoc and not up to what a modeling purist might wish. Top-down imposition of standards has a mixed history, with long and expensive development and sometimes no or little uptake, consider some WS* standards for example. What kind of performance is expected or required of these systems? Who will measure it reliably? How? Relational databases have a history of substantial investment in optimization and some of them are very good for what they do, e.g., the newer generation of analytics databases. The very large schema-last, no-SQL, sometimes eventually consistent key-value stores have a somewhat shorter history but do fill a real need. These trends will merge: Extreme scale, schema-last, complex queries, even more complex inference, custom code for in-database machine learning and other bulk processing. We find RDF augmented with some binary types at this crossroads. This point of the design space will have to provide performance roughly on the level of today&#39;s best relational solution for workloads that fit the relational model. The added cost of schema-last and inference must come down. We are working on this. Research work such as carried out with MonetDB gives clues as to how these aims can be reached. The separation of query language and inference is artificial. After the concepts are mature, these functions will merge and execute close to the data; there are clear evolutionary pressures in this direction. Benchmarks are key. Some gain can be had even from repurposing standard relational benchmarks like TPC-H. 
But the TPC-H rules do not allow official reporting of such. Development of benchmarks for RDF, complex queries, and inference is needed. A bold challenge to the community, it should be rooted in real-life integration needs and involve high heterogeneity. A key-value store benchmark might also be conceived. A transaction benchmark like TPC-C might be the basis, maybe augmented with massive user-generated content like reviews and blogs. If benchmarks exist and are not too easy nor inaccessibly difficult nor too expensive to run — think of the high end TPC-C results — then TPC-style rules and processes would be quite adequate. The threshold to publish should be lowered: Everybody runs the TPC workloads internally but few publish. Some EC initiative for benchmarking could make sense, similar to the TREC initiative of the US government. Industry should be consulted for the specific content; possibly the answers to the present questionnaire can provide an approximate direction. Benchmarks should be run by software vendors on their own systems, tuned by themselves. But there should be a process of disclosure and auditing; the TPC rules give an example. Compliance should not be too expensive or time consuming. Some community development for automating these things would be a worthwhile target for EC funding. Usability and training How difficult will it be for a developer of average competence to deploy components whose core is based on rather deep computer science? Do we all need to understand Monads and Continuations? What can be done to make it ever easier? In the database world, huge advances in technology have taken place behind a relatively simple and stable interface: SQL. For the linked data web, the same will take place behind SPARQL. Beyond these, for example, programming with MPI with good utilization of a cluster platform for an arbitrary algorithm, is quite difficult. The casual amateur is hereby warned. There is no single solution. 
For automatic parallelization, since explicit, programmatic parallelization of things with MPI for example is very unscalable in terms of required skill, we should favor declarative and/or functional approaches. Developing a debugger and explanation engine for rule-based and description-logics-based inference would be an idea. For procedural workloads, things like Erlang may be good in cases and are not overly difficult in principle, especially if there are good debugging facilities. For shipping functions in a cluster or cloud, the BOOM (Berkeley Orders Of Magnitude) approach or logic programming with explicit specification of compute location seem promising, surely more flexible than map-reduce. The question is whether a PHP developer can be made to do logic programming. This bridge will be crossed only with actual need and even then reluctantly. We may look at the Web 2.0 practice of sharding MySQL, inconvenient as this may be, for an example. There is inertia and thus re-architecting is a constant process that is generally in reaction to facts, post hoc, often a point solution. One could argue that planning ahead would be smarter but by and large the world does not work so. One part of the answer is an infinitely-scalable SQL database that expands and shrinks in the clouds, with the usual semantics, maybe optional eventual consistency and built-in map reduce. If such a thing is inexpensive enough and syntax-level-compatible with present installed base, many developers do not have to learn very much more. This is maybe good for the bread-and-butter IT, but European competitiveness should not rest on this. Therefore we wish to go for bold new application types for which the client-server database application is not the model. Data-centric languages like BOOM, if they can be made very efficient and have good debugging support, are attractive there. 
These do require more intellectual investment but that is not a problem since the less-inquisitive part of the developer community is served by the first part of the answer. How is a developer of average skills going to learn about these new advanced tools? How can we plan for excellent documentation and training, community mentoring, exchange of good practices, etc... across all EU countries? For the most part, developers do not learn things for the sake of learning. When they have learned something and it is adequate, they stay with it for the most part and are even reluctant to engage in cross-camps interaction. The research world is often similarly insular. A new inflection in the application landscape is needed to drive learning. This inflection is provided by the ubiquity of mobile devices, sensor data, explicit semantics, NLP concept extraction, web of linked data, and such factors. RDFa is a good example of a new technique piggybacking on something everybody uses, namely HTML. These new things should, within possibility, be deployed in the usual technology stack, LAMP or Java. Of course these do not have to be LAMP or Java or HTML or HTTP themselves but they must manifest through these. A lot of the semantic web potential can be realized within the client-server database application model, thus no fundamental re-architecting, just some new data types and queries. For data- or processing-intensive tasks, an on-demand hookup to cloud-based servers with Erlang and/or BOOM for programming model would be easy enough to learn and utilize. The question is one of providing challenges. Addressing actual challenges with these techniques will lead to maturity, documentation, examples, and training. With virtual, Europe-wide distributed teams a reality in many places, Europe-wide dissemination is no longer insurmountable. As the data overflow proceeds, its victims will multiply and create demand for solutions. 
The EC could here encourage research project use cases gaining an extended life past the end of research projects, possibly being maintained and multiplied and spun off. If such things could be mutated into self-sustaining service businesses with pay-per-use revenue, say through a cloud SaaS business model, still primarily leveraging an open source technology stack, we could have self-propagating and self-supporting models for exploiting advanced IT. This would create interest, and interest would drive training and dissemination. The problem is creating the pull. Challenges What should be, in this domain, the equivalent of the Netflix challenge, Ansari X Prize, Google Lunar X Prize, etc. ... ? The EC itself no doubt suffers from data overflow in one function or another. Unless security/secrecy prohibits, simply publishing a large data set and a description of what operations should be done on it would be a start. The more real the data, the better — reality is consistently more complex and surprising than imagination. Since many interesting problems touch on fraud detection and law enforcement, there may be some security obstacles for using these application domains as subject matters of open challenges. Once there is a good benchmark, as discussed above, there can be some prize money allocated for the winners, specially if the race is tight. The Semantic Web Challenge and the Billion Triples Challenge exist and are useful as such, but do not seem to have any huge impact. The incentives should be sufficient and part of the expenses arising from running for such challenges could be funded. Otherwise investing in existing business development will be more interesting to industry. Some industry participation seems necessary; we would wish academia and industry to work closer. Also, having industry supply the baseline guarantees that academia actually does further the state of the art. This is not always certain. 
If challenges are based on actual problems, whether of the EC, its member governments, or private entities, and winning the challenge may lead to a contract for supplying an actual solution, these will naturally become more interesting for consortia involving integrators, specialist software vendors, and academia. Such a model would build actual capacity to deploy leading edge technologies in production, which is sorely needed. What should one do to set up such a challenge, administer, and monitor it? The EC should probably circulate a call for actual problem scenarios involving big data. If the matter of the overflow is as dire as represented, cases should be easy to find. A few should be selected and then anonymized if needed. The party with the use case would benefit by having hopefully the best work on it. The contestants would benefit from having real world needs guide R&amp;D. The EC would not have to do very much, except possibly use some money for funding the best proposals. The winner would possibly get a large account and related sales and service income. The contestants would have to be teams possibly involving many organizations; for example, development and first-line services and support could come from different companies along a systems integrator model such as is widely used in the US. There may be a good benchmark at the time, possibly resulting from FP7 itself. In such a case, the EC could offer a prize for winners. Details would have to be worked out case by case. Such a challenge could be repeated a few times, as benchmark-driven progress in databases or TREC for example have taken some years to reach a point of slowdown in progress. Administrating such an activity should not be prohibitive, as most of the expertise can be found with the stakeholders.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>The European Commission recently circulated a questionnaire to selected experts on what could be done for the future of big <a href="http://dbpedia.org/resource/Data" id="link-id0x43bae00">data</a>.</p>
 
<p>Since the <a href="http://cordis.europa.eu/fp7/ict/content-knowledge/consultation_en.html" id="link-id1191c0f8">questionnaire is public</a>, I am publishing my answers below.</p>

<ol type="1" start="1">
<li>
  <p>
    <b>Data and data types</b>
  </p>

<ol type="a" start="1">
	<li>
    <p>
        <b>What volumes of data are we dealing with today? What is the growth rate? Where can we expect to be in 2015? </b>
    </p>

<p>Private data warehouses of corporations have more than doubled yearly in recent years; hundreds of TB is not exceptional.  This will continue. The real shift is in structured data being published in increasing quantities with a minimum level of integrate-ability through use of <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x5c7add0">RDF</a> and <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x5c7adb8">linked data</a> principles. There are rewards for use of standard vocabularies and identifiers through search engines recognizing such data.  There is convergence around <a href="http://dbpedia.org/resource/DBpedia" id="link-id0x5c7ada0">DBpedia</a> identifiers for real-world entities, e.g., most things that would be in the news.</p>

<p>This also means that internal data processes and silos may be enriched with this content.  There is consequent pressure for accommodating more diversity of data, with more flexible <a href="http://dbpedia.org/resource/Database_schema" id="link-id0x7d87a88">schema</a>.</p>

<p>Ultimately, all content presently stored in RDBs and presented in publicly accessible dynamic web pages will end up on the web of linked data.  Examples are product catalogs, price lists, event schedules, and the like.</p>

<p>The volume of the well-known linked data sets is around 10 billion statements.  With the above-mentioned trends, growth by two or three orders of magnitude by 2015 seems reasonable.  This is especially so if explicit semantics are extracted from the document web and if there is some further progress in the precision/recall of such extraction.</p>

<p>Relevant sections of this mass of data are a potential addition to any present or future analytics application.</p>

<p>Since arbitrary analytics over the database which is the web cannot be economically provided by a centralized search engine, a cloud model may be used for on-demand selection of relevant data and mixing it with private data.  This will drive database innovation for the next years even more than the continued classical warehouse growth.</p>

<p>Science data is another driver of the data overflow.  For example, faster gene sequencing, more accurate measurements in high energy physics, better imaging, and remote sensing will produce large volumes of data.  This data has highly regular structure but labeling this data with source and lineage calls for a flexible, schema-last, self-describing model, such as RDF and linked data.  Data and <a href="http://dbpedia.org/resource/Metadata" id="link-id0x7a3fb40">metadata</a> should travel together but may have different data models.</p>

<p>By and large, the metadata of science data will be another stream to the web of linked data, at least to the degree it is publicly accessible.  Restricted circles can and likely will implement similar ideas.</p>
    </li>

<li>
    <p>
        <b>What types of data can we deal with intelligently due to their inherent structure (geospatial, temporal, social or <a href="http://dbpedia.org/resource/Knowledge" id="link-id0x5a48058">knowledge</a> graphs, 3D, sensor streams...)?</b>
    </p>

<p>All the above types should be supported inside one DBMS so as to allow efficient querying combining conditions on all these types of data, e.g., <i>photos of sunsets taken last summer in Ibiza, with over 20 megapixels, by people I know.</i>
      </p>

<p>Note that the test for being a sunset is an operation on the image blob that should be taken to the data; the images cannot be economically transferred.</p>

<p>Interleaving of all database functions and types becomes increasingly important.</p>
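
<p>As a purely illustrative sketch, the sunset example might be written as a single query that mixes social-graph, temporal, numeric, geospatial, and image-content conditions.  Nearly everything in it is an assumption: the <code>ex:</code> vocabulary, the <code>ex:isSunset</code> function standing in for an in-database image test, the <code>http://example.org/me</code> identifier, the approximate coordinates and radius for Ibiza, and the reading of <i>last summer</i> as June through August 2009.</p>

<blockquote>
 <pre><code>PREFIX geo:     &lt;http://www.w3.org/2003/01/geo/wgs84_pos#&gt;
PREFIX foaf:    &lt;http://xmlns.com/foaf/0.1/&gt;
PREFIX dcterms: &lt;http://purl.org/dc/terms/&gt;
PREFIX xsd:     &lt;http://www.w3.org/2001/XMLSchema#&gt;
PREFIX ex:      &lt;http://example.org/photo#&gt;     # hypothetical vocabulary

SELECT  ?photo
 WHERE  { # social graph: photos made by people I know
          &lt;http://example.org/me&gt;  foaf:knows  ?friend .
          ?photo  foaf:maker       ?friend ;
                  dcterms:created  ?taken  ;
                  ex:megapixels    ?mp     ;    # hypothetical property
                  geo:geometry     ?where  .
          # temporal: assumed date range for "last summer"
          FILTER ( ?taken &gt;= "2009-06-01T00:00:00Z"^^xsd:dateTime &amp;&amp;
                   ?taken &lt;  "2009-09-01T00:00:00Z"^^xsd:dateTime )
          # numeric: over 20 megapixels
          FILTER ( ?mp &gt; 20 )
          # geospatial: within about 30 km of an approximate point on Ibiza (long, lat)
          FILTER ( &lt;bif:st_intersects&gt; ( ?where, &lt;bif:st_point&gt; ( 1.43, 38.98 ), 30 ) )
          # image content: hypothetical function evaluated next to the blob
          FILTER ( ex:isSunset ( ?photo ) )
        } </code>
 </pre></blockquote>

<p>The point of the sketch is that all five conditions sit in one query, so the engine can order them by selectivity and run the expensive image test only on the few candidates that survive the others.</p>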
</li>
  </ol>
</li>


<li>
  <p>
    <b>Industries, communities</b>
  </p>

<ol type="a" start="1">
	<li>
    <p>
        <b>Who is producing these data and why? Could they do it better? How?</b>
    </p>

<p>Right now, projects such as <a href="http://www.bio2rdf.org/" id="link-id0x2a29de8">Bio2RDF</a>, <a href="http://neurocommons.org/page/Main_Page" id="link-id0x7ddaed0">Neurocommons</a>, and DBpedia produce this data.  The processes are in place and are reasonable.  Incremental improvement is to be expected.  These processes, along with the <a href="http://www.w3.org/DesignIssues/LinkedData.html" id="link-id0xbab4dfd0">linked data meme</a> generally taking off, drive demand for better <a href="http://dbpedia.org/resource/Natural_language_processing" id="link-id0x51f4e0">NLP</a> (<a href="http://dbpedia.org/resource/Natural_language_processing" id="link-id0x51a1b48">Natural Language Processing</a>), e.g., <a href="http://dbpedia.org/resource/Entity" id="link-id0x956680">entity</a> and relationship extraction, especially extraction that can produce instance data in given ontologies (e.g., events) using common identifiers (e.g., DBpedia URIs).</p>

<p>Mapping of RDBs to RDF is possible, and a W3C working group is developing standards for this.  The required baseline level has been reached; the rest is a matter of automating deployment.  Within the enterprise, there are advantages to be gained for <a href="http://dbpedia.org/resource/Information" id="link-id0x7da9e80">information</a> integration; e.g., all entities in the CRM space can be integrated with all email and support tickets through giving everything a <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id0x71673f8">URI</a>.  Some of this information may even be published on an <a href="http://dbpedia.org/resource/Extranet" id="link-id0x9aa6e0">extranet</a> for self-service and web-service interfaces.  This has been done at small scales and the rest is a matter of spreading adoption and lowering the entry barrier.  Incremental progress will take place, eventually resulting in qualitatively better integration along the value chain when adoption is sufficiently widespread.</p>

</li>
	<li>
    <p>
        <b>Who is consuming these data and why? Could they do it better? How?</b>
    </p>

<p>Consumers are various.  The greatest need is for tools that summarize complex data and give a bird&#39;s-eye view of what data is available in the first place.  Consuming the data is hindered by the user not even necessarily knowing what data there is.  This is somewhat new, as traditionally the business analyst did know the schema of the warehouse and was proficient with <a href="http://dbpedia.org/resource/SQL" id="link-id0x7f7b148">SQL</a> report generators and statistics packages.</p>

<p>Where Web 2.0 made the <i>citizen journalist</i>, the web of linked data will make the <i>citizen analyst</i>.  For this to happen, with benefits for individuals, enterprises, and governments alike, more work in user interfaces, knowledge discovery, and query composition will be useful.  We may envision a &quot;meshup economy&quot; where data is plentiful, but the unit of value and exchange is the smart report that crystallizes actionable value from this ocean.</p>

</li>
	<li>
    <p>
        <b>What industrial sectors in Europe could become more competitive if they became much better at managing data?</b>
    </p>

<p>Any sector could benefit.  Early adopters are seen in the biomedical field and to an extent in media.  </p>

</li>
	<li>
    <p>
        <b>Is the regulation landscape imposing constraints (privacy, compliance ...) that don&#39;t have today good tool support?</b>
    </p>

<p>The regulation landscape drives database demand through data retention requirements and the like.</p>

<p>With data integration, especially with privacy-sensitive data (as in medicine), there are issues of whether one dares put otherwise-shareable information online.   Regulation is needed to protect individuals, but integration should still be possible for science.</p>

<p>For this, we see a need for progress in applying policy-based approaches (e.g., row level security) to relatively schema-last data such as RDF.  This is possible but needs some more work.  Also, creating on-the-fly-anonymizing views on data might help.</p>
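<p>As a rough illustration of such a policy-filtered, anonymizing view, here is a minimal Python sketch.  It does not describe any existing product feature: the sample triples, the sensitivity labels, and the <code>pseudonym</code> and <code>view</code> helpers are all hypothetical, and a salted hash is only pseudonymization, which by itself may well fall short of what regulation requires.</p>

<pre>
```python
import hashlib

# Hypothetical sample data: (subject, predicate, object, sensitivity label).
TRIPLES = [
    ("patient:17", "hasDiagnosis", "asthma", "restricted"),
    ("patient:17", "livesIn", "Helsinki", "public"),
    ("patient:42", "hasDiagnosis", "diabetes", "restricted"),
]


def pseudonym(subject, salt):
    """Replace a subject identifier with a stable, salted hash."""
    digest = hashlib.sha256((salt + subject).encode("utf-8")).hexdigest()
    return "anon:" + digest[:12]


def view(triples, allowed_labels, salt):
    """Yield only the triples the caller may see, with subjects pseudonymized."""
    for s, p, o, label in triples:
        if label in allowed_labels:
            yield (pseudonym(s, salt), p, o)


# A caller cleared only for public data sees one triple; a researcher
# cleared for both labels sees all three, still without real identifiers.
public_rows = list(view(TRIPLES, {"public"}, "demo-salt"))
research_rows = list(view(TRIPLES, {"public", "restricted"}, "demo-salt"))
```
</pre>

<p>The same patient maps to the same pseudonym within one view, so joins across triples still work, while the original identifier is never returned.</p>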

<p>More research is needed for reconciling the need for security with the advantages of broad-based <i>ad hoc</i> integration.  Ideally, data should be intelligent, aware of its origins and classification and cautious of whom it interacts with, all of this supported under the covers so that the user could ask anything but the data might refuse to answer or might restrict answers according to the user&#39;s profile.  This is a tall order and implementing something of the sort is an open question.</p>


</li>
	<li>
    <p>
        <b>What are the main practical problems identified for individuals and organizations? Please give examples and tell us about the main obstacles and barriers.</b>
    </p>

<p>We have come across the following:</p>

<ul>
        <li>Knowing that the data exists in the first place.</li>
<li>If the data is found, figuring out the provenance, units and precision of measurement, identifiers, and the like.</li>
<li>Compatible subject matter but incompatible representation:  For example, one has numbers on a map with different maps for different points in time; another has time series of instrument data with geo-location for the instrument.  It is only to be expected that the time interval between measurements is not the same.  So there is need for a lot of one-off programming to align data.</li>
      </ul>
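<p>The one-off alignment code mentioned in the last item often looks something like the following Python sketch.  It assumes two hypothetical series of (time, value) pairs sampled at different intervals and resamples both onto a common grid by linear interpolation; the function names and sample figures are made up for illustration, and real data would also need unit and identifier reconciliation.</p>

<pre>
```python
from bisect import bisect_left


def interpolate(series, t):
    """Linearly interpolate a time-sorted list of (time, value) pairs at time t."""
    times = [ts for ts, _ in series]
    if t > times[-1] or times[0] > t:
        raise ValueError("t lies outside the measured interval")
    i = bisect_left(times, t)
    if times[i] == t:
        return series[i][1]
    (t0, v0), (t1, v1) = series[i - 1], series[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def align(series_a, series_b, grid):
    """Return (t, value_a, value_b) rows for every time point in grid."""
    return [(t, interpolate(series_a, t), interpolate(series_b, t)) for t in grid]


# Hypothetical inputs: one source reports every 10 time units, the other every 4.
map_values = [(0, 10.0), (10, 20.0)]
sensor_values = [(0, 1.0), (4, 5.0), (8, 9.0), (12, 13.0)]
aligned = align(map_values, sensor_values, [0, 5, 10])
```
</pre>

<p>Each new pair of sources tends to need its own variant of this, which is exactly the cost being described.</p>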

<p>Other problems have to do with sheer volume: transferring the data is too slow even in a local area network, let alone over a wide area network.  Computation needs to go to the data, and databases need to support this.</p>

</li>
  </ol>
</li>

<li>
  <p>
    <b>Services, software stacks, protocols, standards, benchmarks</b>
  </p>

<ol type="a" start="1">
	<li>
    <p>
        <b>What combinations of components are needed to deal with these problems?</b>
    </p>

<p>Recent times have seen a proliferation of special purpose databases.  Since the data needs of the future are about combining data with maximum agility and minimum performance hit, there is a need to gather the currently-separate functionality into an integrated system with sufficient flexibility.  We see some of this in the integration of map-reduce and scale-out databases.  The former antagonists have become partners. Vertica, <a href="http://dbpedia.org/resource/Greenplum" id="link-id0x7a94e70">Greenplum</a>, and OpenLink <a href="http://virtuoso.openlinksw.com" id="link-id0x2ab2868">Virtuoso</a> are examples of DBMSs featuring work in this direction.</p>

<p>Interoperability and at least <i>de facto</i> standards in ways of doing this will emerge.</p>

</li>
	<li>
    <p>
        <b>What data exchange and processing mechanisms will be needed to work across platforms and programming languages?</b>
    </p>

<p>
        <a href="http://dbpedia.org/resource/Hypertext_Transfer_Protocol" id="link-id0x78a0458">HTTP</a>, <a href="http://dbpedia.org/resource/XML" id="link-id0x7ff2360">XML</a>, and RDF are in fact very verbose, yet these are the formats and models that have uptake.  Thus, these will continue to be used even though one might think binary formats to be more efficient.</p>

<p>There are of course science data set standards that are more compressed and these will continue, hopefully adding a practice of rich metadata in RDF.</p>

<p>For internals of systems, MPI and TCP/IP with proprietary optimized wire formats will continue.  Inter-system communication will likely continue to be HTTP, XML, and RDF as appropriate.</p>


</li>
	<li>
    <p>
        <b>What data environments are today so wastefully messy that they would benefit from the development of standards?</b>
    </p>


<p>RDF and <a href="http://dbpedia.org/resource/Web_Ontology_Language" id="link-id0x5643d70">OWL</a> are not messy but they could use some more performance; we are working on this.  <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x152ab18">SPARQL</a> is finally acquiring the capabilities of a serious query language, so things are slowly coming together.</p>

<p>Community process for developing application domain specific vocabularies works quite well, even though one could argue it is <i>ad hoc</i> and not up to what a modeling purist might wish.</p>

<p>Top-down imposition of standards has a mixed history, with long and expensive development and sometimes little or no uptake; consider some WS-* standards, for example.</p>

</li>
	<li>
    <p>
        <b>What kind of performance is expected or required of these systems? Who will measure it reliably? How?</b>
    </p>

<p>Relational databases have a history of substantial investment in <a href="http://dbpedia.org/resource/Program_optimization" id="link-id0xecc100">optimization</a> and some of them are very good for what they do, e.g., the newer generation of analytics databases.</p>

<p>The very large schema-last, no-SQL, sometimes eventually consistent key-value stores have a somewhat shorter history but do fill a real need.</p>

<p>These trends will merge:  Extreme scale, schema-last, complex queries, even more complex inference, custom code for in-database machine learning and other bulk processing.</p>

<p>We find RDF augmented with some binary types at this crossroads.  This point of the design space will have to provide performance roughly on the level of today&#39;s best relational solution for workloads that fit the relational model.  The added cost of schema-last and inference must come down.  We are working on this.  Research work such as carried out with <a href="http://dbpedia.org/resource/MonetDB" id="link-id0x7ae2890">MonetDB</a> gives clues as to how these aims can be reached.</p>

<p>The separation of query language and inference is artificial.  After the concepts are mature, these functions will merge and execute close to the data; there are clear evolutionary pressures in this direction.</p>

<p>Benchmarks are key.  Some gain can be had even from repurposing standard relational benchmarks like <a href="http://www.tpc.org/" id="link-id0x71eb528">TPC</a>-<a href="http://dbpedia.org/resource/TPC-H" id="link-id0x5e16a40">H</a>.  But the TPC-H rules do not allow official reporting of such.</p>

<p>Development of benchmarks for RDF, complex queries, and inference is needed.  A bold challenge to the community, it should be rooted in real-life integration needs and involve high heterogeneity.  A key-value store benchmark might also be conceived.  A transaction benchmark like TPC-C might be the basis for it, maybe augmented with massive user-generated content like reviews and blogs.</p>

<p>If benchmarks exist and are not too easy nor inaccessibly difficult nor too expensive to run — think of the high end TPC-C results — then TPC-style rules and processes would be quite adequate.  The threshold to publish should be lowered:  Everybody runs the TPC workloads internally but few publish.</p>

<p>Some EC initiative for benchmarking could make sense, similar to the TREC initiative of the US government.  Industry should be consulted for the specific content; possibly the answers to the present questionnaire can provide an approximate direction.</p>

<p>Benchmarks should be run by software vendors on their own systems, tuned by themselves.  But there should be a process of disclosure and auditing; the TPC rules give an example.  Compliance should not be too expensive or time consuming.  Some community development for automating these things would be a worthwhile target for EC funding.</p>

</li>
  </ol>
</li>

<li>
  <p>
    <b>Usability and training</b>
  </p>

<ol type="a" start="1">

	<li>
    <p>
        <b>How difficult will it be for a developer of average competence to deploy components whose core is based on rather deep computer science? Do we all need to understand Monads and Continuations? What can be done to make it ever easier?</b>
    </p>

<p>In the database world, huge advances in technology have taken place behind a relatively simple and stable interface: SQL.  For the linked data <a href="http://dbpedia.org/resource/Giant_Global_Graph" id="link-id0x7761e50">web</a>, the same will take place behind SPARQL.</p>

<p>Beyond these interfaces, things get harder.  For example, programming an arbitrary algorithm with MPI so that it makes good use of a cluster platform is quite difficult.  The casual amateur is hereby warned.</p>

<p>There is no single solution.  Since explicit, programmatic parallelization, with MPI for example, scales very poorly in terms of required skill, we should favor declarative and/or functional approaches that lend themselves to automatic parallelization.</p>

<p>Developing a debugger and explanation engine for rule-based and description-logics-based inference would be an idea.</p>

<p>For procedural workloads, things like Erlang may be good in cases and are not overly difficult in principle, especially if there are good debugging facilities.</p>

<p>For shipping functions in a cluster or cloud, the <a href="http://www.eecs.berkeley.edu/Research/Projects/Data/105733.html" id="link-id0x5494b0">BOOM</a> (<a href="http://www.eecs.berkeley.edu/Research/Projects/Data/105733.html" id="link-id0x7f1f148">Berkeley Orders Of Magnitude</a>) approach or logic programming with explicit specification of compute location seem promising, surely more flexible than map-reduce.  The question is whether a <a href="http://dbpedia.org/resource/PHP" id="link-id0x5c758c8">PHP</a> developer can be made to do logic programming.</p>

<p>This bridge will be crossed only with actual need and even then reluctantly.  We may look at the Web 2.0 practice of sharding <a href="http://dbpedia.org/resource/MySQL" id="link-id0x432f868">MySQL</a>, inconvenient as this may be, as an example.  There is inertia and thus re-architecting is a constant process that is generally in reaction to facts, <i>post hoc</i>, often a point solution.  One could argue that planning ahead would be smarter but by and large the world does not work that way.</p>

<p>One part of the answer is an infinitely-scalable SQL database that expands and shrinks in the clouds, with the usual semantics, maybe optional eventual consistency and built-in map reduce.  If such a thing is inexpensive enough and syntax-level-compatible with the present installed base, many developers will not have to learn very much more.</p>

<p>This is maybe good for the bread-and-butter IT, but European competitiveness should not rest on this.  Therefore we wish to go for bold new application types for which the client-server database application is not the model.  Data-centric languages like BOOM, if they can be made very efficient and have good debugging support, are attractive there.  These do require more intellectual investment but that is not a problem since the less-inquisitive part of the developer community is served by the first part of the answer.</p>

</li>
	<li>
    <p>
        <b>How is a developer of average skills going to learn about these new advanced tools? How can we plan for excellent documentation and training, community mentoring, exchange of good practices, etc... across all EU countries?</b>
    </p>

<p>For the most part, developers do not learn things for the sake of learning.  When they have learned something and it is adequate, they stay with it for the most part and are even reluctant to engage in cross-camps interaction.  The research world is often similarly insular.  A new inflection in the application landscape is needed to drive learning.  This inflection is provided by the <a href="https://wiki.mozilla.org/Labs/Ubiquity" id="link-id0x7f051c8">ubiquity</a> of mobile devices, sensor data, explicit semantics, NLP concept extraction, web of linked data, and such factors.</p>

<p>RDFa is a good example of a new technique piggybacking on something everybody uses, namely HTML.  These new things should, within possibility, be deployed in the usual technology stack, <a href="http://en.wikipedia.org/wiki/LAMP_%28software_bundle%29" id="link-id0x77151e0">LAMP</a> or Java.  Of course these do not have to be LAMP or Java or HTML or HTTP themselves but they must manifest through these.</p>

<p>A lot of the <a href="http://dbpedia.org/resource/Semantic_Web" id="link-id0x7940cd0">semantic web</a> potential can be realized within the client-server database application model, thus no fundamental re-architecting, just some new data types and queries.</p>

<p>For data- or processing-intensive tasks, an on-demand hookup to cloud-based servers with Erlang and/or BOOM for programming model would be easy enough to learn and utilize.</p>

<p>The question is one of providing challenges.  Addressing actual challenges with these techniques will lead to maturity, documentation, examples, and training.  With virtual, Europe-wide distributed teams a reality in many places, Europe-wide dissemination is no longer insurmountable.</p>

<p>As the data overflow proceeds, its victims will multiply and create demand for solutions.  The EC could here encourage research project use cases gaining an extended life past the end of research projects, possibly being maintained and multiplied and spun off.</p>

<p>If such things could be mutated into self-sustaining service businesses with pay-per-use revenue, say through a cloud SaaS business model, still primarily leveraging an open source technology stack, we could have self-propagating and self-supporting models for exploiting advanced IT.  This would create interest, and interest would drive training and dissemination.</p>

<p>The problem is creating the pull.</p>
</li>
  </ol>
</li>

<li>
  <p>
    <b>Challenges</b>
  </p>
<ol type="a" start="1">

	<li>
    <p>
        <b>What should be, in this domain, the equivalent of the Netflix challenge, Ansari X Prize, <a href="http://dbpedia.org/resource/Google" id="link-id0x7e72f40">Google</a> Lunar X Prize, etc. ... ?</b>
    </p>

<p>The EC itself no doubt suffers from data overflow in one function or another.  Unless security/secrecy prohibits, simply publishing a large data set and a description of what operations should be done on it would be a start.  The more real the data, the better — reality is consistently more complex and surprising than imagination.  Since many interesting problems touch on fraud detection and law enforcement, there may be some security obstacles for using these application domains as subject matters of open challenges.</p>

<p>Once there is a good benchmark, as discussed above, there can be some prize money allocated for the winners, especially if the race is tight.</p>

<p>The Semantic Web Challenge and the Billion Triples Challenge exist and are useful as such, but do not seem to have any huge impact.</p>

<p>The incentives should be sufficient and part of the expenses arising from competing in such challenges could be funded.  Otherwise investing in existing business development will be more interesting to industry.  Some industry participation seems necessary; we would wish academia and industry to work more closely together.  Also, having industry supply the baseline guarantees that academia actually does further the state of the art.  This is not always certain.</p>

<p>If challenges are based on actual problems, whether of the EC, its member governments, or private entities, and winning the challenge may lead to a contract for supplying an actual solution, these will naturally become more interesting for consortia involving integrators, specialist software vendors, and academia.  Such a model would build actual capacity to deploy leading edge technologies in production, which is sorely needed.</p>


</li>
	<li>
    <p>
        <b>What should one do to set up, administer, and monitor such a challenge?</b>
    </p>

<p>The EC should probably circulate a call for actual problem scenarios involving big data.  If the matter of the overflow is as dire as represented, cases should be easy to find.  A few should be selected and then anonymized if needed.</p>

<p>The party with the use case would benefit by having hopefully the best work on it.  The contestants would benefit from having real world needs guide R&amp;D.  The EC would not have to do very much, except possibly use some money for funding the best proposals.  The winner would possibly get a large account and related sales and service income.  The contestants would have to be teams possibly involving many organizations; for example, development and first-line services and support could come from different companies along a systems integrator model such as is widely used in the US.</p>

<p>There may be a good benchmark at the time, possibly resulting from FP7 itself.  In such a case, the EC could offer a prize for winners.  Details would have to be worked out case by case.  Such a challenge could be repeated a few times, as benchmark-driven progress in databases or TREC for example have taken some years to reach a point of slowdown in progress.</p>

<p>Administering such an activity should not be prohibitive, as most of the expertise can be found with the stakeholders.</p>

</li>
  </ol>
</li>
</ol>
]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-09-01#1583">
  <rss:title>VLDB 2009 Web Scale Data Management Panel (5 of 5)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-09-01T16:24:17Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">&quot;The universe of cycles is not exactly one of literal cycles, but rather one of spirals,&quot; mused Joe Hellerstein of UC Berkeley. &quot;Come on, let&#39;s all drop some ACID,&quot; interjected another. &quot;It is not that we end up repeating the exact same things, rather even if some patterns seem to repeat, they do so at a higher level, enhanced by the experience gained,&quot; continued Joe. Thus did the Web Scale Data Management panel conclude. Whether successive generations are made wiser by the ones that have gone before may be argued either way. The cycle in question was that of developers discovering ACID in the 1960s, i.e. Atomicity, Consistency, Isolation, Durability. Thus did the DBMS come into being. Then DBMSs kept becoming more complex until, as there will be a counter-force to each force, came the meme of key value stores and BASE, no multiple-row transactions, eventual consistency, no query language but scaling to thousands of computers. So now, the DBMS community asks itself what went wrong. In the words of one panelist, another demonstrated a &quot;shocking familiarity with the subject matter of substance abuse&quot; when he called for the DBMS community to get on a 12 step program and to look where addiction to certain ideas, among which ACID, had brought its life. Look at yourself: The influential papers in what ought to be your space by rights are coming from the OS community: Google Bigtable, Amazon Dynamo, want more? When you ought to drive, you give excuses and play catch up! Stop denial, drop SQL, drop ACID! The web developers have revolted against the time-honored principles of the DBMS. This is true. Sharded MySQL is not the ticket — or is it? Must they rediscover the virtues of ACID, just like the previous generation did? Nothing under the sun is new. As in music and fashion, trends keep cycling also in science and engineering. 
But seriously, does the full-featured DBMS scale to web scale? Microsoft says the Azure version of SQL server does. Yahoo says they want no SQL but Hadoop and PNUTS. Twitter, Facebook, and other web names got their own discussion. Why do they not go to serious DBMS vendors for their data but make their own, like Facebook with Hive? Who can divine the mind of the web developer? What makes them go to memcached, manually sharded MySQL, and MapReduce, walking away from the 40 years of technology invested in declarative query and ACID? What is this highly visible but hard to grasp entity? My guess is that they want something they can understand, at least at the beginning. A DBMS, especially on a cluster, is complicated, and it is not so easy to say how it works and how its performance is determined. The big brands, if deployed on a thousand PCs, would also be prohibitively expensive. But if all you do with the DBMS is single row selects and updates, it is no longer so scary, but you end up doing all the distributed things in a middle layer, and abandoning expressive queries, transactions, and database-supported transparency of location. But at least now you know how it works and what it is good/not good for. This would be the case for those who make a conscious choice. But by and large the choice is not deliberate; it is something one drifts into: The application gains popularity; the single LAMP can no longer keep all in memory; you need a second MySQL in the LAMP and you decide that users A–M go left and N–Z right (horizontal partitioning). This siren of sharding beckons you and all is good until you hit the reef of re-architecting. Memcached and duct-tape help, like aspirin helps with hangover, but the root cause of the headache lies unaddressed. The conclusion was that there ought to be something incrementally scalable from the get-go. Low cost of entry and built-in scale-out. 
No, the web developers do not hate SQL; they just have gotten the idea that it does not scale. But they would really wish it to. So, DBMS people, show there is life in you yet. Joe Hellerstein was the philosopher and paradigmatician of the panel. His team had developed a protocol-compatible Hadoop in a few months using a declarative logic programming style approach. His claim was that developers made the market. Thus, for writing applications against web scale data, there would have to be data centric languages. Why not? These are discussed in Berkeley Orders Of Magnitude (BOOM). I come from Lisp myself, way back. I have since abandoned any desire to tell anybody what they ought to program in. This is a bit like religion: Attempting to impose or legislate or ram it on somebody just results in anything from lip service to rejection to war. The appeal exerted by the diverse language/paradigm -isms on their followers seems to be based on hitting a simplification of reality that coincides with a problem in the air. MapReduce is an example of this. PHP is another. A quick fix for a present need: Scripting web servers (PHP) or processing tons of files (MapReduce). The full database is not as quick a fix, even though it has many desirable features. It is also not as easy to tell what happens inside one, so MapReduce may give a greater feeling of control. Totally self-managing, dynamically-scalable RDF would be a fix for not having to design or administer databases: Since it would be indexed on everything, complex queries would be possible; no full database scans would stop everything. For the mid-size segment of web sites this might be a fit. For the extreme ends of the spectrum, the choice is likely something custom built and much less expressive. The BOOM rule language for data-centric programming would be something very easy for us to implement, in fact we will get something of the sort essentially for free when we do the rule support already planned. 
The question is, can one induce web developers to do logic? The history is one of procedures, both in LAMP and MapReduce. On the other hand, the query languages that were ever universally adopted were declarative, i.e., keyword search and SQL. There certainly is a quest for an application model for the cloud space beyond just migrating apps. We&#39;ll see. More on this another time.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<blockquote>
 <p>
  <i>&quot;The universe of cycles is not exactly one of literal cycles, but rather one of spirals,&quot; mused <a href="http://db.cs.berkeley.edu/jmh/" id="link-id117455a0">Joe Hellerstein</a> of UC Berkeley.</i>
 </p>
<p>
  <i>&quot;Come on, let&#39;s all drop some <a href="http://dbpedia.org/resource/ACID" id="link-id16b3db50">ACID</a>,&quot; interjected another.</i>
</p>
<p>
  <i>&quot;It is not that we end up repeating the exact same things, rather even if some patterns seem to repeat, they do so at a higher level, enhanced by the experience gained,&quot; continued Joe.</i>
</p>
</blockquote>

<p>Thus did the Web Scale <a href="http://dbpedia.org/resource/Data" id="link-id11061ae0">Data</a> Management panel conclude.</p>

<p>Whether successive generations are made wiser by the ones that have gone before may be argued either way.</p>

<p>The cycle in question was that of developers discovering ACID in the 1960s, i.e. Atomicity, Consistency, Isolation, Durability.  Thus did the DBMS come into being.  Then DBMSs kept becoming more complex until, as there will be a counter-force to each force, came the <a href="http://dbpedia.org/resource/Meme" id="link-id11076cc8">meme</a> of key value stores and BASE, no multiple-row transactions, eventual consistency, no query language but scaling to thousands of computers.  So now, the DBMS community asks itself what went wrong.</p>

<p>In the words of one panelist, another demonstrated a &quot;shocking familiarity with the subject matter of substance abuse&quot; when he called for the DBMS community to get on a <a href="http://dbpedia.org/resource/Twelve-step_program" id="link-id15d954a8">12 step program</a> and to look where addiction to certain ideas, among which ACID, had brought its life.  Look at yourself: The influential papers in what ought to be your space by rights are coming from the OS community: <a href="http://dbpedia.org/resource/Google" id="link-id166675f0">Google</a> Bigtable, Amazon Dynamo, want more? When you ought to drive, you give excuses and play catch up!  Stop denial, drop <a href="http://dbpedia.org/resource/SQL" id="link-id1105adf0">SQL</a>, drop ACID!</p>

<p>The web developers have revolted against the time-honored principles of the DBMS.  This is true.  Sharded <a href="http://dbpedia.org/resource/MySQL" id="link-id1221c230">MySQL</a> is not the ticket — or is it?  Must they rediscover the virtues of ACID, just like the previous generation did?</p>

<p>Nothing under the sun is new.  As in music and fashion, trends keep cycling also in science and engineering.</p>

<p>But seriously, does the full-featured DBMS scale to web scale?  <a href="http://dbpedia.org/resource/Microsoft" id="link-id10ffcaf8">Microsoft</a> says the Azure version of SQL Server does.  <a href="http://dbpedia.org/resource/Yahoo%21" id="link-id16b3f138">Yahoo</a> says they want no SQL but <a href="http://dbpedia.org/resource/Hadoop" id="link-id11046ef0">Hadoop</a> and <a href="http://research.yahoo.com/node/2304" id="link-id110a0040">PNUTS</a>.</p>

<p>Twitter, Facebook, and other web names got their own discussion.  Why do they not go to serious DBMS vendors for their data but make their own, like Facebook with Hive?</p>

<p>Who can divine the mind of the web developer?  What makes them go to <a href="http://www.danga.com/memcached/" id="link-id1109e280">memcached</a>, manually sharded MySQL, and <a href="http://dbpedia.org/resource/MapReduce" id="link-id1107cd60">MapReduce</a>, walking away from the 40 years of technology invested in declarative query and ACID?  What is this highly visible but hard to grasp <a href="http://dbpedia.org/resource/Entity" id="link-id1105b6b8">entity</a>?  My guess is that they want something they can understand, at least at the beginning.  A DBMS, especially on a cluster, is complicated, and it is not so easy to say how it works and how its performance is determined.  The big brands, if deployed on a thousand PCs, would also be prohibitively expensive.  But if all you do with the DBMS is single row selects and updates, it is no longer so scary, but you end up doing all the distributed things in a middle layer, and abandoning expressive queries, transactions, and database-supported transparency of location.  But at least now you know how it works and what it is good/not good for.</p>

<p>This would be the case for those who make a conscious choice.  But by and large the choice is not deliberate; it is something one drifts into: The application gains popularity; the single <a href="http://en.wikipedia.org/wiki/LAMP_%28software_bundle%29" id="link-iddc68d28">LAMP</a> can no longer keep all in memory; you need a second MySQL in the LAMP and you decide that users A–M go left and N–Z right (horizontal partitioning).  This siren of sharding beckons you and all is good until you hit the reef of re-architecting.  Memcached and duct-tape help, like aspirin helps with a hangover, but the root cause of the headache lies unaddressed.</p>
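<p>To make the drift concrete, here is a minimal Python sketch of the A–M / N–Z routing and of what happens when a third shard has to be added later.  The function names, the usernames, and the letter boundaries are hypothetical; this is the kind of logic that ends up in the application layer, not any particular site&#39;s code.</p>

<pre>
```python
import bisect


def shard_for(username, boundaries):
    """Return the index of the shard whose letter range holds this user.

    boundaries lists the first letter of every shard after the first,
    e.g. ["N"] means shard 0 holds A-M and shard 1 holds N-Z.
    """
    first = username[0].upper()
    return bisect.bisect_right(boundaries, first)


def users_to_move(usernames, old_boundaries, new_boundaries):
    """List the users whose rows must be copied when the split changes."""
    return [
        u for u in usernames
        if shard_for(u, old_boundaries) != shard_for(u, new_boundaries)
    ]


users = ["alice", "bob", "ingrid", "nora", "zed"]
two_way = ["N"]          # A-M left, N-Z right
three_way = ["I", "R"]   # A-H, I-Q, R-Z after growth
moved = users_to_move(users, two_way, three_way)
```
</pre>

<p>Every change of boundaries means copying the rows of the affected users between servers while the site stays up, and any query that spans both sides must be stitched together by hand.</p>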

<p>The conclusion was that there ought to be something incrementally scalable from the get-go.  Low cost of entry and built-in scale-out. No, the web developers do not hate SQL; they just have gotten the idea that it does not scale.  But they would really wish it to.  So, DBMS people, show there is life in you yet.</p>

<p>Joe Hellerstein was the philosopher and paradigmatician of the panel. His team had developed a protocol-compatible Hadoop in a few months using a declarative logic programming style approach.  His claim was that developers made the market.  Thus, for writing applications against web scale data, there would have to be data centric languages.  Why not?  These are discussed in <a href="http://www.eecs.berkeley.edu/Research/Projects/Data/105733.html" id="link-id110ba0e0">Berkeley Orders Of Magnitude</a> (<a href="http://www.eecs.berkeley.edu/Research/Projects/Data/105733.html" id="link-id16aab768">BOOM</a>).</p>

<p>I come from <a href="http://en.wikipedia.org/wiki/Lisp_%28programming_language%29" id="link-id10f2cd68">Lisp</a> myself, way back.  I have since abandoned any desire to tell anybody what they ought to program in.  This is a bit like religion: Attempting to impose or legislate or ram it on somebody just results in anything from lip service to rejection to war.  The appeal exerted by the diverse language/paradigm -isms on their followers seems to be based on hitting a simplification of reality that coincides with a problem in the air.  MapReduce is an example of this. <a href="http://dbpedia.org/resource/PHP" id="link-ide22cdd0">PHP</a> is another.  A quick fix for a present need: Scripting web servers (PHP) or processing tons of files (MapReduce).  The full database is not as quick a fix, even though it has many desirable features.  It is also not as easy to tell what happens inside one, so MapReduce may give a greater feeling of control.</p>

<p>Totally self-managing, dynamically-scalable <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id152864b0">RDF</a> would be a fix for not having to design or administer databases: Since it would be indexed on everything, complex queries would be possible; no full database scans would stop everything.  For the mid-size segment of web sites this might be a fit.  For the extreme ends of the spectrum, the choice is likely something custom built and much less expressive.</p>
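<p>What &quot;indexed on everything&quot; buys can be shown with a toy Python sketch.  It is an illustration only, under the assumption of one in-memory hash index per triple position; the <code>TripleStore</code> class and its sample data are hypothetical and say nothing about how a production RDF store lays out its indices.</p>

<pre>
```python
from collections import defaultdict


class TripleStore:
    """Toy store that indexes every triple by subject, predicate, and object."""

    def __init__(self):
        self.triples = set()
        self.by_s = defaultdict(set)
        self.by_p = defaultdict(set)
        self.by_o = defaultdict(set)

    def add(self, s, p, o):
        triple = (s, p, o)
        self.triples.add(triple)
        self.by_s[s].add(triple)
        self.by_p[p].add(triple)
        self.by_o[o].add(triple)

    def match(self, s=None, p=None, o=None):
        """Return triples matching the bound positions; None is a wildcard."""
        bound = [(self.by_s, s), (self.by_p, p), (self.by_o, o)]
        hits = [index.get(key, set()) for index, key in bound if key is not None]
        if not hits:
            return set(self.triples)  # nothing bound: the only full scan
        return set.intersection(*hits)


store = TripleStore()
store.add("alice", "knows", "bob")
store.add("bob", "knows", "carol")
store.add("alice", "likes", "jazz")
```
</pre>

<p>Whatever position a query binds, an index lookup answers it, so no schema design or index tuning is asked of the developer; the price is the extra storage and write cost of maintaining every index.</p>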

<p>The BOOM rule language for data-centric programming would be something very easy for us to implement; in fact, we will get something of the sort essentially for free when we do the rule support already planned.</p>

<p>The question is, can one induce web developers to do logic?  The history is one of procedures, both in LAMP and MapReduce.  On the other hand, the query languages that were ever universally adopted were declarative, i.e., keyword search and SQL. There certainly is a quest for an application model for the cloud space beyond just migrating apps.  We&#39;ll see.  More on this another time.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-09-01#1581">
  <rss:title>VLDB 2009 Yahoo Keynote (4 of 5)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-09-01T16:04:36Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Raghu Ramakrishnan of Yahoo! gave a keynote about PNUTS, the Yahoo solution for managing massive user data, from front page preferences to mail to social networks. Dynamic scale, wide area replication, and high availability are the issues. Transactions on multiple records, complex queries, and absolute consistency at all times are traded off. Also, the programming interfaces are lower level than with SQL. Replication and consistency rules are choices for the application developer; the platform offers some basic alternatives. Implementation-wise, there is a MySQL back-end and all the partitioning, query routing, replication, and balancing take place in a layer of front-ends. Now what do we say to this? In the Yahoo! case, even if complex queries were possible, which they are not, one would probably keep them off the online system since latency and availability are everything. A latency of some tens of milliseconds is however acceptable, which is not so terrible for single record operations: There is time for a couple of messages on the data center network and even maybe for a disk read. PNUTS is probably the fastest way of getting to the desired beachhead of simple access to data at infinite scale in multiple geographies. In the identical situation, I might have done something similar. But we are in a different situation, concerned with complex queries, a highly-normalized schema-last situation, i.e., index on everything, large objects normalized away, as is done in RDF. Then we are also in the relational situation. Infinite scale, fault tolerance, and wide-area replication do come up regularly in user needs. The applications for which people would like RDF are not only complex reasoning things but very big metadata stores for user generated content, social networks, and the like. Which of the PNUTS principles could we apply? 
Division in tablets: When a partition of the data grows too big, it should split. Migration of partitions: as capacity/demand change, partitions should migrate so as to equalize load. High availability: This is divided in two — on one hand inside the data center; on the other between data centers. Inside the data center, storing partitions in duplicate and running them synchronously is possible. This is manifestly impossible in wide area settings, though. For this, we need a log-shipping style of asynchronous replication. But how does one deal with split networks and transfer of replication mastery? PNUTS determines the master copy record by record. This makes sense when the record, for example, corresponds to a user. For RDF, doing this by the triple would be prohibitive. Doing this by the graph, or by the subject of a set of triples across all graphs, would be better. We would agree with PNUTS that transferring mastery by the storage chunk is not desired, as the chunk will contain arbitrary unrelated data. The eventual consistency mechanisms can be generalized to RDF readily enough. In a social RDF application, the graph is the most likely unit of data ownership and update authorization, so the graph would also be the unit of eventual consistency. Keeping a separate data structure listing recent inserts/deletes to a graph with timestamps would serve for establishing consistency. The size of this would be a small fraction of the size of the graph. RDF cannot do anything without joining between partitions, whereas for PNUTS the join between partitions is an application matter. But then PNUTS does have an extra step of RPC between the PNUTS infrastructure and the back-end. Doing query routing in the back-end gets rid of this. RDF does remain more dependent on even performance and short interconnect latencies, though. It also likely takes more space. 
But the essential consistency and availability features can be generalized to it, providing the merge of semi-structured data at infinite scale and availability with complex query. At any rate, repartitioning-on-demand and partition-migration remain the key agenda items for us, confirmed over and over at VLDB.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>
<a href="http://dbpedia.org/resource/Raghu_Ramakrishnan" id="link-id0x177f3ef8">Raghu Ramakrishnan</a> of <a href="http://dbpedia.org/resource/Yahoo%21" id="link-id0x2a4aad0">Yahoo</a>! gave a keynote about <a href="http://research.yahoo.com/node/2304" id="link-id0x5584570">PNUTS</a>, the Yahoo solution for managing massive user <a href="http://dbpedia.org/resource/Data" id="link-id0x3805628">data</a>, from front page preferences to mail to social networks.</p>

<p>Dynamic scale, wide area replication, and high availability are the issues.  Transactions on multiple records, complex queries, and absolute consistency at all times are traded off.  Also, the programming interfaces are lower level than with <a href="http://dbpedia.org/resource/SQL" id="link-id0x17bfc928">SQL</a>.  Replication and consistency rules are choices for the application developer; the platform offers some basic alternatives.  Implementation-wise, there is a <a href="http://dbpedia.org/resource/MySQL" id="link-id0x1862f7a8">MySQL</a> back-end and all the partitioning, query routing, replication, and balancing take place in a layer of front-ends.</p>

<p>Now what do we say to this?</p>

<p>In the Yahoo! case, even if complex queries were possible, which they are not, one would probably keep them off the online system since latency and availability are everything.  A latency of some tens of milliseconds is however acceptable, which is not so terrible for single record operations:  There is time for a couple of messages on the data center network and even maybe for a disk read.</p>

<p>PNUTS is probably the fastest way of getting to the desired beachhead of simple access to data at infinite scale in multiple geographies. In the identical situation, I might have done something similar.</p>

<p>But we are in a different situation, concerned with complex queries, a highly-normalized <a href="http://dbpedia.org/resource/Database_schema" id="link-id0x25c942e8">schema</a>-last situation, i.e., index on everything, large objects normalized away, as is done in <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x4a3d080">RDF</a>.  Then we are also in the relational situation.  Infinite scale, fault tolerance, and wide-area replication do come up regularly in user needs.  The applications for which people would like RDF are not only complex reasoning things but very big <a href="http://dbpedia.org/resource/Metadata" id="link-id0x19101128">metadata</a> stores for user generated content, social networks, and the like.</p>

<p>Which of the PNUTS principles could we apply?</p>

<ul>
 <li>
  <p>
    <b>Division in tablets:</b>  When a partition of the data grows too big, it should split.</p>
 </li>
<li>
  <p>
    <b>Migration of partitions:</b> as capacity/demand change, partitions should migrate so as to equalize load.</p>
</li>
<li>
  <p>
    <b>High availability:</b> This is divided in two — on one hand inside the data center; on the other between data centers.  Inside the data center, storing partitions in duplicate and running them synchronously is possible.  This is manifestly impossible in wide area settings, though.  For this, we need a log-shipping style of asynchronous replication.  But how does one deal with split networks and transfer of replication mastery?</p>
</li>
</ul>
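
<p>The first two principles can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the <code>Tablet</code> class, the <code>MAX_KEYS</code> threshold, and the <code>rebalance</code> helper are hypothetical names made up for this sketch, not anything from PNUTS or Virtuoso, and load is approximated simply by key count.</p>

```python
from dataclasses import dataclass, field

MAX_KEYS = 4  # hypothetical split threshold, chosen only for illustration


@dataclass
class Tablet:
    keys: list = field(default_factory=list)  # kept sorted

    def insert(self, key):
        self.keys.append(key)
        self.keys.sort()

    def split(self):
        """Split into two tablets at the median key."""
        mid = len(self.keys) // 2
        return Tablet(self.keys[:mid]), Tablet(self.keys[mid:])


def insert_with_split(tablets, index, key):
    """Insert key into tablets[index]; split that tablet if it grows too big."""
    tablet = tablets[index]
    tablet.insert(key)
    if len(tablet.keys) > MAX_KEYS:
        left, right = tablet.split()
        tablets[index:index + 1] = [left, right]


def rebalance(servers):
    """Move one tablet from the most loaded server to the least loaded one."""
    def load(name):
        return sum(len(t.keys) for t in servers[name])

    busiest = max(servers, key=load)
    idlest = min(servers, key=load)
    if busiest != idlest and len(servers[busiest]) > 1:
        servers[idlest].append(servers[busiest].pop())
```

<p>A real system would of course split on size in bytes and migrate on measured demand rather than key count, but the control flow is about this simple.</p>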

<p>PNUTS determines the master copy record by record.  This makes sense when the record, for example, corresponds to a user.  For RDF, doing this by the triple would be prohibitive.  Doing this by the graph, or by the subject of a set of triples across all graphs, would be better.  We would agree with PNUTS that transferring mastery by the storage chunk is not desired, as the chunk will contain arbitrary unrelated data.</p>
<p>The eventual consistency mechanisms can be generalized to RDF readily enough.  In a social RDF application, the graph is the most likely unit of data ownership and update authorization, so the graph would also be the unit of eventual consistency.  Keeping a separate data structure listing recent inserts/deletes to a graph with timestamps would serve for establishing consistency.  The size of this would be a small fraction of the size of the graph.</p>
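
<p>As a rough sketch of such a change log, assume each entry is a tuple of timestamp, operation, and triple, and that conflicts are resolved by replaying entries in timestamp order so that the last writer wins.  Both the entry layout and the conflict rule are assumptions made for this example; the function names are hypothetical.</p>

```python
def merge_logs(log_a, log_b):
    """Merge two change logs for one graph, oldest change first."""
    return sorted(set(log_a) | set(log_b))


def apply_log(triples, log):
    """Replay (timestamp, op, triple) entries on a set of triples.

    Entries are replayed in timestamp order, so for any triple the most
    recent insert or delete decides whether it is present.
    """
    result = set(triples)
    for _timestamp, op, triple in sorted(log):
        if op == "insert":
            result.add(triple)
        else:  # "delete"
            result.discard(triple)
    return result
```

<p>Two replicas of a graph that start from the same state and exchange their logs end up identical after each replays the merged log.  The log only needs to reach back to the last point at which the replicas were known to agree, which is why it stays small relative to the graph.</p>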

<p>RDF cannot do anything without joining between partitions, whereas for PNUTS the join between partitions is an application matter.  But then PNUTS does have an extra step of RPC between the PNUTS infrastructure and the back-end.  Doing query routing in the back-end gets rid of this.  RDF does remain more dependent on even performance and short interconnect latencies, though.  It also likely takes more space.  But the essential consistency and availability features can be generalized to it, providing the merge of semi-structured data at infinite scale and availability with complex query.</p>

<p>At any rate, repartitioning-on-demand and partition-migration remain the key agenda items for us, confirmed over and over at VLDB.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-09-01#1580">
  <rss:title>VLDB 2009 TPC Workshop (3 of 5)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-09-01T15:51:09Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Michael Stonebraker gave the keynote at the TPC workshop. His message was that the TPC, at the venerable age of 21, was already a decade late in reinventing itself. From the height of relevance at the time of the debit/credit benchmark twenty years back, it was slipping into the sunset of irrelevance unless it paid attention. Now we are great fans of the TPC and while we have not published results by the TPC book, we have extensively used TPC material for guiding optimization, as has pretty much everybody else. It is true that the rules encourage unrealistic configurations. The emphasis on random access from disk that is built into the rules leads to disk configurations that are very improbable in practice, such as 1PB of disks for 3TB of data, just so there are enough disk arms in parallel. Stonebraker also pointed out that replication and failover were ubiquitous in real life and that roll forward from logs was unrealistic as a recovery model since it took so long. Benchmarks should therefore include replication. Further, Stonebraker challenged the TPC to go for the new frontier, which he described as the huge data sets in science and on big web sites. Scientists, the ones who would save our planet from the diverse ills confronting it, do not like relational databases. They avoid them when they can. They want arrays for physics, and graphs for biology and chemistry. MapReduce is eating database&#39;s lunch; what will you do about this? I later suggested incorporating an RDF metadata benchmark into the TPC suite. We&#39;ll see about this; we&#39;ll first have to come up with a suitable one. There is a great deal of pressure for making good RDF benchmarks but this is not yet in the center of the mainstream that TPC tends to cover. TPC&#39;s own talk was about the life cycle of benchmarks. A benchmark begins a bit ahead of the mainstream, with a problem that is difficult but not so difficult as to be uncommon. 
When the solution to this problem becomes commonplace, the benchmark&#39;s relevance gradually drops. There was a talk on robustness of query plans which was well to the point. Indeed, there are performance cliffs at certain points; for example, when passing from memory-only to disk-pageable data structures, or when switching from indexed access to table scans, or from loop to hash joins. Quite so. The analysis I really would have liked to see would have been one of what happens when passing from single server to a cluster, and from local joins to cross-partition ones. Also contrasting of cache fusion and partitioning. We have our own data and experience but we find we don&#39;t have time to measure all the other systems. Anyway it is good to raise the question of smooth and predictable performance.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>Michael <a href="http://dbpedia.org/resource/Michael_Stonebraker" id="link-id0x1641ef70">Stonebraker</a> gave the keynote at the <a href="http://www.tpc.org/" id="link-id0x554d380">TPC</a> workshop.  His message was that the TPC, at the venerable age of 21, was already a decade late in reinventing itself.  From the height of relevance at the time of the debit/credit benchmark twenty years back, it was slipping into the sunset of irrelevance unless it paid attention.</p>

<p>Now we are great fans of the TPC and while we have not published results by the TPC book, we have extensively used TPC material for guiding <a href="http://dbpedia.org/resource/Program_optimization" id="link-id0x16475bd8">optimization</a>, as has pretty much everybody else.</p>

<p>It is true that the rules encourage unrealistic configurations.  The emphasis on random access from disk that is built into the rules leads to disk configurations that are very improbable in practice, such as 1PB of disks for 3TB of <a href="http://dbpedia.org/resource/Data" id="link-id0x18f0b720">data</a>, just so there are enough disk arms in parallel.  Stonebraker also pointed  out that replication and failover were ubiquitous in real life and that roll forward from logs was unrealistic as a recovery model since it took so long.  Benchmarks should therefore include replication.</p>

<p>Further, Stonebraker challenged the TPC to go for the new frontier, which he described as the huge data sets in science and on big web sites.  Scientists, the ones who would save our planet from the diverse ills confronting it, do not like relational databases.  They avoid them when they can.  They want arrays for physics, and graphs for biology and chemistry.  <a href="http://dbpedia.org/resource/MapReduce" id="link-id0x150376a8">MapReduce</a> is eating database&#39;s lunch; what will you do about this?</p>

<p>I later suggested incorporating an <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x41cd4c0">RDF</a> <a href="http://dbpedia.org/resource/Metadata" id="link-id0x15904698">metadata</a> benchmark into the TPC suite.  We&#39;ll see about this; we&#39;ll first have to come up with a suitable one.  There is a great deal of pressure for making good RDF benchmarks but this is not yet in the center of the mainstream that TPC tends to cover.</p>

<p>TPC&#39;s own talk was about the life cycle of benchmarks.  A benchmark begins a bit ahead of the mainstream, with a problem that is difficult but not so difficult as to be uncommon.  When the solution to this problem becomes commonplace, the benchmark&#39;s relevance gradually drops.</p>

<p>There was a talk on robustness of query plans which was well to the point.  Indeed, there are performance cliffs at certain points; for example, when passing from memory-only to disk-pageable data structures, or when switching from indexed access to table scans, or from loop to hash joins.  Quite so.  The analysis I really would have liked to see would have been one of what happens when passing from single server to a cluster, and from local joins to cross-partition ones.  A contrast of <a href="http://dbpedia.org/resource/Cache" id="link-id0x16dd6710">cache</a> fusion and partitioning would also have been welcome.  We have our own data and experience but we find we don&#39;t have time to measure all the other systems.</p>

<p>Anyway it is good to raise the question of smooth and predictable performance.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-09-01#1579">
  <rss:title>Some Interesting VLDB 2009 Papers (2 of 5)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-09-01T15:46:14Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">Intel on Hash Join Intel and Oracle had measured hash and sort merge joins on Intel Core i7. The result was that hash join with both tables partitioned to match CPU cache was still the best but that sort/merge would catch up with more SIMD instructions in the future. We should probably experiment with this but the most important partitioning of hash joins is still between cluster nodes. Within the process, we will see. The tradeoff of doing all in cache-sized partitions is larger intermediate results which in turn will impact the working set of disk pages in RAM. For one-off queries this is OK; for online use this has an effect. 1000 TABLE Queries SAP presented a paper about federating relational databases. Queries would be expressed against VIEWs defined over remote TABLEs, UNIONed together and so forth. Traditional methods of optimization would run out of memory; a single 1000 TABLE plan is already a big thing. Enumerating multiple variations of such is not possible in practice. So the solution was to plan in two stages — first arrange the subqueries and derived TABLEs, and then do the JOIN orders locally. Further, local JOIN orders could even be adjusted at run time based on the actual data. Nice. Oracle Subqueries and New Implementation of LOBs Oracle presented some new SQL optimizations, combining and inlining subqueries and derived TABLEs. We do fairly similar things and might extend the repertoire of tricks in the direction outlined by Oracle as and when the need presents itself. This further confirms that SQL and other query optimization is really an incremental collection of specially recognized patterns. We still have not found any other way of doing it. Another interesting piece by Oracle was about their re-implementation of large object support, where they compared LOB loading to file system and raw device speeds. 
Amadeus CRS booking system, steady query time for arbitrary single table queries There was a paper about a memory-resident database that could give steady time for any kind of single-table scan query. The innovation was to not use indices, but to have one partition of the table per processor core, all in memory. Then each core would have exactly two cursors — one reading, the other writing. The write cursor should keep ahead of the read cursor. Like this, there would be no read/write contention on pages, no locking, no multiple threads splitting a tree at different points, none of the complexity of a multithreaded database engine. Then, when the cursor would hit a row, it would look at the set of queries or updates and add the result to the output if there was a result. The data indexes the queries, not the other way around. We have done something similar for detecting changes in a full text corpus but never thought of doing queries this way. Well, we are all about JOINs so this is not for us, but it deserves a mention for being original and clever. And indeed, anything one can ask about a table will likely be served with great predictability. Greenplum Google&#39;s chief economist said that the winning career choice would be to pick a scarce skill that made value from something that was plentiful. For the 2010s this career is that of the statistician/data analyst. We&#39;ve said it before — the next web is analytics for all. The Greenplum talk was divided between the Fox use case, with 200TB of data about ads, web site traffic, and other things, growing 5TB a day. The message was that cubes and drill down are passé, that it is about complex statistical methods that have to run in the database, that the new kind of geek is the data geek, whose vocation it is to consume and spit out data, discover things in it, and so forth. The technical part was about Greenplum, a SQL database running on a cluster with a PostgreSQL back-end. 
The interesting points were embedding MapReduce into SQL, and using relational tables for arrays and complex data types — pretty much what we also do. Greenplum emphasized scale-out and found column orientation more like a nice-to-have. MonetDB, optimizing database for CPU cache The MonetDB people from CWI in Amsterdam gave a 10 year best paper award talk about optimizing database for CPU cache. The key point was that if data is stored as columns, it ought also to be transferred as columns inside the execution engine. Materialize big chunks of state to cut down on interpretation overhead and use cache to best effect. They vector for CPU cache; we vector for scale-out, since the only way to ship operations is to ship many at a time. So we might as well vector also in single servers. This could be worth an experiment. Also we regularly visit the topic of column storage. But we are not yet convinced that it would be better than row-style covering indices for RDF quads. But something could certainly be tried, given time.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<h3>
<a href="http://dbpedia.org/resource/Intel_Corporation" id="link-id0x449c5e0">Intel</a> on <a href="http://dbpedia.org/resource/Hash_join" id="link-id0x4e82430">Hash Join</a>
</h3>

<p>Intel and <a href="http://dbpedia.org/resource/Oracle_Database" id="link-id0x10bae5e8">Oracle</a> had measured hash and sort merge joins on Intel Core i7.  The result was that hash join with both tables partitioned to match <a href="http://dbpedia.org/resource/Central_processing_unit" id="link-id0x3827798">CPU</a> <a href="http://dbpedia.org/resource/Cache" id="link-id0x2545b978">cache</a> was still the best but that sort/merge would catch up with more <a href="http://dbpedia.org/resource/SIMD" id="link-id0x32f4e40">SIMD</a> instructions in the future.</p>

<p>We should probably experiment with this but the most important partitioning of hash joins is still between cluster nodes.  Within the process, we will see.  The tradeoff of doing all in cache-sized partitions is larger intermediate results which in turn will impact the working set of disk pages in RAM.  For one-off queries this is OK; for online use this has an effect.</p>
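
<p>For readers who have not seen the technique, a partitioned hash join can be sketched as follows.  This is a generic textbook version in Python, not the code measured in the paper: rows are plain tuples, the partition count is an arbitrary parameter rather than something derived from cache size, and all function names are made up for the example.</p>

```python
from collections import defaultdict


def partition(rows, key_index, n_partitions):
    """Distribute rows into n_partitions buckets by hash of the join key."""
    parts = [[] for _ in range(n_partitions)]
    for row in rows:
        parts[hash(row[key_index]) % n_partitions].append(row)
    return parts


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Plain hash join: build a table on one input, probe with the other."""
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)
    return [b + p for p in probe_rows for b in table.get(p[probe_key], [])]


def partitioned_hash_join(left, right, left_key, right_key, n_partitions=4):
    """Partition both inputs the same way, then join matching partitions."""
    out = []
    left_parts = partition(left, left_key, n_partitions)
    right_parts = partition(right, right_key, n_partitions)
    for l_part, r_part in zip(left_parts, right_parts):
        out.extend(hash_join(l_part, r_part, left_key, right_key))
    return out
```

<p>Rows with equal keys always land in the same partition pair, so each small join is independent.  The partitioned copies of both inputs are the larger intermediate results referred to above.</p>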

<h3>1000 TABLE Queries</h3>

<p>
<a href="http://dbpedia.org/resource/SAP_AG" id="link-id0x4ed7710">SAP</a> presented a paper about <a href="http://dbpedia.org/resource/Federated_database_system" id="link-id0x26827fd8">federating relational databases</a>.  Queries would be expressed against VIEWs defined over remote TABLEs, UNIONed together and so forth.  Traditional methods of <a href="http://dbpedia.org/resource/Program_optimization" id="link-id0x3838888">optimization</a> would run out of memory; a single 1000 TABLE plan is already a big thing. Enumerating multiple variations of such is not possible in practice. So the solution was to plan in two stages — first arrange the subqueries and derived TABLEs, and then do the JOIN orders locally. Further, local JOIN orders could even be adjusted at run time based on the actual <a href="http://dbpedia.org/resource/Data" id="link-id0x26033030">data</a>.  Nice.</p>

<h3>Oracle Subqueries and New Implementation of LOBs</h3>

<p>Oracle presented some new <a href="http://dbpedia.org/resource/SQL" id="link-id0x23a0eb48">SQL</a> optimizations, combining and inlining subqueries and derived TABLEs.  We do fairly similar things and might extend the repertoire of tricks in the direction outlined by Oracle as and when the need presents itself.  This further confirms that SQL and other query optimization is really an incremental collection of specially recognized patterns. We still have not found any other way of doing it.</p>

<p>Another interesting piece by Oracle was about their re-implementation of large object support, where they compared LOB loading to file system and raw device speeds.</p>


<h3>
<a href="http://dbpedia.org/resource/Amadeus_CRS" id="link-id0x1566d470">Amadeus CRS</a> booking system, steady query time for arbitrary single table queries</h3>

<p>There  was a paper about a memory-resident database that could give steady time for any kind of single-table scan query.  The innovation was to not use indices, but to have one partition of the table per processor core, all in memory.  Then each core would have exactly two cursors — one reading, the other writing.  The write cursor should keep ahead of the read cursor.  Like this, there would be no read/write contention on pages, no locking, no multiple threads splitting a tree at different points, none of the complexity of a multithreaded database engine. Then, when the cursor would hit a row, it would look at the set of queries or updates and add the result to the output if there was a result.  The data indexes the queries, not the other way around.  We have done something similar for detecting changes in a full text corpus but never thought of doing queries this way.</p>

<p>Well, we are all about JOINs so this is not for us, but it deserves a mention for being original and clever.  And indeed, anything one can ask about a table will likely be served with great predictability.</p>
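
<p>The idea of the data indexing the queries can be sketched like this.  It is a single-partition toy in Python based only on the description above; the function name, the representation of updates as pairs of functions, and the use of dictionaries for rows are all assumptions of the sketch, not details from the paper.</p>

```python
def shared_scan(rows, updates, queries):
    """One pass over a table partition serving a whole batch of requests.

    updates: list of (matches, change) pairs of functions.
    queries: dict mapping a query name to a predicate function.
    Pending updates are applied to each row before the queries see it,
    which mimics the write cursor staying ahead of the read cursor.
    """
    results = {name: [] for name in queries}
    for i, row in enumerate(rows):
        for matches, change in updates:
            if matches(row):
                row = change(row)
        rows[i] = row
        for name, predicate in queries.items():
            if predicate(row):
                results[name].append(row)
    return results
```

<p>The cost of a pass depends on the table size and the number of queued requests, not on what any individual query asks, which is where the steady response time comes from.</p>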

<h3>
<a href="http://dbpedia.org/resource/Greenplum" id="link-id0x196b0538">Greenplum</a>
</h3>

<p>
<a href="http://dbpedia.org/resource/Google" id="link-id0x108f8878">Google</a>&#39;s chief economist said that the winning career choice would be to pick a scarce skill that made value from something that was plentiful.  For the 2010s this career is that of the statistician/data analyst.  We&#39;ve said it before: the next web is analytics for all.  The Greenplum talk was divided between a use case and a technical part.  The use case was Fox, with 200TB of data about ads, web site traffic, and other things, growing 5TB a day.  The message was that cubes and drill down are passé, that it is about complex statistical methods that have to run in the database, that the new kind of geek is the data geek, whose vocation it is to consume and spit out data, discover things in it, and so forth.</p>

<p>The technical part was about Greenplum, a SQL database running on a cluster with a <a href="http://dbpedia.org/resource/PostgreSQL" id="link-id0x3106d00">PostgreSQL</a> back-end.  The interesting points were embedding <a href="http://dbpedia.org/resource/MapReduce" id="link-id0x17968370">MapReduce</a> into SQL, and using relational tables for arrays and complex data types — pretty much what we also do.  Greenplum emphasized scale-out and found column orientation more like a nice-to-have.</p>

<h3>
<a href="http://dbpedia.org/resource/MonetDB" id="link-id0x119f7948">MonetDB</a>, optimizing database for CPU cache</h3>

<p>The MonetDB people from <a href="http://dbpedia.org/resource/National_Research_Institute_for_Mathematics_and_Computer_Science" id="link-id0x3617658">CWI</a> in Amsterdam gave a 10-year best paper award talk about optimizing databases for CPU cache.  The key point was that if data is stored as columns, it ought also to be transferred as columns inside the execution engine.  Materialize big chunks of state to cut down on interpretation overhead and use cache to best effect.  They vector for CPU cache; we vector for scale-out, since the only way to ship operations is to ship many at a time.  So we might as well vector also in single servers.  This could be worth an experiment.  Also we regularly visit the topic of <a href="http://dbpedia.org/resource/Column-oriented_DBMS" id="link-id0x38d43d8">column storage</a>.  We are not yet convinced that it would be better than row-style covering indices for <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x17e25760">RDF</a> quads, but something could certainly be tried, given time.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-09-01#1578">
  <rss:title>VLDB 2009 (1 of 5)</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-09-01T15:30:37Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">I was at the VLDB 2009 conference in Lyon, France. I will in the next few posts discuss some of the prominent themes and how they relate to our products or to RDF and Linked Data. Firstly, RDF was as good as absent from the presentations and discussions we saw. There were a few mentions in the panel on structured data on the web, however RDF was not in any way seen to be essential for this. There were also a couple of RDF mentions in questions at other sessions, but that was about it. It is a common perception that RDF and database people do not talk with each other. Evidence seems to bear this out. As a database developer I did get a lot of readily applicable ideas from the VLDB talks. These run across the whole range of DBMS topics, from key compression and SQL optimization, to column storage, CPU cache optimization, and the like. In this sense, VLDB is directly relevant to all we do. In a conversation, someone was mildly confused that I should on one hand mention I was doing RDF, and on the other hand also be concerned about database performance. These things are not seen to belong together, even though making RDF do something useful certainly depends on a great deal of database optimization. The question of all questions — that of infinite scale-out with complex queries, resilience, replication, and full database semantics — was strongly in the air. But it was in the air more as a question than as an answer. Not very much at all was said about the performance of distributed query plans, of 2pc (two-phase commit), of the impact of interconnect latency, and such things. On the other hand, people were talking quite liberally about optimizing CPU cache and local multi-core execution, not to mention SQL plans and rewrites. Also, almost nothing was said about transactions. Still, there is bound to be a great deal of work in scale-out of complex workloads by any number of players. 
Either these things are all figured out and considered self-evidently trivial, or they are so hot that people will go there only by way of allusion and vague reference. I think it is the latter. By and large, we were confirmed in our understanding that infinite scale-out on the go, with redundancy, is the ticket, especially if one can offer complex queries and transactional semantics coupled with instant data loading and schema-last. Column storage and cache optimizations seem to come right after these. Certainly the database space is diversifying. MapReduce was discussed quite a bit, as an intruder into what would be the database turf. We have no great problem with MapReduce; we do that in SQL procedures if one likes to program in this way. Greenplum also seems to have come by the same idea. As said before, RDF and RDF reasoning were ignored. Do these actually offer something to the database side? Certainly for search, discovery, integration, and resource discovery, linked data has evident advantages. Two points of the design space — the warehouse, and the web-scale key-value store — got a lot of attention. Would I do either in RDF? RDF is a slightly different design space point, like key-value with complex queries — on the surface, a fusion of the two. As opposed to RDF, the relational warehouse gains from fixed data-types and task-specific layout, whether row or column. The key-value store gains from having a concept of a semi-structured record, a bit like the RDF subject of a triple, but now with ad-hoc (if any) secondary indices, and inline blobs. The latter is much simpler and more compact than the generic RDF subject with graphs and all, and can be easily treated as a unit of version control and replication mastering. RDF, being more generic and more normalized, is representationally neither as ad-hoc nor as compact. But RDF will be the natural choice when complex queries and ad-hoc schema meet, for example in web-wide integrations of application data. 
There seems to be a huge divide in understanding between database-developing people and those who would be using databases. On one side, this has led to a back-to-basics movement with no SQL, no ACID, key-value pairs instead of schema, MapReduce instead of fancy but hard-to-follow parallel execution plans. On the other side, the database space specializes more and more; it is no longer simply transactions vs. analytics, but many more points of specialization. Some frustration can be sensed in the ivory towers of science when it is seen that the ones most in need of database understanding in fact have the least. Google, Yahoo!, and Microsoft know what they are doing, with or without SQL, but the medium-size or fast-growing web sites seem to be in confusion when LAMP or Ruby or the scripting-du-jour can no longer cut it. Can somebody using a database be expected to understand how it works? I would say no, not in general. Can a database be expected to unerringly self-configure based on workload? Sure, a database can suggest layouts, but it ought not restructure itself on the spur of the moment under full load. It is safe to say that the community at large no longer believes in &quot;one size fits all&quot;. Since there is no general solution, there is a fragmented space of specific solutions. We will be looking at some of these issues in the following posts.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>I was at the <a href="http://vldb2009.org/" id="link-id0x77dd108">VLDB 2009</a> conference in Lyon, France. I will in the next few posts discuss some of the prominent themes and how they relate to our products or to <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x1a765238">RDF</a> and <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x6966070">Linked Data</a>.</p>

<p>Firstly, RDF was as good as absent from the presentations and discussions we saw.  There were a few mentions in the panel on structured <a href="http://dbpedia.org/resource/Data" id="link-id0x3a536e8">data</a> on the web; however, RDF was not in any way seen as essential for this. There were also a couple of RDF mentions in questions at other sessions, but that was about it.</p>

<p>It is a common perception that RDF and database people do not talk with each other.  Evidence seems to bear this out.</p>


<p>As a database developer I did get a lot of readily applicable ideas from the VLDB talks.  These run across the whole range of DBMS topics, from <a href="http://dbpedia.org/resource/Data_compression" id="link-id0x6302f60">key compression</a> and <a href="http://dbpedia.org/resource/SQL" id="link-id0x69163c0">SQL</a> <a href="http://dbpedia.org/resource/Program_optimization" id="link-id0x63a5cf0">optimization</a>, to <a href="http://dbpedia.org/resource/Column-oriented_DBMS" id="link-id0x1b56daf8">column storage</a>, <a href="http://dbpedia.org/resource/Central_processing_unit" id="link-id0x57c6168">CPU</a> <a href="http://dbpedia.org/resource/Cache" id="link-id0x1c504710">cache</a> optimization, and the like.  In this sense, VLDB is directly relevant to all we do.  In a conversation, someone was mildly confused that I should on one hand mention I was doing RDF, and on the other hand also be concerned about database performance.  These things are not seen to belong together, even though making RDF do something useful certainly depends on a great deal of database optimization.</p>

<p>The question of all questions — that of infinite scale-out with complex queries, resilience, replication, and full database semantics — was strongly in the air.</p>

<p>But it was in the air more as a question than as an answer.  Not very much at all was said about the performance of distributed query plans, of <a href="http://dbpedia.org/resource/Two-phase_commit_protocol" id="link-id0x637c6b0">2pc</a> (<a href="http://dbpedia.org/resource/Two-phase_commit_protocol" id="link-id0x69386a8">two-phase commit</a>), of the impact of interconnect latency, and such things.  On the other hand, people were talking quite liberally about optimizing CPU cache and local multi-core execution, not to mention SQL plans and rewrites.  Also, almost nothing was said about transactions.</p>

<p>Still, there is bound to be a great deal of work in scale-out of complex workloads by any number of players. Either these things are all figured out and considered self-evidently trivial, or they are so hot that people will go there only by way of allusion and vague reference.  I think it is the latter.</p>

<p>By and large, we were confirmed in our understanding that infinite scale-out on the go, with redundancy, is the ticket, especially if one can offer complex queries and transactional semantics coupled with instant data loading and <a href="http://dbpedia.org/resource/Database_schema" id="link-id0x7f90a20">schema</a>-last.</p>

<p>Column storage and cache optimizations seem to come right after these.</p>

<p>Certainly the database space is diversifying.</p>

<p>
<a href="http://dbpedia.org/resource/MapReduce" id="link-id0x485bd40">MapReduce</a> was discussed quite a bit, as an intruder into what would be the database turf.  We have no great problem with MapReduce; we do that in SQL procedures if one likes to program in this way.  <a href="http://dbpedia.org/resource/Greenplum" id="link-id0x7cc58c8">Greenplum</a> also seems to have come by the same idea.</p>

<p>As said before, RDF and RDF reasoning were ignored.  Do these actually offer something to the database side?  Certainly for search, integration, and resource discovery, linked data has evident advantages.</p>

<p>Two points of the design space — the warehouse, and the web-scale key-value store — got a lot of attention.  Would I do either in RDF? RDF is a slightly different design space point, like key-value with complex queries — on the surface, a fusion of the two.  As opposed to RDF, the relational warehouse gains from fixed data-types and task-specific layout, whether row or column.  The key-value store gains from having a concept of a semi-structured record, a bit like the RDF subject of a triple, but now with ad-hoc (if any) secondary indices, and inline blobs.  The latter is much simpler and more compact than the generic RDF subject with graphs and all, and can be easily treated as a unit of version control and replication mastering.  RDF, being more generic and more normalized, is representationally neither as ad-hoc nor as compact.</p>

<p>But RDF will be the natural choice when complex queries and ad-hoc schema meet, for example in web-wide integrations of application data.</p>

<p>There seems to be a huge divide in understanding between database-developing people and those who would be using databases.  On one side, this has led to a back-to-basics movement with no SQL, no <a href="http://dbpedia.org/resource/ACID" id="link-id0x6390650">ACID</a>, key-value pairs instead of schema, MapReduce instead of fancy but hard-to-follow parallel execution plans.  On the other side, the database space specializes more and more; it is no longer simply transactions vs. analytics, but many more points of specialization.</p>

<p>Some frustration can be sensed in the ivory towers of science when it is seen that the ones most in need of database understanding in fact have the least.  <a href="http://dbpedia.org/resource/Google" id="link-id0x1af4e7e0">Google</a>, <a href="http://dbpedia.org/resource/Yahoo%21" id="link-id0x75145d8">Yahoo</a>!, and <a href="http://dbpedia.org/resource/Microsoft" id="link-id0x17bd7d90">Microsoft</a> know what they are doing, with or without SQL, but the medium-size or fast-growing web sites seem to be in confusion when <a href="http://en.wikipedia.org/wiki/LAMP_%28software_bundle%29" id="link-id0x1bf238e0">LAMP</a> or <a href="http://dbpedia.org/resource/Ruby_programming_language" id="link-id0x6ca3848">Ruby</a> or the scripting-du-jour can no longer cut it.</p>

<p>Can somebody using a database be expected to understand how it works? I would say no, not in general.  Can a database be expected to unerringly self-configure based on workload?  Sure, a database can suggest layouts, but it ought not restructure itself on the spur of the moment under full load.</p>

<p>It is safe to say that the community at large no longer believes in &quot;one size fits all&quot;.  Since there is no general solution, there is a fragmented space of specific solutions.  We will be looking at some of these issues in the following posts.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-09-01#1573">
  <rss:title>Provenance and Reification in Virtuoso</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-09-01T14:44:08Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">These days, data provenance is a big topic across the board, ranging from the linked data web, to RDF in general, to any kind of data integration, with or without RDF. Especially with scientific data we encounter the need for metadata and provenance, repeatability of experiments, etc. Data without context is worthless, yet the producers of said data do not always have a model or budget for metadata. And if they do, the approach is often a proprietary relational schema with web services in front. RDF and linked data principles could evidently be a great help. This is a large topic that goes into the culture of doing science and will deserve a more extensive treatment down the road. For now, I will talk about possible ways of dealing with provenance annotations in Virtuoso at a fairly technical level. If data comes many-triples-at-a-time from some source (e.g., library catalogue, user of a social network), then it is often easiest to put the data from each source/user into its own graph. Annotations can then be made on the graph. The graph IRI will simply occur as the subject of a triple in the same or some other graph. For example, all such annotations could go into a special annotations graph. On the query side, having lots of distinct graphs does not have to be a problem if the index scheme is the right one, i.e., the 4 index scheme discussed in the Virtuoso documentation. If the query does not specify a graph, then triples in any graph will be considered when evaluating the query. One could write queries like — SELECT ?pub WHERE { GRAPH ?g { ?person foaf:knows ?contact } ?contact foaf:name &quot;Alice&quot; . ?g xx:has_publisher ?pub } This would return the publishers of graphs that assert that somebody knows Alice. Of course, the RDF reification vocabulary can be used as-is to say things about single triples. It is however very inefficient and is not supported by any specific optimization. 
Further, reification does not seem to get used very much; thus there is no great pressure to specially optimize it. If we have to say things about specific triples and this occurs frequently (i.e., for more than 10% or so of the triples), then modifying the quad table becomes an option. For all its inefficiency, the RDF reification vocabulary is applicable if reification is a rarity. Virtuoso&#39;s RDF_QUAD table can be altered to have more columns. The problem with this is that space usage is increased and the RDF loading and query functions will not know about the columns. A SQL update statement can be used to set values for these additional columns if one knows the G,S,P,O. Suppose we annotated each quad with the user who inserted it and a timestamp. These would be columns in the RDF_QUAD table. The next choice would be whether these were primary key parts or dependent parts. If primary key parts, these would be non-NULL and would occur on every index. The same quad would exist for each distinct user and time this quad had been inserted. For loading functions to work, these columns would need a default. In practice, we think that having such metadata as a dependent part is more likely, so that G,S,P,O are the unique identifier of the quad. Whether one would then include these columns on indices other than the primary key would depend on how frequently they were accessed. In SPARQL, one could use an extension syntax like — SELECT * WHERE { ?person foaf:knows ?connection OPTION ( time ?ts ) . ?connection foaf:name &quot;Alice&quot; . FILTER ( ?ts &gt; &quot;2009-08-08&quot;^^xsd:datetime ) } This would return everybody who knows Alice since a date more recent than 2009-08-08. This presupposes that the quad table has been extended with a datetime column. The OPTION (time ?ts) syntax is not presently supported but we can easily add something of the sort if there is user demand for it. 
In practice, this would be an extension mechanism enabling one to access extension columns of RDF_QUAD via a column ?variable syntax in the OPTION clause. If quad metadata were not for every quad but still relatively frequent, another possibility would be making a separate table with a key of GSPO and a dependent part of R, where R would be the reification URI of the quad. Reification statements would then be made with R as a subject. This would be more compact than the reification vocabulary and would not modify the RDF_QUAD table. The syntax for referring to this could be something like — SELECT * WHERE { ?person foaf:knows ?contact OPTION ( reify ?r ) . ?r xx:assertion_time ?ts . ?contact foaf:name &quot;Alice&quot; . FILTER ( ?ts &gt; &quot;2008-8-8&quot;^^xsd:datetime ) } We could even recognize the reification vocabulary and convert it into the reify option if this were really necessary. But since it is so unwieldy I don&#39;t think there would be huge demand. Who knows? You tell us.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>These days, <a href="http://dbpedia.org/resource/Data" id="link-id0x37019c8">data</a> provenance is a big topic across the board, ranging from the <a href="http://dbpedia.org/resource/Linked_Data" id="link-id0x53c3620">linked data</a> <a href="http://dbpedia.org/resource/Giant_Global_Graph" id="link-id0x4aa3848">web</a>, to <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x385aff0">RDF</a> in general, to any kind of data integration, with or without RDF.  Especially with scientific data we encounter the need for metadata and provenance, repeatability of experiments, etc.  Data without context is worthless, yet the producers of said data do not always have a model or budget for metadata.  And if they do, the approach is often a proprietary relational schema with web services in front.</p>

<p>RDF and linked data principles could evidently be a great help.  This is a large topic that goes into the culture of doing science and will deserve a more extensive treatment down the road.</p>

<p>For now, I will talk about possible ways of dealing with provenance annotations in <a href="http://virtuoso.openlinksw.com" id="link-id0x51c4da0">Virtuoso</a> at a fairly technical level.</p>

<p>If data comes many-triples-at-a-time from some source (e.g., library catalogue, user of a social network), then it is often easiest to put the data from each source/user into its own graph.  Annotations can then be made on the graph.  The graph IRI will simply occur as the subject of a triple in the same or some other graph.  For example, all such annotations could go into a special annotations graph.</p>
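
<p>As a minimal sketch of such an annotation, written in standard SPARQL 1.1 Update syntax: the graph IRIs, the publisher IRI, and the <code>xx:</code> namespace below are all made-up placeholders, and <code>xx:has_publisher</code> stands for whatever provenance property one settles on.</p>

<blockquote>
 <code><pre>PREFIX xx: &lt;http://example.org/provenance#&gt;

# Hypothetical IRIs throughout; only the pattern matters.
# The source graph's IRI is the subject of a triple
# stored in a separate annotations graph.
INSERT DATA
  {
    GRAPH  &lt;http://example.org/graphs/annotations&gt;
      {
        &lt;http://example.org/graphs/library-catalogue&gt;
            xx:has_publisher  &lt;http://example.org/publishers/city-library&gt; .
      }
  }</pre>
 </code>
</blockquote>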

<p>On the query side, having lots of distinct graphs does not have to be a problem if the index scheme is the right one, i.e., the 4-index scheme <a href="http://docs.openlinksw.com/virtuoso/rdfperformancetuning.html#rdfperfindexes" id="link-id142a0798">discussed in the Virtuoso documentation</a>.  If the query does not specify a graph, then triples in any graph will be considered when evaluating the query.</p>


<p>One could write queries like —</p>

<blockquote>
 <code><pre>SELECT  ?pub 
  WHERE 
    { 
      GRAPH  ?g 
        { 
          ?person  foaf:knows  ?contact 
        } 
      ?contact  foaf:name         &quot;Alice&quot;  . 
      ?g        xx:has_publisher  ?pub 
    }</pre>
 </code>
</blockquote>

<p>This would return the publishers of graphs that assert that somebody knows Alice.</p>

<p>Of course, the <a href="http://www.w3.org/TR/2004/REC-rdf-primer-20040210/#reification" id="link-id14fa9488">RDF reification vocabulary</a> can be used as-is to say things about single triples.  It is however very inefficient and is not supported by any specific optimization.  Further, reification does not seem to get used very much; thus there is no great pressure to specially optimize it.</p>
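
<p>To make the cost concrete, here is a sketch of a query over data annotated with the standard reification vocabulary.  The <code>rdf:</code> terms are the real ones; the <code>xx:</code> namespace and <code>xx:assertion_time</code> are placeholders, and the data layout is an assumption for illustration only.</p>

<blockquote>
 <code><pre>PREFIX rdf:  &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt;
PREFIX foaf: &lt;http://xmlns.com/foaf/0.1/&gt;
PREFIX xsd:  &lt;http://www.w3.org/2001/XMLSchema#&gt;
PREFIX xx:   &lt;http://example.org/provenance#&gt;

SELECT  ?person  ?ts
  WHERE
    {
      # the statement itself
      ?person   foaf:knows         ?contact    .
      ?contact  foaf:name          &quot;Alice&quot;     .
      # the reified description of that statement
      ?stmt     rdf:type           rdf:Statement .
      ?stmt     rdf:subject        ?person     .
      ?stmt     rdf:predicate      foaf:knows  .
      ?stmt     rdf:object         ?contact    .
      # the annotation proper (placeholder property)
      ?stmt     xx:assertion_time  ?ts         .
      FILTER ( ?ts &gt; &quot;2009-08-08T00:00:00&quot;^^xsd:dateTime )
    }</pre>
 </code>
</blockquote>

<p>Each annotated triple needs four extra triples before the annotation itself can be attached, and the query pays for them with as many extra joins.</p>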

<p>If we have to say things about specific triples and this occurs frequently (i.e., for more than 10% or so of the triples), then modifying the quad table becomes an option. For all its inefficiency, the RDF reification vocabulary is applicable if reification is a rarity.</p>

<p>Virtuoso&#39;s <code>RDF_QUAD</code> table can be altered to have more columns.  The problem with this is that space usage is increased and the RDF loading and query functions will not know about the columns.  A <a href="http://dbpedia.org/resource/SQL" id="link-id0x4784bf0">SQL</a> update statement can be used to set values for these additional columns if one knows the <code>G,S,P,O</code>. </p>

<p>Suppose we annotated each quad with the user who inserted it and a timestamp.  These would be columns in the <code>RDF_QUAD</code> table.  The next choice would be whether these were primary key parts or dependent parts.  If primary key parts, these would be non-<code>NULL</code> and would occur on every index.  The same quad would exist for each distinct user and time this quad had been inserted.  For loading functions to work, these columns would need a default.  In practice, we think that having such metadata as a dependent part is more likely, so that <code>G,S,P,O</code> are the unique identifier of the quad.  Whether one would then include these columns on indices other than the primary key would depend on how frequently they were accessed.</p>

<p>In <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x4a8a7c0">SPARQL</a>, one could use an extension syntax like —</p>

<blockquote>
 <code><pre>SELECT  * 
  WHERE 
    { ?person      foaf:knows  ?connection 
                   OPTION ( time  ?ts )     . 
      ?connection  foaf:name   &quot;Alice&quot;      . 
      FILTER ( ?ts &gt; &quot;2009-08-08&quot;^^xsd:dateTime ) 
    }</pre>
 </code>
</blockquote>

<p>This would return everybody whose <code>foaf:knows</code> link to Alice was inserted after 2009-08-08.  This presupposes that the quad table has been extended with a datetime column.</p>

<p>The <code>OPTION (time ?ts)</code> syntax is not presently supported but we can easily add something of the sort if there is user demand for it. In practice, this would be an extension mechanism enabling one to access extension columns of <code>RDF_QUAD</code> via a column <code>?variable</code> syntax in the <code>OPTION</code> clause.</p>


<p>If quad metadata were not for every quad but still relatively frequent, another possibility would be making a separate table with a key of <code>GSPO</code> and a dependent part of <code>R</code>, where <code>R</code> would be the reification <a href="http://dbpedia.org/resource/Uniform_Resource_Identifier" id="link-id0x49e6108">URI</a> of the quad.  Reification statements would then be made with <code>R</code> as a subject.  This would be more compact than the reification vocabulary and would not modify the <code>RDF_QUAD</code> table.   The syntax for referring to this could be something like —</p>

<blockquote>
 <code><pre>SELECT * 
  WHERE 
    { ?person   foaf:knows         ?contact 
                OPTION ( reify  ?r )          . 
      ?r        xx:assertion_time  ?ts       . 
      ?contact  foaf:name          &quot;Alice&quot;   . 
      FILTER ( ?ts &gt; &quot;2008-08-08&quot;^^xsd:dateTime ) 
    }</pre>
 </code>
</blockquote>

<p>We could even recognize the reification vocabulary and convert it into the reify option if this were really necessary.  But since it is so unwieldy I don&#39;t think there would be huge demand.  Who knows?  You tell us.</p>]]></content:encoded>
 </rss:item>
 <rss:item xmlns:rss="http://purl.org/rss/1.0/" rdf:about="http://virtuoso.openlinksw.com/blog/vdb/blog/?date=2009-08-19#1571">
  <rss:title>More On Parallel RDF/Text Query Evaluation</rss:title>
  <dc:date xmlns:dc="http://purl.org/dc/elements/1.1/">2009-08-19T17:28:50Z</dc:date>
  <dc:description xmlns:dc="http://purl.org/dc/elements/1.1/">We have received some more questions about Virtuoso&#39;s parallel query evaluation model. In answer, we will here explain how we do search engine style processing by writing SPARQL. There is no need for custom procedural code because the query optimizer does all the partitioning and the equivalent of map reduce. The point is that what used to require programming can often be done in a generic query language. The technical detail is that the implementation must be smart enough with respect to parallelizing queries for this to be of practical benefit. But by combining these two things, we are a step closer to the web being the database. I will here show how we do some joins combining full text, RDF conditions, and aggregates and ORDER BY. The sample task is finding the top 20 entities with New York in some attribute value. Then we specify the search further by only taking actors associated with New York. The results are returned in the order of a composite of entity rank and text match score. The basic query is: SELECT ( sql:s_sum_page ( &lt;sql:vector_agg&gt; ( &lt;bif:vector&gt; ( ?c1 , ?sm ) ), bif:vector ( &#39;new&#39;, &#39;york&#39; ) ) ) AS ?res WHERE { { SELECT ( &lt;SHORT_OR_LONG::&gt;(?s1) ) AS ?c1 ( &lt;sql:S_SUM&gt; ( &lt;SHORT_OR_LONG::IRI_RANK&gt; ( ?s1 ) , &lt;SHORT_OR_LONG::&gt; ( ?s1textp ) , &lt;SHORT_OR_LONG::&gt; ( ?o1 ) , ?sc ) ) AS ?sm WHERE { ?s1 ?s1textp ?o1 . ?o1 bif:contains &quot;new AND york&quot; OPTION ( SCORE ?sc ) } ORDER BY DESC ( &lt;sql:sum_rank&gt; (( &lt;sql:S_SUM&gt; ( &lt;SHORT_OR_LONG::IRI_RANK&gt; ( ?s1 ) , &lt;SHORT_OR_LONG::&gt; ( ?s1textp ) , &lt;SHORT_OR_LONG::&gt; ( ?o1 ) , ?sc ) )) ) LIMIT 20 } } This takes some explaining. The basic part is { ?s1 ?s1textp ?o1 . ?o1 bif:contains &quot;new AND york&quot; OPTION ( SCORE ?sc ) } This just makes tuples where ?s1 is the object, ?s1textp the property, and ?o1 the literal which contains &quot;New York&quot;. 
For a single ?s1, there can of course be many properties which all contain &quot;New York&quot;. The rest of the query gathers all the &quot;New York&quot; containing properties of an entity into a single aggregate, and then gets the entity ranks of all such entities. After this, the aggregates are sorted by a sum of the entity rank and a combined text score calculated based on the individual text match scores between &quot;New York&quot; and the strings containing &quot;New York&quot;. The text hit score is higher if the words repeat often and in close proximity. The s_sum function is a user-defined aggregate which takes 4 arguments: The rank of the subject of the triple; the predicate of the triple containing the text; the object of the triple containing the text; and the text match score. These are grouped by the subject of the triple. After this, these are sorted by sum_score of the aggregate constructed with s_sum. The sum_score is a SQL function combining the entity rank with the text scores of the different literals. This executes as one would expect: All partitions make a text index lookup, retrieving the object of the triple. The text index entries of an object are stored in the same partition as the object. But the entity rank is a property of the subject and is partitioned by the subject. Also the GROUP BY is by the subject. Thus the data is produced from all partitions, then streamed into the receiving partitions, determined by the subject. This partition can then get the score and group the matches by the subject. Since all these partial aggregates are partitioned by the subject, there is no need to merge them; thus, the top k sort can be done for each partition separately. Finally, the top 20 of each partition are merged into the global top 20. This is then passed to a final function s_sum_page that turns this all into an XML fragment that can be processed with XSLT for inclusion on a web page. 
This differs from the text search engine in that the query pipeline can contain arbitrary cross-partition joins. Also, the string &quot;New York&quot; is a common label that occurs in many distinct entities. Thus one text match, to one document, in the case the containing only the string &quot;New York&quot; will get many entities, likely all from different partitions. So, if we only want actors with a mention of &quot;New York&quot;, we need to get the inner part of the query as: { ?s1 ?s1textp ?o1 . ?o1 bif:contains &quot;new AND york&quot; OPTION ( SCORE ?sc ) . ?s1 a &lt;http://umbel.org/umbel/sc/Actor&gt; } Whether an entity is an actor can be checked in the same partition as the rank of the entity. Thus the query plan gets this check right before getting the rank. This is natural since there is no point in getting the rank of something that is not an actor. The &lt;short_or_long::sql:func&gt; notation means that we call func, which is a SQL stored procedure with the arguments in their internal form. Thus, if a variable bound to an IRI is passed, the short_or_long specifies that it is passed as its internal ID and is not converted into its text form. This is essential, since there is no point getting the text of half a million IRIs when only 20 at most will be shown in the end. Now, when we run this on a collection of 4.5 billion triples of linked data, once we have the working set, we can get the top 20 &quot;New York&quot; occurrences, with text summaries and all, in just 1.1s, with 12 of 16 cores busy. (The hardware is two boxes with two quad-core Xeon 5345 each.) If we run this query in two parallel sessions, we get both results in 1.9s, with 14 of 16 cores busy. This gets about 200K &quot;New York&quot; strings, which becomes about 400K entities with New York somewhere, for which a rank then has to be retrieved. 
After this, all the possibly-many occurrences of New York in the title, text, and other properties of the entity are aggregated together, resolving into some 220K groups. These are then sorted. This is internally over 1.5 million random lookups and some 40MB of traffic between processes. Restricting the type of the entity to actor drops the execution time of one query to 0.8s because there are then fewer ranks to retrieve and less data to aggregate and sort. By adding partitions and cores, we scale horizontally, as evaluating the query involves almost no central control, even though data are swapped between partitions. There is some flow control to avoid constructing overly-large intermediate results but generally partitions run independently and asynchronously. In the above case, there is just one fence at the point where all aggregates are complete, so that they can be sorted; otherwise, all is asynchronous. Doing JOINs between partitions and partitioned GROUP BY/ORDER BY is pretty regular database stuff. Applying this to RDF is a most natural thing. If we do not parallelize the user-defined aggregate for grouping all the &quot;New York&quot; occurrences, the query takes 8s instead of 1.1s. If we could not put SQL procedures as user-defined aggregates to be parallelized with the query, we&#39;d have to either bring all the data to a central point before the top k, which would destroy performance, or we would have to do procedures with explicit parallel procedure calls which is hard to write, surely too hard for ad hoc queries. Results of live execution may not be complete on initial load, as this link includes a &quot;Virtuoso Anytime&quot; timeout of 10 seconds. Running against a cold cache, these results may take much longer to return; a warm cache will deliver response times along the lines of those discussed above. Engineering matters. 
If we wish to commoditize queries on a lot of data, such intelligence in the DBMS is necessary; it is very unscalable to require people to do procedural code or give query parallelization hints. If you need to optimize a workload of 10 different transactions, this is of course possible and even desirable, but for the infinity of all search or analysis, this will not happen.</dc:description>
  <content:encoded xmlns:content="http://purl.org/rss/1.0/modules/content/"><![CDATA[<p>We have received some more questions about <a href="http://virtuoso.openlinksw.com" id="link-id0x15ca9a30">Virtuoso</a>&#39;s parallel query evaluation model.</p>

<p>In answer, we will here explain how we do search engine style processing by writing <a href="http://dbpedia.org/resource/SPARQL" id="link-id0x1574c560">SPARQL</a>.  There is no need for custom procedural code because the query optimizer does all the partitioning and the equivalent of map reduce.</p>

<p>The point is that what used to require programming can often be done in a generic query language.  The technical detail is that the implementation must be smart enough with respect to parallelizing queries for this to be of practical benefit.  But by combining these two things, we are a step closer to the web being the database.</p>

<p>I will here show how we do some joins combining full text, <a href="http://dbpedia.org/resource/Resource_Description_Framework" id="link-id0x15949970">RDF</a> conditions, and aggregates and <code>ORDER BY</code>.  The sample task is finding the top 20 entities with New York in some attribute value.  Then we specify the search further by only taking actors associated with New York.  The results are returned in the order of a composite of <a href="http://dbpedia.org/resource/Entity" id="link-id0x213bf310">entity</a> rank and text match score.</p>
 
<p>The basic query is:</p>

<blockquote>
 <code><pre>
SELECT 
  ( 
    <a href="http://dbpedia.org/resource/SQL" id="link-id0x23632230">sql</a>:s_sum_page 
      ( 
        &lt;sql:vector_agg&gt; 
          (
            &lt;bif:vector&gt; ( ?c1 , ?sm )
          ), 
        bif:vector 
          ( &#39;new&#39;, &#39;york&#39; )
      )
  ) AS ?res
WHERE 
  {
    { 
      SELECT 
        ( 
          &lt;SHORT_OR_LONG::&gt;(?s1) 
        ) AS ?c1
        ( 
          &lt;sql:S_SUM&gt; 
            (
               &lt;SHORT_OR_LONG::IRI_RANK&gt;  ( ?s1 )      ,
               &lt;SHORT_OR_LONG::&gt;          ( ?s1textp ) ,
               &lt;SHORT_OR_LONG::&gt;          ( ?o1 ) ,
               ?sc 
             )
         ) AS ?sm
      WHERE 
        { 
          ?s1  ?s1textp      ?o1             . 
          ?o1  bif:contains  &quot;new AND york&quot; 
            OPTION ( SCORE ?sc )
        }
      ORDER BY 
        DESC 
          ( 
            &lt;sql:sum_rank&gt; 
              ((
                 &lt;sql:S_SUM&gt; 
                   (
                     &lt;SHORT_OR_LONG::IRI_RANK&gt; ( ?s1 )      ,
                     &lt;SHORT_OR_LONG::&gt;         ( ?s1textp ) ,
                     &lt;SHORT_OR_LONG::&gt;         ( ?o1 )      ,
                     ?sc 
                   ) 
              )) 
          ) 
        LIMIT 20 
    } 
  }
</pre>
 </code>
</blockquote>

<p>This takes some explaining.  The basic part is</p>

<blockquote>
 <code><pre>{ 
  ?s1  ?s1textp      ?o1             . 
  ?o1  bif:contains  &quot;new AND york&quot;  
    OPTION ( SCORE ?sc )
}</pre>
 </code>
</blockquote>
          
<p>This just makes tuples where <code>?s1</code> is the subject, <code>?s1textp</code> the property, and <code>?o1</code> the literal which contains &quot;New York&quot;.  For a single <code>?s1</code>, there can of course be many properties which all contain &quot;New York&quot;.</p>

<p>The rest of the query gathers all the properties of an entity that contain &quot;New York&quot; into a single aggregate, and then gets the entity rank of each such entity.</p>

<p>After this, the aggregates are sorted by a sum of the entity rank and a combined text score calculated based on the individual text match scores between &quot;New York&quot; and the strings containing &quot;New York&quot;.  The text hit score is higher if the words repeat often and in close proximity.</p>

<p>The <code>s_sum</code> function is a user-defined aggregate which takes four arguments: the rank of the subject of the triple; the predicate of the triple containing the text; the object of the triple containing the text; and the text match score.</p>

<p>These are grouped by the subject of the triple.  The groups are then sorted by the <code>sum_rank</code> of the aggregate constructed with <code>s_sum</code>.  <code>sum_rank</code> is a SQL function that combines the entity rank with the text scores of the different literals.</p>
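<p>The grouping and sorting just described can be sketched in Python as follows.  This is an illustration under stated assumptions, not the actual implementation: <code>aggregate_by_subject</code> stands in for the user-defined aggregate, <code>combined_score</code> stands in for the ranking function applied in the <code>ORDER BY</code>, and the formula used (entity rank plus the plain sum of the text scores) is a guess, since the real combining function is not given here.  All rows and numbers are made up.</p>

```python
from collections import defaultdict

# Toy input rows: (subject, entity_rank, predicate, literal, text_score).
# All values are made up for illustration.
ROWS = [
    ("ex:NYC", 5.0, "rdfs:label", "New York City", 10),
    ("ex:NYC", 5.0, "rdfs:comment", "New York is a city in New York", 14),
    ("ex:Actor", 8.0, "ex:birthPlace", "New York", 9),
    ("ex:Guide", 0.5, "rdfs:comment", "A new guide to York", 3),
]


def aggregate_by_subject(rows):
    """Collect every matching (predicate, literal, score) under its subject."""
    groups = defaultdict(lambda: {"rank": 0.0, "hits": []})
    for subject, rank, predicate, literal, score in rows:
        groups[subject]["rank"] = rank  # the rank is a property of the subject
        groups[subject]["hits"].append((predicate, literal, score))
    return dict(groups)


def combined_score(group):
    """ASSUMPTION: entity rank plus the plain sum of the text match scores."""
    return group["rank"] + sum(score for _, _, score in group["hits"])


def top_k(rows, k=20):
    """Return the k subjects with the highest combined score, best first."""
    groups = aggregate_by_subject(rows)
    ordered = sorted(groups, key=lambda s: combined_score(groups[s]), reverse=True)
    return ordered[:k]


ranking = top_k(ROWS, k=2)
print(ranking)
```

<p>With these numbers, <code>ex:NYC</code> ranks first because its two matching literals add up, even though <code>ex:Actor</code> has the higher entity rank.</p>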

<p>This executes as one would expect: All partitions make a text index lookup, retrieving the object of the triple.  The text index entries of an object are stored in the same partition as the object.  But the entity rank is a property of the subject and is partitioned by the subject.  Also the <code>GROUP BY</code> is by the subject.  Thus the <a href="http://dbpedia.org/resource/Data" id="link-id0x15da01b8">data</a> is produced from all partitions, then streamed into the receiving partitions, determined by the subject.  This partition can then get the score and group the matches by the subject.  Since all these partial aggregates are partitioned by the subject, there is no need to merge them; thus, the top <code>k</code> sort can be done for each partition separately.  Finally, the top 20 of each partition are merged into the global top 20.  This is then passed to a final function <code>s_sum_page</code> that turns this all into an <a href="http://dbpedia.org/resource/XML" id="link-id0x15d59fc8">XML</a> fragment that can be processed with XSLT for inclusion on a we