In Hoc Signo Vinces (part 11 of n): TPC-H Q2, Q10 - Late Projection

Analytics is generally about making something small out of something large. This reduction is obtained by a TOP k operator (i.e., show only the 10 best by some metric) and/or by grouping and aggregation (i.e., for a set of items, show some attributes of these items and a sum, count, or other aggregate of dependent items for each).

In this installment we will look at late projection, also sometimes known as late materialization. If many attributes are returned and there is a cutoff of some sort, then the query does not need to be concerned about attributes on which there are no conditions, except for fetching them at the last moment, only for the entities which in fact will be returned to the user.
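As a minimal sketch of the idea -- on a hypothetical wide table items (id, score, wide_1, ..., wide_n), so all names here are purely illustrative -- the rewrite amounts to this:

  -- conditions, ordering, and the cutoff run over the narrow columns only
  SELECT  k.id,
          k.score,
          i.wide_1,
          i.wide_2
    FROM  ( SELECT  TOP 10
                    id,
                    score
              FROM  items
             WHERE  score > 0
          ORDER BY  score DESC
          ) k,
          items i
   WHERE  i.id = k.id
ORDER BY  k.score DESC
  -- the wide, unconditioned columns are fetched only for the 10 surviving rows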

We look at TPC-H Q2 and Q10.

Q2:

  SELECT  TOP 100
                   s_acctbal,
                   s_name,
                   n_name,
                   p_partkey,
                   p_mfgr,
                   s_address,
                   s_phone,
                   s_comment
    FROM  part,
          supplier,
          partsupp,
          nation,
          region
   WHERE  p_partkey = ps_partkey
     AND  s_suppkey = ps_suppkey
     AND  p_size = 15
     AND  p_type LIKE '%BRASS'
     AND  s_nationkey = n_nationkey
     AND  n_regionkey = r_regionkey
     AND  r_name = 'EUROPE'
     AND  ps_supplycost = 
            ( SELECT  MIN(ps_supplycost)
                FROM  partsupp,
                      supplier,
                      nation,
                      region
               WHERE  p_partkey = ps_partkey
                 AND  s_suppkey = ps_suppkey
                 AND  s_nationkey = n_nationkey
                 AND  n_regionkey = r_regionkey
                 AND  r_name = 'EUROPE'
            )
ORDER BY  s_acctbal DESC,
          n_name,
          s_name,
          p_partkey

The intent is to return information about parts and suppliers, such that the part is available from a supplier in Europe, and the supplier has the lowest price for the part among all European suppliers.

Q10:

  SELECT  TOP 20
                                                     c_custkey,
                                                     c_name,
          SUM(l_extendedprice * (1 - l_discount)) AS revenue,
                                                     c_acctbal, 
                                                     n_name,
                                                     c_address,
                                                     c_phone,
                                                     c_comment
    FROM  customer,
          orders,
          lineitem,
          nation
   WHERE  c_custkey = o_custkey
     AND  l_orderkey = o_orderkey
     AND  o_orderdate >= CAST ('1993-10-01' AS DATE)
     AND  o_orderdate < DATEADD ('month', 3, CAST ('1993-10-01' AS DATE))
     AND  l_returnflag = 'R'
     AND  c_nationkey = n_nationkey
GROUP BY  c_custkey,
          c_name,
          c_acctbal,
          c_phone,
          n_name,
          c_address,
          c_comment
ORDER BY  revenue DESC

The intent is to list the customers who cause the greatest loss of revenue in a given quarter by returning items ordered in said quarter.

We notice that both queries return many columns on which there are no conditions, and that both have a cap on returned rows. The difference is that in Q2 the major ORDER BY is on a grouping column, and in Q10 it is on the aggregate of the GROUP BY. Thus the TOP k trick discussed in the previous article does apply to Q2 but not to Q10.

The profile for Q2 follows:

{ 
time   6.1e-05% fanout         1 input         1 rows
time       1.1% fanout         1 input         1 rows
{ hash filler
Subquery 27 
{ 
time    0.0012% fanout         1 input         1 rows
REGION         1 rows(t10.R_REGIONKEY)
 R_NAME = <c EUROPE>
time   0.00045% fanout         5 input         1 rows
NATION         5 rows(t9.N_NATIONKEY)
 N_REGIONKEY = t10.R_REGIONKEY
time       1.6% fanout     40107 input         5 rows
SUPPLIER   4.2e+04 rows(t8.S_SUPPKEY)
 S_NATIONKEY = t9.N_NATIONKEY
 
After code:
      0: t8.S_SUPPKEY :=  := artm t8.S_SUPPKEY
      4: BReturn 0
time       0.1% fanout         0 input    200535 rows
Sort hf 49 (t8.S_SUPPKEY)
}
}
time    0.0004% fanout         1 input         1 rows
{ fork
time        21% fanout     79591 input         1 rows
PART     8e+04 rows(.P_PARTKEY)
 P_TYPE LIKE <c %BRASS> LIKE <c > ,  P_SIZE =  15 
time        44% fanout  0.591889 input     79591 rows
 
Precode:
      0: { 
time     0.083% fanout         1 input     79591 rows
time      0.13% fanout         1 input     79591 rows
{ fork
time        24% fanout  0.801912 input     79591 rows
PARTSUPP       3.5 rows(.PS_SUPPKEY, .PS_SUPPLYCOST)
 inlined  PS_PARTKEY = k_.P_PARTKEY
hash partition+bloom by 62 (tmp)hash join merged always card       0.2 -> ()
time       1.3% fanout         0 input     63825 rows
Hash source 49 merged into ts not partitionable       0.2 rows(.PS_SUPPKEY) -> ()
 
After code:
      0:  min min.PS_SUPPLYCOSTset no set_ctr
      5: BReturn 0
}
 
After code:
      0: aggregate :=  := artm min
      4: BReturn 0
time      0.19% fanout         0 input     79591 rows
Subquery Select(aggregate)
}
 
      8: BReturn 0
PARTSUPP     5e-08 rows(.PS_SUPPKEY)
 inlined  PS_PARTKEY = k_.P_PARTKEY PS_SUPPLYCOST = k_scalar
time       5.9% fanout  0.247023 input     47109 rows
SUPPLIER unq       0.9 rows (.S_ACCTBAL, .S_NATIONKEY, .S_NAME, .S_SUPPKEY)
 inlined  S_SUPPKEY = .PS_SUPPKEY
top k on S_ACCTBAL
time     0.077% fanout         1 input     11637 rows
NATION unq         1 rows (.N_REGIONKEY, .N_NAME)
 inlined  N_NATIONKEY = .S_NATIONKEY
time     0.051% fanout         1 input     11637 rows
REGION unq       0.2 rows ()
 inlined  R_REGIONKEY = .N_REGIONKEY R_NAME = <c EUROPE>
time      0.42% fanout         0 input     11637 rows
Sort (.S_ACCTBAL, .N_NAME, .S_NAME, .P_PARTKEY) -> (.S_SUPPKEY)
 
}
time    0.0016% fanout       100 input         1 rows
top order by read (.S_SUPPKEY, .P_PARTKEY, .N_NAME, .S_NAME, .S_ACCTBAL)
time      0.02% fanout         1 input       100 rows
PART unq      0.95 rows (.P_MFGR)
 inlined  P_PARTKEY = .P_PARTKEY
time     0.054% fanout         1 input       100 rows
SUPPLIER unq         1 rows (.S_PHONE, .S_ADDRESS, .S_COMMENT)
 inlined  S_SUPPKEY = k_.S_SUPPKEY
time   6.7e-05% fanout         0 input       100 rows
Select (.S_ACCTBAL, .S_NAME, .N_NAME, .P_PARTKEY, .P_MFGR, .S_ADDRESS, .S_PHONE, .S_COMMENT)
}


 128 msec 1007% cpu,    196992 rnd 2.53367e+07 seq   50.4135% same seg   45.3574% same pg 

The query starts with a scan looking for the qualifying parts. It then looks for the best price for each part from a European supplier. All the European suppliers have been previously put in a hash table by the hash filler subquery at the start of the plan. Thus, to find the minimum price, the query takes the partsupp for the part by index, and then eliminates all non-European suppliers by a selective hash join. After this, there is a second index lookup on partsupp where we look for the part and the price equal to the minimum price found earlier. These operations could in principle be merged, as the minimum price partsupp has already been seen. The gain would not be very large, though.
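The hash filler at the start of the plan corresponds roughly to the following subquery (a sketch; in the profile above, this is the block that builds Sort hf 49 on s_suppkey):

SELECT  s_suppkey
  FROM  supplier,
        nation,
        region
 WHERE  s_nationkey = n_nationkey
   AND  n_regionkey = r_regionkey
   AND  r_name = 'EUROPE'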

Here we note that the cost model guesses that very few rows will survive the check of ps_supplycost = minimum cost. It does not know that the minimum is not just any value, but one of the values that do occur in the ps_supplycost column for the part. Because of this, the remainder of the plan is carried out by index, which is just as well. The point is that if very few rows of input are expected, it is not worthwhile to make a hash table for a hash join. The hash table made for the European suppliers could be reused here, maybe with some small gain. It would however need more columns, which might make it not worthwhile. We note that the major order with the TOP k is on the supplier s_acctbal, hence as soon as there are 100 suppliers found, one can add a restriction on the s_acctbal for subsequent ones.

At the end of the plan, after the TOP k ORDER BY and the reading of the results, we have a separate index-based lookup for getting only the columns that are returned. We note that this is done on 100 rows, whereas the previous operations are done on tens of thousands of rows. The TOP k restriction produces some benefit, but it comes relatively late in the plan, and not many operations follow it.
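Written out by hand, the late-projected Q2 looks roughly like this (a sketch; the inner TOP 100 carries only the keys and the ORDER BY columns, while p_mfgr, s_address, s_phone, and s_comment are fetched afterwards for just the 100 surviving rows):

  SELECT  k.s_acctbal,
          k.s_name,
          k.n_name,
          k.p_partkey,
          p.p_mfgr,
          s.s_address,
          s.s_phone,
          s.s_comment
    FROM  ( SELECT  TOP 100
                    s_acctbal,
                    s_name,
                    n_name,
                    p_partkey,
                    s_suppkey
              FROM  part,
                    supplier,
                    partsupp,
                    nation,
                    region
             WHERE  p_partkey = ps_partkey
               AND  s_suppkey = ps_suppkey
               AND  p_size = 15
               AND  p_type LIKE '%BRASS'
               AND  s_nationkey = n_nationkey
               AND  n_regionkey = r_regionkey
               AND  r_name = 'EUROPE'
               AND  ps_supplycost =
                      ( SELECT  MIN (ps_supplycost)
                          FROM  partsupp, supplier, nation, region
                         WHERE  p_partkey = ps_partkey
                           AND  s_suppkey = ps_suppkey
                           AND  s_nationkey = n_nationkey
                           AND  n_regionkey = r_regionkey
                           AND  r_name = 'EUROPE'
                      )
          ORDER BY  s_acctbal DESC, n_name, s_name, p_partkey
          ) k,
          part p,
          supplier s
   WHERE  p.p_partkey = k.p_partkey
     AND  s.s_suppkey = k.s_suppkey
ORDER BY  k.s_acctbal DESC, k.n_name, k.s_name, k.p_partkey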

The plan is easily good enough, with only a little room for improvement. Q2 is one of the fastest queries of the set.

Let us now consider the execution of Q10:

{ 
time   1.1e-06% fanout         1 input         1 rows
time   4.4e-05% fanout         1 input         1 rows
{ hash filler
time   1.6e-05% fanout        25 input         1 rows
NATION        25 rows(.N_NATIONKEY, .N_NAME)
 
time   6.7e-06% fanout         0 input        25 rows
Sort hf 35 (.N_NATIONKEY) -> (.N_NAME)
 
}
time   1.5e-06% fanout         1 input         1 rows
{ fork
time   2.4e-06% fanout         1 input         1 rows
{ fork
time        13% fanout 5.73038e+06 input         1 rows
ORDERS   5.1e+06 rows(.O_ORDERKEY, .O_CUSTKEY)
 O_ORDERDATE >= <c 1993-10-01> < <c 1994-01-01>
time       4.8% fanout   2.00042 input 5.73038e+06 rows
LINEITEM       1.1 rows(.L_EXTENDEDPRICE, .L_DISCOUNT)
 inlined  L_ORDERKEY = .O_ORDERKEY L_RETURNFLAG = <c R>
time        25% fanout         1 input 1.14632e+07 rows
 
Precode:
      0: temp := artm  1  - .L_DISCOUNT
      4: temp := artm .L_EXTENDEDPRICE * temp
      8: BReturn 0
CUSTOMER unq         1 rows (.C_NATIONKEY, .C_CUSTKEY)
 inlined  C_CUSTKEY = k_.O_CUSTKEY
hash partition+bloom by 39 (tmp)hash join merged always card         1 -> (.N_NAME)
time    0.0023% fanout         1 input 1.14632e+07 rows
Hash source 35 merged into ts          1 rows(.C_NATIONKEY) -> (.N_NAME)
time       2.3% fanout         1 input 1.14632e+07 rows
Stage 2
time       3.6% fanout         0 input 1.14632e+07 rows
Sort (q_.C_CUSTKEY, .N_NAME) -> (temp)
 
}
time       0.6% fanout 3.88422e+06 input         1 rows
group by read node  
(.C_CUSTKEY, .N_NAME, revenue)in each partition slice
time      0.57% fanout         0 input 3.88422e+06 rows
Sort (revenue) -> (.N_NAME, .C_CUSTKEY)
 
}
time   6.9e-06% fanout        20 input         1 rows
top order by read (.N_NAME, revenue, .C_CUSTKEY)
time   0.00036% fanout         1 input        20 rows
CUSTOMER unq         1 rows (.C_PHONE, .C_NAME, .C_ACCTBAL, .C_ADDRESS, .C_COMMENT)
 inlined  C_CUSTKEY = .C_CUSTKEY
time   1.1e-06% fanout         0 input        20 rows
Select (.C_CUSTKEY, .C_NAME, revenue, .C_ACCTBAL, .N_NAME, .C_ADDRESS, .C_PHONE, .C_COMMENT)
}


 2153 msec 2457% cpu, 1.71845e+07 rnd 1.67177e+08 seq   76.3221% same seg   21.1204% same pg 

The plan is by index, except for the lookup of nation name for the customer. The most selective condition is on order date, followed by the returnflag on lineitem. Getting the customer by index turns out to be better than by hash, even though almost all customers are hit. See the input cardinality above the first customer entry in the plan -- over 10M. The key point here is that only the c_custkey and c_nationkey get fetched, which saves a lot of time. In fact, c_custkey is needless, since it is in any case equal to o_custkey, but this makes little difference.

One could argue that customer should be between lineitem and orders in join order. Doing this would lose the ORDER BY on orders and lineitem, but would prevent some customer rows from being hit twice for a single order. The difference would not be large, though. For a scale-out setting, one definitely wants to have orders and lineitem without customer in between if the former are partitioned on the same key.

The c_nationkey is next translated into a n_name by hash, and there is a partitioned GROUP BY on c_custkey. The GROUP BY is partitioned because there are many different c_custkey values (155M for 100G scale).

The most important trick is fetching all the many dependent columns of c_custkey only after the TOP k ORDER BY. The last access to customer in the plan does this and is only executed on 20 rows.
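Written out by hand, this amounts to the following shape for Q10 (a sketch; it relies on the customer columns being functionally dependent on c_custkey, so the GROUP BY can run on c_custkey and n_name alone):

  SELECT  k.c_custkey,
          c.c_name,
          k.revenue,
          c.c_acctbal,
          k.n_name,
          c.c_address,
          c.c_phone,
          c.c_comment
    FROM  ( SELECT  TOP 20
                    c_custkey,
                    n_name,
                    SUM (l_extendedprice * (1 - l_discount)) AS revenue
              FROM  customer,
                    orders,
                    lineitem,
                    nation
             WHERE  c_custkey = o_custkey
               AND  l_orderkey = o_orderkey
               AND  o_orderdate >= CAST ('1993-10-01' AS DATE)
               AND  o_orderdate < DATEADD ('month', 3, CAST ('1993-10-01' AS DATE))
               AND  l_returnflag = 'R'
               AND  c_nationkey = n_nationkey
          GROUP BY  c_custkey, n_name
          ORDER BY  revenue DESC
          ) k,
          customer c
   WHERE  c.c_custkey = k.c_custkey
ORDER BY  k.revenue DESC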

Without the TOP k trick, the plan is identical, except that the dependent columns are fetched for nearly all customers. If this is done, the run time is 16s, which is bad enough to sink the whole score.

There is another approach to the challenge of this query: If foreign keys are declared and enforced, the system will know that every order has an actually existing customer and that every customer has a country. If so, the whole GROUP BY and TOP k can be done without any reference to customer, which is a notch better still, at least for this query. In this implementation, we do not declare foreign keys, thus the database must check that the customer and its country in fact exist before doing the GROUP BY. This makes the late projection trick mandatory, but does save the expense of checking foreign keys on updates. In both cases, the optimizer must recognize that the columns to be fetched at the end (late projected) are functionally dependent on a grouping key (c_custkey).

The late projection trick is generally useful, since almost all applications aside from bulk data export have some sort of limit on result set size. A column store especially benefits from this, since some columns of a row can be read without even coming near the others. A row store can also benefit in the form of decreased intermediate result size. This is especially good when returning long columns, such as text fields or blobs, on which there are most often no search conditions. If there are conditions on such columns, these will most often be implemented via a special text index rather than a scan.

*           *           *           *           *

In the next installment we will have a look at the overall 100G single server performance. After this we will recap the tricks so far. Then it will be time to look at implications of scale out for performance and run at larger scales. After the relational ground has been covered, we can look at implications of schema-lastness, i.e., triples for this type of workload.

So, while the most salient tricks have been at least briefly mentioned, we are far from having exhausted this most foundational of database topics.

To be continued...

In Hoc Signo Vinces (TPC-H) Series

04/07/2014 12:30 GMT-0500
OpenPHACTS in Vienna

Hugh Williams and I (Orri Erling) went to the Open PHACTS Steering Committee meeting in Vienna last week. I am a great fan of Open PHACTS; the meetings are fun, with a great team spirit, and there is always something new to learn.

Paul Groth gave a talk about the stellar success of the initial term of Open PHACTS.

  • Three releases of platform and data
  • 18 applications
  • Open PHACTS Foundation for sustainable exploitation and further development of the platform
  • Superb culture of collaboration
    • great team spirit
    • great output from distributed organization
    • lots of face-to-face time
    • example to every other big collaborative project

"The reincarnation of Steve Jobs," commented someone from the audience. "Except I am a nice guy," retorted Paul.

Commented one attendee, "The semantic web... I was just in Boston at a semantic web meeting -- so nerdy, something to make you walk out of the room... so it is a definite victory for Open PHACTS, and why not also the semantic web, that something based on these principles actually works."

It is a win anyhow, so I did not say anything at the meeting. I will say something here instead, where I have more space, as the message bears repeating.

We share part of the perception, so we hardly ever say "semantic web." The word is "linked data," and it means flexible schema and global identifiers. Flexible schema means that everything does not have to be modeled upfront. Global identifiers means that data, when transferred out of its silo of origin, remains interpretable and self-describing, so you can mix it with other data without things getting confused. "Desiloization" is a wonderful new word for describing this.

This ties right into FAIRport and FAIR data: Findable, Accessible, Interoperable, Reusable. Barend Mons talked a lot about this: open just means downloadable; fair means something you can do science with. Barend’s take is that RDF with a URI for everything is the super wire format for exchanging data. When you process it, you will diversely cook it, so an RDF store is one destination but not the only possibility. It has been said before: there is a range of choices between storing triples verbatim, and making application specific extractions, including ones with a schema, whether graph DB or relational.

Nanopublications are also moving ahead. Christine Chichester told me about pending publications involving Open PHACTS nanopublications about post-translational modification of proteins and their expression in different tissues. So there are nanopublications out there and they can be joined, just as intended -- a victory for e-science and data integration.

The Open PHACTS project is now officially extended for another two-year term, bringing the total duration to five years. The Open PHACTS Foundation exists as a legal entity and has its first members. It is meant to be a non-profit industry association for sharing pre-competitive data, and services around such data, between players in the pharma space, in industry as well as academia. There are press releases to follow in due time.

I am looking forward to more Open PHACTS. From the OpenLink and Virtuoso side, there are directly relevant developments that will enter production in the next few months, including query caching discussed earlier on this blog, as well as running on the TPC-H tuned analytics branch for overall better query optimization. Adaptive schema is something of evident value to Open PHACTS, as much of the integrated data comes from relational sources, so is regular enough. Therefore taking advantage of this for storage cannot hurt. We will see this still within the scope of the project extension.

Otherwise, more cooperation in formulating the queries for the business questions will also help.

All in all, Open PHACTS is, it would seem, the celebrated beauty queen of the whole Innovative Medicines Initiative: superbly connected, an unparalleled logo cloud, actually working and useful data integration, and delivery on time on what are in fact very complex business questions.

03/31/2014 11:49 GMT-0500
In Hoc Signo Vinces (part 10 of n): TPC-H Q9, Q17, Q20 - Predicate Games

TPC-H is a hash join game. The rules do allow indices, but maintaining these takes time, and indices will quickly result in non-local access patterns. Indices also take space. Besides, somebody must know what indices to create, which is not obvious. Thus, it is best if a BI data warehouse works without them.

Once you go to hash join, one side of the join will be materialized, which takes space, which ipso facto is bad. So, the predicate games are about moving conditions so that the hash table made for the hash join will be as small as possible. Only items that may in fact be retrieved should be put in the hash table. If you know that the query deals with shipments of green parts, putting lineitems of parts that are not green in a hash table makes no sense since only green ones are being looked for.

So, let's consider Q9. The query is:

SELECT                 nation,
                       o_year,
        SUM(amount) AS sum_profit
 FROM  ( SELECT
                                                                          n_name AS nation,
                                               EXTRACT ( YEAR FROM o_orderdate ) AS o_year,
                 l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity AS amount
           FROM
                 part,
                 supplier,
                 lineitem,
                 partsupp,
                 orders,
                 nation
          WHERE  s_suppkey = l_suppkey
            AND  ps_suppkey = l_suppkey
            AND  ps_partkey = l_partkey
            AND  p_partkey = l_partkey
            AND  o_orderkey = l_orderkey
            AND  s_nationkey = n_nationkey
            AND  p_name like '%green%'
       ) AS profit
GROUP BY  nation,
          o_year
ORDER BY  nation,
          o_year DESC
;

The intent is to calculate profit from the sale of a type of part, broken down by year and supplier nation. All orders, lineitems, partsupps, and suppliers involving the parts of interest are visited. This is one of the longest running of the queries. The query is restricted by part only, and the condition selects 1/17 of all parts.

The execution plan is below. First the plan builds hash tables of all nations and suppliers. We expect to do frequent lookups, thus making a hash is faster than using the index. Partsupp is the 3rd largest table in the database. This has a primary key of ps_partkey, ps_suppkey, referenced by the compound foreign key l_partkey, l_suppkey in lineitem. This could be accessed by index, but we expect to hit each partsupp row multiple times, hence hash is better. We further note that only partsupp rows where the part satisfies the condition will contribute to the result. Thus we import the join with part into the hash build. The ps_partkey is not directly joined to p_partkey, but rather the system must understand that this follows from l_partkey = ps_partkey and l_partkey = p_partkey. In this way, the hash table is 1/17th of the size it would otherwise be, which is a crucial gain.
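As a sketch, the restricted build side corresponds to the following derived table (this is what Subquery 58 in the profile below computes):

SELECT  ps_suppkey,
        ps_partkey,
        ps_supplycost
  FROM  partsupp,
        part
 WHERE  ps_partkey = p_partkey
   AND  p_name LIKE '%green%'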

Looking further into the plan, we note a scan of lineitem followed by a hash join with part. The restriction already built into the partsupp hash would have the same effect, hence part is here used twice while it occurs only once in the query. This is deliberate, since the selective hash join with part restricts lineitem faster than the more complex hash join with a two-part key (l_partkey, l_suppkey). Both joins perform the identical restriction, but doing the part join first is faster, since it becomes a single-key, invisible hash join, merged into the lineitem scan and done before even accessing l_suppkey and the other columns.

{ 
time   3.9e-06% fanout         1 input         1 rows
time   4.7e-05% fanout         1 input         1 rows
{ hash filler
time   3.6e-05% fanout        25 input         1 rows
NATION        25 rows(.N_NATIONKEY, nation)
 
time   8.8e-06% fanout         0 input        25 rows
Sort hf 35 (.N_NATIONKEY) -> (nation)
 
}
time      0.16% fanout         1 input         1 rows
{ hash filler
time     0.011% fanout     1e+06 input         1 rows
SUPPLIER     1e+06 rows(.S_SUPPKEY, .S_NATIONKEY)
 
time      0.03% fanout         0 input     1e+06 rows
Sort hf 49 (.S_SUPPKEY) -> (.S_NATIONKEY)
 
}
time      0.57% fanout         1 input         1 rows
{ hash filler
Subquery 58 
{ 
time       1.6% fanout 1.17076e+06 input         1 rows
PART   1.2e+06 rows(t1.P_PARTKEY)
 P_NAME LIKE  <c %green%> LIKE  <c >
time       1.1% fanout         4 input 1.17076e+06 rows
PARTSUPP       3.9 rows(t4.PS_SUPPKEY, t4.PS_PARTKEY, t4.PS_SUPPLYCOST)
 inlined  PS_PARTKEY = t1.P_PARTKEY
 
After code:
      0: t4.PS_SUPPKEY :=  := artm t4.PS_SUPPKEY
      4: t4.PS_PARTKEY :=  := artm t4.PS_PARTKEY
      8: t1.P_PARTKEY :=  := artm t1.P_PARTKEY
      12: t4.PS_SUPPLYCOST :=  := artm t4.PS_SUPPLYCOST
      16: BReturn 0
time      0.33% fanout         0 input 4.68305e+06 rows
Sort hf 82 (t4.PS_SUPPKEY, t4.PS_PARTKEY) -> (t1.P_PARTKEY, t4.PS_SUPPLYCOST)
 
}
}
time      0.18% fanout         1 input         1 rows
{ hash filler
time       1.6% fanout 1.17076e+06 input         1 rows
PART   1.2e+06 rows(.P_PARTKEY)
 P_NAME LIKE  <c %green%> LIKE  <c >
time     0.017% fanout         0 input 1.17076e+06 rows
Sort hf 101 (.P_PARTKEY)
}
time   5.1e-06% fanout         1 input         1 rows
{ fork
time   4.1e-06% fanout         1 input         1 rows
{ fork
time        59% fanout 3.51125e+07 input         1 rows
LINEITEM     6e+08 rows(.L_PARTKEY, .L_ORDERKEY, .L_SUPPKEY, .L_EXTENDEDPRICE, .L_DISCOUNT, .L_QUANTITY)
 
hash partition+bloom by 108 (tmp)hash join merged always card     0.058 -> ()
hash partition+bloom by 56 (tmp)hash join merged always card         1 -> (.S_NATIONKEY)
time      0.18% fanout         1 input 3.51125e+07 rows
 
Precode:
      0: temp := artm  1  - .L_DISCOUNT
      4: temp := artm .L_EXTENDEDPRICE * temp
      8: BReturn 0
Hash source 101 merged into ts      0.058 rows(.L_PARTKEY) -> ()
time        17% fanout         1 input 3.51125e+07 rows
Hash source 82       0.057 rows(.L_SUPPKEY, .L_PARTKEY) -> (  <none> , .PS_SUPPLYCOST)
time       6.2% fanout         1 input 3.51125e+07 rows
 
Precode:
      0: temp := artm .PS_SUPPLYCOST * .L_QUANTITY
      4: temp := artm temp - temp
      8: BReturn 0
ORDERS unq         1 rows (.O_ORDERDATE)
 inlined  O_ORDERKEY = k_.L_ORDERKEY
time    0.0055% fanout         1 input 3.51125e+07 rows
Hash source 49 merged into ts          1 rows(k_.L_SUPPKEY) -> (.S_NATIONKEY)
time       3.5% fanout         1 input 3.51125e+07 rows
Hash source 35           1 rows(k_.S_NATIONKEY) -> (nation)
time       8.8% fanout         0 input 3.51125e+07 rows
 
Precode:
      0: o_year := Call year (.O_ORDERDATE)
      5: BReturn 0
Sort (nation, o_year) -> (temp)
 
}
time   4.7e-05% fanout       175 input         1 rows
group by read node  
(nation, o_year, sum_profit)
time   0.00028% fanout         0 input       175 rows
Sort (nation, o_year) -> (sum_profit)
 
}
time   2.2e-05% fanout       175 input         1 rows
Key from temp (nation, o_year, sum_profit)
 
time   1.6e-06% fanout         0 input       175 rows
Select (nation, o_year, sum_profit)
}


 6114 msec 1855% cpu, 3.62624e+07 rnd 6.44384e+08 seq   99.6068% same seg  0.357328% same pg 

6.1s is a good score for this query. When executing the same in 5 parallel invocations, the fastest ends in 13.7s and the slowest in 27.6s. For five concurrent executions, the peak transient memory utilization is 4.7 GB for the hash tables, which is very reasonable.

*           *           *           *           *

Let us next consider Q17.

SELECT
        SUM(l_extendedprice) / 7.0 AS avg_yearly
FROM
        lineitem,
        part
WHERE
        p_partkey = l_partkey
   AND  p_brand = 'Brand#23'
   AND  p_container = 'MED BOX'
   AND  l_quantity 
           < (
                   SELECT
                            2e-1 * AVG(l_quantity)
                   FROM
                            lineitem
                   WHERE
                            l_partkey = p_partkey
                )

Deceptively simple? This calculates the total value of small orders (below 1/5 of average quantity for the part) for all parts of a given brand with a specific container.

If there is an index on l_partkey, the plan is easy enough: Take the parts, look up the average quantity for each, then recheck lineitem and add up the small lineitems. This takes about 1s. But we do not want indices for this workload.

If we made a hash from l_partkey to l_quantity for all lineitems, we could run out of space, and it would take so long that the race would be automatically lost on this point alone. The trick is to import the restriction on l_partkey into the hash build. This gives us a plan that scans lineitem twice, each time with a very selective hash join (few parts). For each lineitem that joins with a qualifying part, there is a lookup of the average, so the average is potentially calculated several times.
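As a sketch, the restricted build side corresponds to the following (cf. Subquery 40 in the profile below); only lineitems of parts with the given brand and container enter the hash table:

SELECT  l_partkey,
        l_quantity
  FROM  lineitem,
        part
 WHERE  l_partkey = p_partkey
   AND  p_brand = 'Brand#23'
   AND  p_container = 'MED BOX'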

The plan below is workable, but better is possible: we notice that the very selective join need only be done once; it is cheaper to remember its result than to compute it twice, and the result is not large. The other trick is that the correlated subquery can be rewritten as

SELECT
        ... 
  FROM
        lineitem, 
        part, 
        ( SELECT 
                                           l_partkey, 
                 0.2 * AVG (l_quantity) AS qty 
            FROM 
                 lineitem, 
                 part 
            ...
        ) f 
 WHERE
        l_partkey = f.l_partkey 
 ...

In this form, one can put the entire derived table f on the build side of a hash join. In this way, the average is never done more than once per part.
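Spelled out in full, the decorrelated form might read as below (a sketch of the rewrite; the profile that follows is still for the first, twice-scanning plan). Here the per-part cutoff is computed at most once, inside the derived table f:

SELECT  SUM (l.l_extendedprice) / 7.0 AS avg_yearly
  FROM  lineitem l,
        part p,
        ( SELECT  l_partkey,
                  0.2 * AVG (l_quantity) AS qty
            FROM  lineitem,
                  part
           WHERE  l_partkey = p_partkey
             AND  p_brand = 'Brand#23'
             AND  p_container = 'MED BOX'
        GROUP BY  l_partkey
        ) f
 WHERE  p.p_partkey = l.l_partkey
   AND  p.p_brand = 'Brand#23'
   AND  p.p_container = 'MED BOX'
   AND  l.l_partkey = f.l_partkey
   AND  l.l_quantity < f.qty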

{ 
time   7.9e-06% fanout         1 input         1 rows
time    0.0031% fanout         1 input         1 rows
{ hash filler
time      0.27% fanout     20031 input         1 rows
PART     2e+04 rows(.P_PARTKEY)
 P_BRAND =  <c Brand#23> ,  P_CONTAINER =  <c MED BOX>
time   0.00047% fanout         0 input     20031 rows
Sort hf 34 (.P_PARTKEY)
}
time       0.1% fanout         1 input         1 rows
{ hash filler
Subquery 40 
{ 
time        46% fanout    600982 input         1 rows
LINEITEM     6e+08 rows(t4.L_PARTKEY, t4.L_QUANTITY)
 
hash partition+bloom by 38 (tmp)hash join merged always card     0.001 -> ()
time    0.0042% fanout         1 input    600982 rows
Hash source 34 merged into ts not partitionable     0.001 rows(t4.L_PARTKEY) -> ()
 
After code:
      0: t4.L_PARTKEY :=  := artm t4.L_PARTKEY
      4: t4.L_QUANTITY :=  := artm t4.L_QUANTITY
      8: BReturn 0
time     0.059% fanout         0 input    600982 rows
Sort hf 62 (t4.L_PARTKEY) -> (t4.L_QUANTITY)
 
}
}
time   6.8e-05% fanout         1 input         1 rows
{ fork
time        46% fanout    600982 input         1 rows
LINEITEM     6e+08 rows(.L_PARTKEY, .L_QUANTITY, .L_EXTENDEDPRICE)
 
hash partition+bloom by 38 (tmp)hash join merged always card   0.00052 -> ()
time   0.00021% fanout         1 input    600982 rows
Hash source 34 merged into ts    0.00052 rows(.L_PARTKEY) -> ()
 
Precode:
      0: .P_PARTKEY :=  := artm .L_PARTKEY
      4: BReturn 0
END Node
After test:
      0: { 
time     0.038% fanout         1 input    600982 rows
time      0.17% fanout         1 input    600982 rows
{ fork
time       6.8% fanout         0 input    600982 rows
Hash source 62  not partitionable      0.03 rows(k_.P_PARTKEY) -> (.L_QUANTITY)
 
After code:
      0:  sum sum.L_QUANTITYset no set_ctr
      5:  sum count 1 set no set_ctr
      10: BReturn 0
}
 
After code:
      0: temp := artm sum / count
      4: temp := artm  0.2  * temp
      8: aggregate :=  := artm temp
      12: BReturn 0
time     0.042% fanout         0 input    600982 rows
Subquery Select(aggregate)
}
 
      8: if (.L_QUANTITY < scalar) then 12 else 13 unkn 13
      12: BReturn 1
      13: BReturn 0
 
After code:
      0:  sum sum.L_EXTENDEDPRICE
      5: BReturn 0
}
 
After code:
      0: avg_yearly := artm sum /  7 
      4: BReturn 0
time   4.6e-06% fanout         0 input         1 rows
Select (avg_yearly)
}


 2695 msec 1996% cpu,         3 rnd 1.18242e+09 seq         0% same seg         0% same pg 

2.7s is tolerable, but if this drags down the overall score by too much, we know that a 2+x improvement is readily available. Playing the rest of the tricks would result in the hash plan almost catching up with the 1s execution time of the index-based plan.

*           *           *           *           *

Q20 is not very long-running, but it is perhaps the hardest of the lot to optimize. As usual, failure to recognize its most salient traps will automatically lose the race, so pay attention.

SELECT TOP 100
           s_name,
           s_address
     FROM
           supplier,
           nation
    WHERE
           s_suppkey IN 
             ( SELECT  
                       ps_suppkey
                 FROM  
                       partsupp
                WHERE  
                       ps_partkey IN 
                         ( SELECT  
                                   p_partkey
                             FROM  
                                   part
                            WHERE  
                                   p_name LIKE 'forest%'
                         )
                  AND  ps_availqty > 
                         ( SELECT  
                                   0.5 * SUM(l_quantity)
                             FROM  
                                   lineitem
                            WHERE  
                                   l_partkey = ps_partkey
                              AND  l_suppkey = ps_suppkey
                              AND  l_shipdate >= CAST ('1994-01-01' AS DATE)
                              AND  l_shipdate < DATEADD ('year', 1, CAST ('1994-01-01' AS DATE))
                         )
             )
      AND  s_nationkey = n_nationkey
      AND  n_name = 'CANADA'
 ORDER BY  s_name

This identifies suppliers that have parts in stock in excess of half a year's shipments of said part.

The use of IN to denote a join is the first catch. The second is joining to lineitem by hash without building an overly large hash table. We know that IN becomes EXISTS which in turn can become a join as follows:

SELECT 
        l_suppkey 
FROM
        lineitem 
WHERE
        l_partkey IN  
          ( SELECT  
                    p_partkey 
              FROM  
                    part 
             WHERE  
                    p_name LIKE 'forest%'
       )
;

-- is --

SELECT  
        l_suppkey 
  FROM  
        lineitem 
 WHERE  EXISTS  
          ( SELECT  
                    p_partkey 
              FROM  
                    part 
             WHERE  
                    p_partkey = l_partkey 
               AND  p_name LIKE 'forest%')
;

-- is --

SELECT  
        l_suppkey 
  FROM  
        lineitem, 
        ( SELECT  
                    DISTINCT p_partkey 
            FROM  
                    part 
           WHERE  
                    p_name LIKE 'forest%') f 
 WHERE  
        l_partkey = f.p_partkey
;

But since p_partkey is unique, the DISTINCT drops off, and we have:

SELECT  
        l_suppkey 
  FROM  
        lineitem, 
        part 
 WHERE  
        p_name LIKE 'forest%'
   AND  l_partkey = p_partkey
;

You see, the innermost IN with the ps_partkey goes through all these changes, and just becomes a join. The outermost IN stays as a distinct derived table, since ps_suppkey is not unique, and the meaning of IN is not to return a given supplier more than once.
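Put together, Q20 then flattens into roughly the following shape (a sketch; the outermost IN survives as the DISTINCT derived table, here called sk):

SELECT  TOP 100
        s_name,
        s_address
  FROM  supplier,
        nation,
        ( SELECT  DISTINCT ps_suppkey
            FROM  partsupp,
                  part
           WHERE  ps_partkey = p_partkey
             AND  p_name LIKE 'forest%'
             AND  ps_availqty >
                    ( SELECT  0.5 * SUM (l_quantity)
                        FROM  lineitem
                       WHERE  l_partkey = ps_partkey
                         AND  l_suppkey = ps_suppkey
                         AND  l_shipdate >= CAST ('1994-01-01' AS DATE)
                         AND  l_shipdate < DATEADD ('year', 1, CAST ('1994-01-01' AS DATE))
                    )
        ) sk
 WHERE  s_suppkey = sk.ps_suppkey
   AND  s_nationkey = n_nationkey
   AND  n_name = 'CANADA'
ORDER BY  s_name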

The derived table is flattened and the DISTINCT is done partitioned; hence the stage node in front of the distinct. A DISTINCT can be multithreaded, if each thread gets a specific subset of all the keys. The stage node is an exchange of tuples between several threads. Each thread then does a TOP k sort. The TOP k trick we saw in Q18 is used, but does not contribute much here.

{ 
time   8.2e-06% fanout         1 input         1 rows
time   0.00017% fanout         1 input         1 rows
{ hash filler
time   6.1e-05% fanout         1 input         1 rows
NATION         1 rows(.N_NATIONKEY)
 N_NAME =  <c CANADA>
time   1.2e-05% fanout         0 input         1 rows
Sort hf 34 (.N_NATIONKEY)
}
time     0.073% fanout         1 input         1 rows
{ hash filler
time       4.1% fanout    240672 input         1 rows
PART   2.4e+05 rows(t74.P_PARTKEY)
 P_NAME LIKE  <c forest%> LIKE  <c >
time     0.011% fanout         0 input    240672 rows
Sort hf 47 (t74.P_PARTKEY)
}
time      0.69% fanout         1 input         1 rows
{ hash filler
Subquery 56 
{ 
time        42% fanout 1.09657e+06 input         1 rows
LINEITEM   9.1e+07 rows(t76.L_PARTKEY, t76.L_SUPPKEY, t76.L_QUANTITY)
 L_SHIPDATE >= <c 1994-01-01> < <c 1995-01-01>
hash partition+bloom by 54 (tmp)hash join merged always card     0.012 -> ()
time     0.022% fanout         1 input 1.09657e+06 rows
Hash source 47 merged into ts not partitionable     0.012 rows(t76.L_PARTKEY) -> ()
 
After code:
      0: t76.L_PARTKEY :=  := artm t76.L_PARTKEY
      4: t76.L_SUPPKEY :=  := artm t76.L_SUPPKEY
      8: t76.L_QUANTITY :=  := artm t76.L_QUANTITY
      12: BReturn 0
time      0.22% fanout         0 input 1.09657e+06 rows
Sort hf 80 (t76.L_PARTKEY, t76.L_SUPPKEY) -> (t76.L_QUANTITY)
 
}
}
time   2.1e-05% fanout         1 input         1 rows
time   3.2e-05% fanout         1 input         1 rows
{ fork
time       5.3% fanout    240672 input         1 rows
PART   2.4e+05 rows(t6.P_PARTKEY)
 P_NAME LIKE  <c forest%> LIKE  <c >
time       1.9% fanout         4 input    240672 rows
PARTSUPP       1.2 rows(t4.PS_AVAILQTY, t4.PS_PARTKEY, t4.PS_SUPPKEY)
 inlined  PS_PARTKEY = t6.P_PARTKEY
time        16% fanout  0.680447 input    962688 rows
END Node
After test:
      0: { 
time      0.08% fanout         1 input    962688 rows
time       9.4% fanout         1 input    962688 rows
{ fork
time       3.6% fanout         0 input    962688 rows
Hash source 80       0.013 rows(k_t4.PS_PARTKEY, k_t4.PS_SUPPKEY) -> (t8.L_QUANTITY)
 
After code:
      0:  sum sumt8.L_QUANTITYset no set_ctr
      5: BReturn 0
}
 
After code:
      0: temp := artm  0.5  * sum
      4: aggregate :=  := artm temp
      8: BReturn 0
time      0.85% fanout         0 input    962688 rows
Subquery Select(aggregate)
}
 
      8: if (t4.PS_AVAILQTY > scalar) then 12 else 13 unkn 13
      12: BReturn 1
      13: BReturn 0
time         1% fanout         1 input    655058 rows
Stage 2
time     0.071% fanout         1 input    655058 rows
Distinct (q_t4.PS_SUPPKEY)
 
After code:
      0: PS_SUPPKEY :=  := artm t4.PS_SUPPKEY
      4: BReturn 0
time     0.016% fanout         1 input    655058 rows
Subquery Select(PS_SUPPKEY)
time       3.2% fanout 0.0112845 input    655058 rows
SUPPLIER unq     0.075 rows (.S_NAME, .S_NATIONKEY, .S_ADDRESS)
 inlined  S_SUPPKEY = PS_SUPPKEY
hash partition+bloom by 38 (tmp)hash join merged always card      0.04 -> ()
top k on S_NAME
time    0.0012% fanout         1 input      7392 rows
Hash source 34 merged into ts       0.04 rows(.S_NATIONKEY) -> ()
time     0.074% fanout         0 input      7392 rows
Sort (.S_NAME) -> (.S_ADDRESS)
 
}
time   0.00013% fanout       100 input         1 rows
top order by read (.S_NAME, .S_ADDRESS)
time     5e-06% fanout         0 input       100 rows
Select (.S_NAME, .S_ADDRESS)
}


 1777 msec 1355% cpu,    894483 rnd 6.39422e+08 seq   79.1214% same seg   19.3093% same pg 

1.8s is sufficient, and in the ballpark with VectorWise. Some further gain is possible, as the lineitem hash table can also be restricted by supplier; after all, only 1/25 of all suppliers are in the end considered. Further simplifications are possible. Another 20% of time could be saved. The tricks are however quite complex and specific, and there are easier gains to be had -- for example, in reusing intermediates in Q17 and Q15.

The next installment will discuss late projection and some miscellaneous tricks not mentioned so far. After this, we are ready to take an initial look at the performance of the system as a whole.

To be continued...

In Hoc Signo Vinces (TPC-H) Series

03/20/2014 16:03 GMT-0500 Modified: 04/07/2014 12:36 GMT-0500
Linked Geospatial Data 2014 Workshop, Part 4: GeoKnow, London, Brussels, The Message

Last Friday (2014-03-14) I gave a talk about GeoKnow at the EC Copernicus Big Data workshop. This was a trial run for more streamlined messaging. I have, aside from the practice of geekcraft, occupied myself with questions of communication these last weeks.

The clear take-home from London and Brussels alike is that these events have full days and 4 or more talks an hour. It is not quite TV commercial spots yet but it is going in this direction.

If you say something complex, little will get across unless the audience already knows what you will be saying.

I had a set of slides from Jens Lehmann, the GeoKnow project coordinator, for whom I was standing in. Now these are a fine rendition of the description of work. What is wrong with partners, work packages, objectives, etc? Nothing, except everybody has them.

I recall the old story about the journalist and the Zen master: The Zen master repeatedly advises the reporter to cut the story in half. We get the same from PR professionals, "If it is short, they have at least thought about what should go in there," said one recently, talking of pitches and messages. The other advice was to use pictures. And to have a personal dimension to it.

Enter "Ms. Globe" and "Mr. Cube". Frans Knibbe of Geodan gave the Linked Geospatial Data 2014 workshop's most memorable talk entitled "Linked Data and Geoinformatics - a love story" (pdf) about the excitement and the pitfalls of the burgeoning courtship of Ms. Globe (geoinformatics) and Mr. Cube (semantic technology). They get to talking, later Ms. Globe thinks to herself... "Desiloisazation, explicit semantics, integrated metadata..." Mr. Cube, young upstart now approaching a more experienced and sophisticated lady, dreams of finally making an entry into adult society, "critical mass, global scope, relevant applications..." There is a vibration in the air.

So, with Frans Knibbe's gracious permission I borrowed the storyline and some of the pictures.

We ought to make a series of cartoons about the couple. There will be twists and turns in the story to come.

Mr. Cube is not Ms. Globe's first lover, though; there is also rich and worldly Mr. Table. How will Mr. Cube prove himself? The eternal question... Well, not by moping around, not by wise-cracking about semantics, no. By boldly setting out upon a journey to fetch the Golden Fleece from beyond the crashing rocks. "Column store, vectored execution, scale out, data clustering, adaptive schema..." he affirms, with growing confidence.

This is where the story stands right now. Virtuoso runs circles around PostGIS doing aggregations and lookups on geometries in a map-scrolling scenario (GeoKnow's GeoBenchLab). Virtuoso SPARQL outperforms PostGIS SQL against planet-scale OpenStreetMap; Virtuoso SQL goes 5-10x faster still.

Mr. Cube is fast on the draw, but some corners can still be smoothed out.

Later in GeoKnow, there will be still more speed, but also near parity between SQL and SPARQL, by taking advantage of data regularity in guiding physical storage. If it is big, it is bound to have repeating structure.

The love story grows more real by the day. To be consummated still within GeoKnow.

Talking of databases has the great advantage that this has been a performance game from the start. There are few people who need convincing about the desirability of performance, as this also makes for lower cost and more flexibility on the application side.

But this is not all there is to it.

In Brussels, the public was about E-science (Earth observation). In science, it is understood that qualitative aspects can be even more crucial. I told the story about an E-science-oriented workshop I attended in America years ago. The practitioners, from high energy physics to life sciences to climate, had invariably come across the need for self-description of data and for schema-last. This was essentially never provided by RDF, except for some life science cases. Rather, we had one-off schemes, ranging from key-value pairs to putting the table name in a column of the same table to preserve the origin across data export.

Explicit semantics and integrated metadata are important, Ms. Globe knows, but she cannot sacrifice operational capacity for this. So it is more than a DBMS or even data model choice -- there must be a solid tool chain for data integration and visualization. GeoKnow provides many tools in this space.

Some of these, such as the LIMES entity matching framework (pdf) are probably close to the best there is. For other parts, the SQL-based products with hundreds of person years invested in user interaction are simply unbeatable.

In these cases, the world can continue to talk SQL. If the regular part of the data is in fact tables already, so much the better. You connect to Virtuoso via SQL, just like to PostGIS or Oracle Spatial, and talk SQL MM. The triples, in the sense of flexible annotation and integrated metadata, stay there; you just do not see them if you do not want them.

There are possibilities all right. In the coming months I will showcase some of the progress, starting with a detailed look at the OpenStreetMap experiments we have made in GeoKnow.

Linked Geospatial Data 2014 Workshop posts:

03/18/2014 10:45 GMT-0500 Modified: 03/18/2014 10:52 GMT-0500
Linked Geospatial Data 2014 Workshop, Part 3: The Stellar Reach of OKFN

The Open Knowledge Foundation (OKFN) held a London Open Data Meetup in the evening of the first day of the Linked Geospatial Data 2014 workshop. The event was, as they themselves put it, at the amazing open concept office of OKFN at the Center for Creative Collaboration in Central London. What could sound cooler? True, OKFN threw a good party, with ever engaging and charismatic founder Rufus Pollock presiding. Phil Archer noted, only half in jest, that OKFN was so influential, visible, had the ear of government and public alike, etc., that it put W3C to shame.

Now, OKFN is a party in the LOD2 FP7 project, so I have over the years met people from there on and off. In LOD2, OKFN is praised to the skies for its visibility and influence and outreach and sometimes, in passing, critiqued for not publishing enough RDF, let alone five star linked data.

As it happens, CSV rules, and even the W3C will, it appears, undertake to standardize a CSV-to-RDF mapping. As far as I am concerned, as long as there is no alignment of identifiers or vocabulary, whether a thing is CSV or exactly equivalent RDF, there is no great difference, except that CSV is smaller and loads into Excel.

For OKFN, which has a mission of opening data, insisting on any particular format would just hinder the cause.

What do we learn from this? OKFN is praised not only for government relations but also for developer friendliness. Lobbying for open data is something I can understand, but how do you do developer relations? This is not like talking to customers, where the customer wants to do something and it is usually possible to give some kind of advice or recommendation on how they can use our technology for the purpose.

Are JSON and Mongo DB the key? A well renowned database guy once said that to be with the times, JSON is your data model, Hadoop your file system, Mongo DB your database, and JavaScript your language, and failing this, you are an old fart, a legacy suit, well, some uncool fossil.

The key is not limited to JSON. More generally, it is zero time to some result and no learning curve. Some people will sacrifice almost anything for this, such as the possibility of doing arbitrary joins. People will even write code, even lots of it, if it only happens to be in their framework of choice.

Phil again deplored the early fiasco of RDF messaging. "Triples are not so difficult. It is not true that RDF has a very steep learning curve." I would have to agree. The earlier gaffes of the RDF/XML syntax and the infamous semantic web layer cake diagram now lie buried and unlamented; let them be.

Generating user experience from data or schema is an old mirage that has never really worked out. The imagined gain from eliminating application writing has however continued to fascinate IT minds and attempts in this direction have never really ceased. The lesson of history seems to be that coding is not to be eliminated, but that it should have fast turnaround time and immediately visible results.

And since this is the age of data, databases should follow this lead. Schema-last is a good point; maybe adding JSON alongside XML as an object type in RDF would not be so bad. There are already XML functions, so why not the analog for JSON? Just don't mention XML to the JSON folks...

How does this relate to OKFN? Well, in the first instance this is the cultural impression I received from the meetup, but in a broader sense these factors are critical to realizing the full potential of OKFN's successes so far. OKFN is a data opening advocacy group; it is not a domain-specific think tank or special interest group. The data owners and their consultants will do analytics and even data integration if they see enough benefit in this, all in the established ways. However, the widespread opening of data does create possibilities that did not exist before. Actual benefits depend in great part on constant lowering of access barriers, and on a commitment by publishers to keep the data up to date, so that developers can build more than just a one-off mashup.

True, there are government users of open data, since there is a productivity gain in already having the neighboring department's data opened to a point; one no longer has to go through red tape to gain access to it.

For an application ecosystem to keep growing on the base of tens of thousands of very heterogeneous datasets coming into the open, continuing to lower barriers is key. This is a very different task from making faster and faster databases or of optimizing a particular business process, and it demands different thinking.

Linked Geospatial Data 2014 Workshop posts:

03/18/2014 10:45 GMT-0500
Linked Geospatial Data 2014 Workshop, Part 2: Is SPARQL Slow?

I had a conversation with Andy Seaborne of Epimorphics, initial founder of the Jena RDF Framework tool chain and editor of many W3C recommendations, among which the two SPARQLs. We exchanged some news; I told Andy about our progress in cutting the RDF-to-SQL performance penalty and doing more and better SQL tricks. Andy asked me if there were use cases doing analytics over RDF, not in the business intelligence sense, but in the sense of machine learning or discovery of structure. There is, in effect, such work, notably in data set summarization and description. A part of this has to do with learning the schema, like one would if wanting to put triples into tables when appropriate. CWI in LOD2 has worked in this direction, as has DERI (Giovanni Tummarello's team), in the context of giving hints to SPARQL query writers. I would also mention Chris Bizer et al., at University of Mannheim, with their data integration work, which is all about similarity detection in a schema-less world, e.g., the 150M HTML tables in the Common Crawl, briefly mentioned in the previous blog. Jens Lehmann from University of Leipzig has also done work in learning a schema from the data, this time in OWL.

Andy was later on a panel where Phil Archer asked him whether SPARQL was slow by nature or whether this was a matter of bad implementations. Andy answered approximately as follows: "If you allow for arbitrary ad hoc structure, you will always pay something for this. However, if you tell the engine what your data is like, it is no different from executing SQL." This is essentially the gist of our conversation. Most likely we will make this happen via adaptive schema for the regular part and exceptions as quads.

Later I talked with Phil about the "SPARQL is slow" meme. The fact is that Virtuoso SPARQL will outperform or match PostGIS SQL for Geospatial lookups against the OpenStreetMap dataset. Virtuoso SQL will win by a factor of 5 to 10. Still, the SPARQL is slow meme is not entirely without a basis in fact. I would say that the really blatant cases that give SPARQL a bad name are query optimization problems. With 50 triple patterns in a query there are 50-factorial ways of getting a bad plan. This is where the catastrophic failures of 100+ times worse than SQL come from. The regular penalty of doing triples vs tables is somewhere between 2.5 (Star Schema Benchmark) and 10 (lookups with many literals), quite acceptable for many applications. Some really bad cases can occur with regular expressions on URI strings or literals, but then, if this is the core of the application, it should use a different data model or an n-gram index.

The solutions, including more dependable query plan choice, will flow from adaptive schema which essentially reduces RDF back into relational, however without forcing schema first and with accommodation for exceptions in the data.

Phil noted here that there already exist many (so far, proprietary) ways of describing the shape of a graph. He said there would be a W3C activity for converging these. If so, a vocabulary that can express relationships, the types of related entities, their cardinalities, etc., comes close to a SQL schema and its statistics. Such a thing can be the output of data analysis, or the input to a query optimizer or storage engine, for using a schema where one in fact exists. Like this, there is no reason why things would be less predictable than with SQL. The idea of a re-convergence of data models is definitely in the air; this is in no sense limited to us.

Linked Geospatial Data 2014 Workshop posts:

03/18/2014 10:45 GMT-0500
Linked Geospatial Data 2014 Workshop, Part 1: Web Services or SPARQL Modeling?

The W3C (World Wide Web Consortium) and OGC (Open Geospatial Consortium) organized the Linked Geospatial Data 2014 workshop in London this week. The GeoKnow project was represented by Claus Stadler of Universität Leipzig, and Hugh Williams and myself (Orri Erling) from OpenLink Software. The Open Knowledge Foundation (OKFN) also held an Open Data Meetup in the evening of the first day of the workshop.

Reporting on each talk and the many highly diverse topics addressed is beyond the scope of this article; for this you can go to the program and the slides that will be online. Instead, I will talk about questions that to me seemed to be in the air, and about some conversations I had with the relevant people.

The trend in events like this is towards shorter and shorter talks and more and more interaction. In this workshop, talks were given in series of three talks with all questions at the end, with all the presenters on stage. This is not a bad idea since we get a panel-like effect where many presenters can address the same question. If the subject matter allows, a panel is my preferred format.

Web services or SPARQL? Is GeoSPARQL good? Is it about Linked Data or about ontologies?

Geospatial data tends to be exposed via web services, e.g., WFS (Web Feature Service). This allows item retrieval on a lookup basis and some predefined filtering, transformation, and content negotiation. Capabilities vary; OGC now has WFS 2.0, and there are open source implementations that do a fair job of providing the functionality.

Of course, a real query language is much more expressive, but a service API is more scalable, as people say. What they mean is that an API is more predictable. For pretty much any complex data task, a query language is near-infinitely more efficient than going back-and-forth, often on a wide area network, via an API. So, as Andreas Harth put it: for data publishers, make an API; an open SPARQL endpoint is too "brave," [Andreas' word, with the meaning of foolhardy]. When you analyze, he continued, you load the data into an endpoint, but you use your own. Any quality-of-service terms must be formulated with respect to a fixed workload; this is not meaningful with ad hoc queries in an expressive language. Things like anytime semantics (return whatever is found within a time limit) are only good for a first interactive look, not for applications.

Should the application go to the data or the reverse? Some data is big and moving it is not self-evident. A culture of datasets being hosted on a cloud may be forming. Of course some linked data like DBpedia has for a long time been available as Amazon images. Recently, SindiceTech has made a similar packaging of Freebase. The data of interest here is larger and its target audience is more specific, on the e-science side.

How should geometries be modeled? I met GeoSPARQL, and the SQL MM on which it is based, with a sense of relief, as these are reasonable things that can be efficiently implemented. There are proposals where points have URIs, and linestrings are ordered sets of points, and collections are actual trees with RDF subjects as nodes. As a standard, such a thing is beyond horrible, as it hits all the RDF penalties and overheads full force, and promises easily 10x worse space consumption and 100x worse run times compared to the sweetly reasonable GeoSPARQL. One presenter said that cases of actually hanging attributes off points of complex geometries had been heard of but were, in his words, anecdotal. He posed a question to the audience about use cases where points in fact needed separately addressable identities. Several cases did emerge, involving, for example, different measurement certainties for different points on a trajectory trace obtained by radar. Applications that need data of this sort will perforce be very domain specific. OpenStreetMap (OSM) itself is a bit like this, but there the points that have individual identity also have predominantly non-geometry attributes and stand for actually-distinct entities. OSM being a practical project, these are then again collapsed into linestrings for cases where this is more efficient. The OGC data types themselves have up to 4 dimensions, of which the 4th could be used as an identifier of a point in the event this really were needed. If so, this would likely be empty for most points and would compress away if the data representation were done right.

For data publishing, Andreas proposed to give OGC geometries URIs, i.e., the borders of a country can be more or less precisely modeled, and the large polygon may have different versions and provenances. This is reasonable enough, as long as the geometries are big. For applications, one will then collapse the 1:n between entity and its geometry into a 1:1. In the end, when you make an application, even an RDF one, you do not just throw all the data in a bucket and write queries against that. Some alignment and transformation is generally involved.

03/18/2014 10:45 GMT-0500
In Hoc Signo Vinces (part 9 of n): TPC-H Q18, Ordered Aggregation, and Top K

We will here return to polishing the cutting edge, the high geekcraft of database. We will look at more of the wonders of TPC-H and cover two more tricks. The experts can skip the preliminaries and go to the query profiles; for the others, there is some explanation first.

From the TPC-H specification:

    SELECT  TOP 100
                     c_name,
                     c_custkey,
                     o_orderkey,
                     o_orderdate,
                     o_totalprice,
                     SUM ( l_quantity )
     FROM  customer,
           orders,
           lineitem
    WHERE  o_orderkey 
             IN 
               (
                  SELECT  l_orderkey
                    FROM  lineitem
                GROUP BY  l_orderkey 
                            HAVING
                              SUM ( l_quantity ) > 312
               )
      AND  c_custkey = o_custkey
      AND  o_orderkey = l_orderkey
 GROUP BY  c_name,
           c_custkey,
           o_orderkey,
           o_orderdate,
           o_totalprice
 ORDER BY  o_totalprice DESC, 
           o_orderdate 

The intent of the query is to return order and customer information for cases where an order involves a large quantity of items, with highest-value orders first.

We note that the only restriction in the query is the one on the SUM of l_quantity in the IN subquery. Everything else is a full scan or a JOIN on a foreign key.

Now, the first query optimization rule of thumb could be summarized as start from the small. Small here means something that is restricted; it does not mean small table. Smallest is the one from which the highest percentage is dropped via a condition that does not depend on other tables.

The next rule of thumb is to try starting from the large, if the large has a restricting join; for example, scan all the lineitems and hash join to parts that are green and of a given brand. In this case, the idea is to make a hash table from the small side and sequentially scan the large side, dropping everything that does not match something in the hash table.
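
For concreteness, a hypothetical query of this shape (the p_brand constant is just an illustrative value; the rest is standard TPC-H) would typically be executed by building a hash table from the few qualifying parts and scanning lineitem against it:

-- Build side: the restricted part rows; probe side: the full scan of lineitem
SELECT  SUM ( l_extendedprice * (1 - l_discount) )
  FROM  lineitem,
        part
 WHERE  l_partkey = p_partkey
   AND  p_name LIKE '%green%'
   AND  p_brand = 'Brand#23'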

The only restriction here is on orders via a join on lineitem. So, the IN subquery can be flattened, so as to read like --

SELECT ... 
  FROM  (   SELECT  l_orderkey, 
                    SUM ( l_quantity ) 
              FROM  lineitem 
          GROUP BY  l_orderkey 
                      HAVING
                        SUM ( l_quantity ) > 312
        ) f, 
          orders, 
          customer, 
          lineitem 
 WHERE  f.l_orderkey = o_orderkey ....
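
Spelled out in full, with the elided parts filled in from the original query above, the flattened form reads:

SELECT  TOP 100
                 c_name,
                 c_custkey,
                 o_orderkey,
                 o_orderdate,
                 o_totalprice,
                 SUM ( l_quantity )
  FROM  (   SELECT  l_orderkey
              FROM  lineitem
          GROUP BY  l_orderkey
                      HAVING
                        SUM ( l_quantity ) > 312
        ) f,
        orders,
        customer,
        lineitem
 WHERE  f.l_orderkey = o_orderkey
   AND  c_custkey = o_custkey
   AND  o_orderkey = l_orderkey
GROUP BY  c_name,
          c_custkey,
          o_orderkey,
          o_orderdate,
          o_totalprice
ORDER BY  o_totalprice DESC,
          o_orderdate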

The above (left to right) is the best JOIN order for this type of plan. We start from the restriction, and for all the rest the JOIN is foreign key to primary key, sometimes n:1 (orders to customer), sometimes 1:n (orders to lineitem). A 1:n is usually best by index; an n:1 can be better by hash if there are enough tuples on the n side to make it worthwhile to build the hash table.

We note that the first GROUP BY makes a very large number of groups, e.g., 150M at the 100 GB scale. We also note that if lineitem is ordered so that the lineitems of a single order are together, the GROUP BY is ordered. In other words, once you have seen a specific value of l_orderkey change to the next, you will not see the old value again. In this way, the groups do not have to be remembered for all time. The GROUP BY produces a stream of results as the scan of lineitem proceeds.

Considering vectored execution, the GROUP BY does remember a batch of groups, up to a vector's worth, so that output from the GROUP BY is produced in large enough batches rather than a tuple at a time.

Considering parallelization, the scan of lineitem must be split in such a way that all lineitems with the same l_orderkey get processed by the same thread. If this is the case, all threads will produce an independent stream of results that is guaranteed to need no merge with the output of another thread.

So, we can try this:

{ 
time     6e-06% fanout         1 input         1 rows
time       4.5% fanout         1 input         1 rows
{ hash filler

-- Make a hash table from c_custkey to c_name

time      0.99% fanout   1.5e+07 input         1 rows
CUSTOMER   1.5e+07 rows(.C_CUSTKEY, .C_NAME)
 
time      0.81% fanout         0 input   1.5e+07 rows
Sort hf 35 (.C_CUSTKEY) -> (.C_NAME)
 
}
time   2.2e-05% fanout         1 input         1 rows
time   1.6e-05% fanout         1 input         1 rows
{ fork
time   5.2e-06% fanout         1 input         1 rows
{ fork

-- Scan lineitem

time        10% fanout 6.00038e+08 input         1 rows
LINEITEM     6e+08 rows(t5.L_ORDERKEY, t5.L_QUANTITY)

-- Ordered GROUP BY (streaming with duplicates)

time        73% fanout 1.17743e-05 input 6.00038e+08 rows
Sort streaming with duplicates (t5.L_ORDERKEY) -> (t5.L_QUANTITY)

-- The ordered aggregation above emits a batch of results every so often, having accumulated 20K or so groups (DISTINCT l_orderkey's)

-- The operator below reads each batch and sends it onward, freeing the GROUP BY hash table for the next batch.

time        10% fanout   21231.4 input      7065 rows
group by read node 
(t5.L_ORDERKEY, aggregate)
END Node
After test:
      0: if (aggregate >  312 ) then 4 else 5 unkn 5
      4: BReturn 1
      5: BReturn 0
 
After code:
      0: L_ORDERKEY :=  := artm t5.L_ORDERKEY
      4: BReturn 0

-- This marks the end of the flattened IN subquery. 1063 out of 150M groups survive the test on the SUM of l_quantity.

-- The main difficulty of Q18 is guessing that this condition is this selective.

time    0.0013% fanout         1 input      1063 rows
Subquery Select(L_ORDERKEY)
time     0.058% fanout         1 input      1063 rows
ORDERS unq      0.97 rows (.O_CUSTKEY, .O_ORDERKEY, .O_ORDERDATE, .O_TOTALPRICE)
 inlined  O_ORDERKEY = L_ORDERKEY
hash partition+bloom by 42 (tmp)hash join merged always card      0.99 -> (.C_NAME)
time    0.0029% fanout         1 input      1063 rows
Hash source 35 merged into ts       0.99 rows(.O_CUSTKEY) -> (.C_NAME)
 
After code:
      0: .C_CUSTKEY :=  := artm .O_CUSTKEY
      4: BReturn 0
time     0.018% fanout         7 input      1063 rows
LINEITEM       4.3 rows(.L_QUANTITY)
 inlined  L_ORDERKEY = .O_ORDERKEY
time     0.011% fanout         0 input      7441 rows
Sort (.C_CUSTKEY, .O_ORDERKEY) -> (.L_QUANTITY, .O_TOTALPRICE, .O_ORDERDATE, .C_NAME)
 
}
time   0.00026% fanout      1063 input         1 rows
group by read node  
(.C_CUSTKEY, .O_ORDERKEY, aggregate, .O_TOTALPRICE, .O_ORDERDATE, .C_NAME)
time   0.00061% fanout         0 input      1063 rows
Sort (.O_TOTALPRICE, .O_ORDERDATE) -> (.C_NAME, .C_CUSTKEY, .O_ORDERKEY, aggregate)
 
}
time   1.7e-05% fanout       100 input         1 rows
top order by read (.C_NAME, .C_CUSTKEY, .O_ORDERKEY, .O_ORDERDATE, .O_TOTALPRICE, aggregate)
time   1.2e-06% fanout         0 input       100 rows
Select (.C_NAME, .C_CUSTKEY, .O_ORDERKEY, .O_ORDERDATE, .O_TOTALPRICE, aggregate)
}


 6351 msec 1470% cpu,      2151 rnd 6.14898e+08 seq  0.185874% same seg   1.57993% same pg 

What is wrong with this? The result is not bad, in the ballpark of published VectorWise results (4.9s on a slightly faster box), but better is possible. We note that there is a hash join from orders to customer. Only 1K customers of 15M get hit, so the whole hash table of 15M entries is built in vain. Let's cheat and declare the join to be by index. Cheats like this are not allowed in an official run, but here we are just looking. So we change the mention of the customer table in the FROM clause from FROM ... customer, ... to FROM ... customer TABLE OPTION (loop), ...
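
In query text, only the FROM clause changes; everything else stays as written:

 FROM  customer TABLE OPTION (loop),
       orders,
       lineitem

TABLE OPTION (loop) is the hint mentioned above that forces the join to customer to be done by index lookup instead of by building a hash table.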

{ 
time   1.4e-06% fanout         1 input         1 rows
time     9e-07% fanout         1 input         1 rows

-- Here was the hash build in the previous plan; now we start direct with the scan of lineitem.

time   2.2e-06% fanout         1 input         1 rows
{ fork
time   2.3e-06% fanout         1 input         1 rows
{ fork
time        11% fanout 6.00038e+08 input         1 rows
LINEITEM     6e+08 rows(t5.L_ORDERKEY, t5.L_QUANTITY)
 
time        78% fanout 1.17743e-05 input 6.00038e+08 rows
Sort streaming with duplicates (t5.L_ORDERKEY) -> (t5.L_QUANTITY)
 
time        11% fanout   21231.4 input      7065 rows
group by read node  
(t5.L_ORDERKEY, aggregate)
END Node
After test:
      0: if (aggregate >  312 ) then 4 else 5 unkn 5
      4: BReturn 1
      5: BReturn 0
 
After code:
      0: L_ORDERKEY :=  := artm t5.L_ORDERKEY
      4: BReturn 0
time    0.0014% fanout         1 input      1063 rows
Subquery Select(L_ORDERKEY)
time     0.051% fanout         1 input      1063 rows
ORDERS unq      0.97 rows (.O_CUSTKEY, .O_ORDERKEY, .O_ORDERDATE, .O_TOTALPRICE)
 inlined  O_ORDERKEY = L_ORDERKEY

-- We note that getting the 1063 customers by index takes no time, and there is no hash table to build

time     0.023% fanout         1 input      1063 rows
CUSTOMER unq      0.99 rows (.C_CUSTKEY, .C_NAME)
 inlined  C_CUSTKEY = .O_CUSTKEY
time     0.021% fanout         7 input      1063 rows
LINEITEM       4.3 rows(.L_QUANTITY)
 inlined  L_ORDERKEY = k_.O_ORDERKEY

-- The rest is identical to the previous plan, cut for brevity

 3852 msec 2311% cpu,      3213 rnd 5.99907e+08 seq  0.124456% same seg   1.08899% same pg 
Compilation: 1 msec 0 reads         0% read 0 messages         0% clw

We save over 2s of real time. But the problem is how to know that very few customers will be hit. One could make a calculation that l_quantity is between 1 and 50, and that an order has an average of 4 lineitems with a maximum of 7. For the SUM to be over 312, only orders with 7 lineitems are eligible, and even so the l_quantities must all be high. Assuming flat distributions, which here happens to be the case, one could estimate that the condition selects very few orders. The problem is that real data is hardly ever this regular, so such a trick, while allowed, would really only work for benchmarks.
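
For reference, the actual selectivity is easy to check after the fact with a probe like the one below, which should return the 1063 qualifying orders seen in the profiles above; the optimizer's difficulty is that it must guess this number before running anything.

SELECT  COUNT (*)
  FROM  (   SELECT  l_orderkey
              FROM  lineitem
          GROUP BY  l_orderkey
                      HAVING
                        SUM ( l_quantity ) > 312
        ) f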

*           *           *           *           *

As it happens, there is a better way. We also note that the query selects the TOP 100 orders with the highest o_totalprice. This is a very common pattern; there is almost always a TOP k clause in analytics queries unless they GROUP BY something that is known to be of low cardinality, like nation or year.

If the ordering falls on a grouping column, as soon as there are enough groups generated to fill a TOP 100, one can take the lowest o_totalprice as a limit and add this into the query as an extra restriction. Every time the TOP 100 changes, the condition becomes more selective, as the 100th highest o_totalprice increases.

Sometimes the ordering falls on the aggregation result, which is not known until the aggregation is finished, so the cutoff cannot be fed back into the scan. However, in lookup-style queries, it is common to take the latest so-many events, or the TOP k items by some metric that is stored as a column; in these cases, pushing the TOP k restriction down into the selection always works.
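
To make the idea concrete: once the TOP 100 is full, the engine behaves roughly as if the scan of orders had gained an extra predicate like the one below. The constant is a made-up placeholder standing for the running 100th highest o_totalprice; it is maintained internally by the TOP k operator, not written by the user, and it keeps rising as better rows are found.

SELECT  TOP 100
        o_orderkey,
        o_totalprice
  FROM  orders
 WHERE  o_totalprice > 400000   -- hypothetical current cutoff
ORDER BY  o_totalprice DESC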

So, we try this:

{ 
time     4e-06% fanout         1 input         1 rows
time   6.1e-06% fanout         1 input         1 rows
{ fork

-- The plan begins with orders, as we now expect a selection on o_totalprice

-- We see that out of 150M orders, a little over 10M survive the o_totalprice selection, which gets more restrictive as the query proceeds.

time        33% fanout 1.00628e+07 input         1 rows
ORDERS   4.3e+04 rows(.O_TOTALPRICE, .O_ORDERKEY, .O_CUSTKEY, .O_ORDERDATE)
 
top k on O_TOTALPRICE
time        32% fanout 3.50797e-05 input 1.00628e+07 rows
END Node
After test:
      0: if ({ 

-- The IN subquery is here kept as a subquery, not flattened.

time      0.42% fanout         1 input 1.00628e+07 rows
time        11% fanout   4.00136 input 1.00628e+07 rows
LINEITEM         4 rows(.L_ORDERKEY, .L_QUANTITY)
 inlined  L_ORDERKEY = k_.O_ORDERKEY
time        21% fanout 2.55806e-05 input 4.02649e+07 rows
Sort streaming with duplicates (set_ctr, .L_ORDERKEY) -> (.L_QUANTITY)
 
time       2.4% fanout   9769.72 input      1030 rows
group by read node  
(gb_set_no, .L_ORDERKEY, aggregate)
END Node
After test:
      0: if (aggregate >  312 ) then 4 else 5 unkn 5
      4: BReturn 1
      5: BReturn 0
time   0.00047% fanout         0 input       353 rows
Subquery Select(  )
}
) then 4 else 5 unkn 5
      4: BReturn 1
      5: BReturn 0

  

-- Here we see that fewer customers are accessed than in the non-TOP k plans, since there is an extra cut on o_totalprice that takes effect earlier

time     0.013% fanout         1 input       353 rows
CUSTOMER unq         1 rows (.C_CUSTKEY, .C_NAME)
 inlined  C_CUSTKEY = k_.O_CUSTKEY
time    0.0079% fanout         7 input       353 rows
LINEITEM         4 rows(.L_QUANTITY)
 inlined  L_ORDERKEY = k_.O_ORDERKEY
time    0.0063% fanout 0.0477539 input      2471 rows
Sort streaming with duplicates (.C_CUSTKEY, .O_ORDERKEY) -> (.L_QUANTITY, .O_TOTALPRICE, .O_ORDERDATE, .C_NAME)
 
time    0.0088% fanout   2.99153 input       118 rows
group by read node  
(.C_CUSTKEY, .O_ORDERKEY, aggregate, .O_TOTALPRICE, .O_ORDERDATE, .C_NAME)
time    0.0063% fanout         0 input       353 rows
Sort (.O_TOTALPRICE, .O_ORDERDATE) -> (.C_NAME, .C_CUSTKEY, .O_ORDERKEY, aggregate)
 
}
time   8.5e-05% fanout       100 input         1 rows
top order by read (.C_NAME, .C_CUSTKEY, .O_ORDERKEY, .O_ORDERDATE, .O_TOTALPRICE, aggregate)
time   2.7e-06% fanout         0 input       100 rows
Select (.C_NAME, .C_CUSTKEY, .O_ORDERKEY, .O_ORDERDATE, .O_TOTALPRICE, aggregate)
}


 949 msec 2179% cpu, 1.00486e+07 rnd 4.71013e+07 seq   99.9267% same seg 0.0318055% same pg 

Here we see that the time is about 4x better than with the cheat version. We note that about 10M of the 150M orders get considered. After going through the first 10% or so of orders, the TOP 100 is full, and a condition on o_totalprice that will drop most orders can be introduced.

If we set the condition on the SUM of quantity so high that no orders match, the TOP 100 never fills, no o_totalprice cutoff is ever introduced, and we get a time of 6.8s, which is a little worse than the initial time with the flattened IN. But since the TOP k trick does not allocate memory, it is relatively safe even in cases where it does not help.

We can argue that the TOP k pushdown trick is more robust than guessing the selectivity of a SUM of l_quantity. Further, it applies to a broad range of lookup queries, while the SUM trick applies to only TPC-H Q18, or close enough. Thus, the TOP k trick is safer and more generic.

We are approaching the end of the TPC-H blog series, with still two families of tricks to consider, namely, moving predicates between subqueries, and late projection. After this we will look at results and the overall picture.

To be continued...

03/17/2014 12:41 GMT-0500 Modified: 04/07/2014 12:36 GMT-0500
LOD2 Plenary and Open Data Meet-up in Mannheim

There was a LOD2 plenary meeting hosted by Chris Bizer from the University of Mannheim this week.

The plenary meeting was preceded by a Linked Open Data Meetup with talks from Springer, fluid Operations, and several LOD2 partners (Universität Leipzig, University of Mannheim, the Semantic Web Company, and Wolters Kluwer Deutschland GmbH (WKD)).

Wolters Kluwer Deutschland GmbH (WKD) gave a presentation on the content production pipeline of their legal publications and their experiences in incorporating LOD2 technologies for content enrichment. This is a very successful LOD2 use case and demonstrates the value of linked data for the information industry.

Springer gave a talk about their interest in linked data for enriching the Lecture Notes in Computer Science product. Conference proceedings could also be enhanced with structured metadata in RDF. I asked about nanopublications. The comment was that content authors might perceive nanopublications as an extra imposition. On the other hand, in the life sciences field there is a lot of enthusiasm for the idea. We will see; anyway, biology will likely lead the way for nanopublications. I referred Aliaksandr Birukou of Springer to Euretos and its parent company S&T in Delft, Netherlands, and to Barend Mons, scientific director of NBIC, the Netherlands Bioinformatics Centre. These are among the founding fathers of the Nano Republic, as they themselves put it.

Sebastian Hellman gave a talk on efforts to set up the DBpedia Foundation as a not-for-profit organization, hopefully in the next 10 days, to aid in the sustainability and growth of the DBpedia project. The Foundation would identify stakeholders, their interests, and ways to generate income to further improve DBpedia. Planned areas of improvement include the development of high-availability value-added DBpedia services with quality of service (QoS) agreements for enterprise users; additional tools in the DBpedia stack to support improved and cost-efficient data curation and internationalization; and improved documentation, tutorials, and support to speed uptake.

I had a word with Peter Haase of fluid Operations about the Optique project and their cloud management offerings. The claim is to do ontology-directed querying over thousands of terabytes of heterogeneous data. This turns out to be a full-force attempt at large-scale SQL federation with ontology-directed query rewriting covering OWL 2 QL semantics. With Ian Horrocks of Oxford leading the ontology side, the matter is in good hands. Still, the matter is not without its problems. Simple lookups can be directed to the data, but if there are terabytes of it, it is more likely that aggregations are what is desired. Federated aggregation tends to move a lot of data. So the problems are as they ever were. However, if the analytics are already done and stored in the relational space, finding these based on ontologies is a worthwhile thing for streamlining end-user access to information.

The LOD2 plenary itself was structured in the usual way, covering the work packages in two parallel tracks.

LOD2 Plenary Group Photo, Mannheim, February 2014

On the database side, the final victory will be won by going to adaptive schema for RDF. We brought the RDF penalty against relational to a factor of 2.5 for common analytics style queries, e.g., Star Schema Benchmark. This is a comparison to Virtuoso SQL, which offers very high performance in this workload, over 2x the speed of column store pioneer MonetDB and 300x MySQL. So this is where matters stand. To move them significantly forward, exploitation of structure for guiding physical storage will be needed. Also the project still has to deliver the 500 Gtriple results. The experiments around Christmas at CWI support the possibility, but they are not final. Putting triples into tables when the triples in fact form table-shaped structures, which is the case most of the time, may turn out to be necessary for this. At least, this will be a significant help.

Be that as it may, using a table schema for regularly shaped data, while preserving the RDF quad flexibility, would essentially abolish the RDF tax and bring the LOD2 project to a glorious conclusion in August.

I took the poetic license to compare the data's journey into RDF and back to the Egyptian myth of Osiris: The data gets shut in a silo, then gets cut into 14 pieces and thrown into the Nile (i.e., the LOD cloud, or the CKAN catalog). Grief-stricken Isis sees what has become of her love: She patiently reassembles the pieces, reconstructing Osiris in fact so well that he sires her a child, hawk-headed Horus, who proceeds to reclaim his father’s honor. (See, Isis means Intelligent Structured Information Storage.)

I had many interesting conversations with Chris Bizer about his research in data integration, working with the 150M HTML tables in the Common Crawl. The idea is to resolve references and combine data from the tables. Interestingly enough, the data model in these situations is basically triples, though these are generally not stored as RDF but in Lucene, which makes sense given the string-matching nature of the task. There appears to be an opportunity in bringing together the state of the art in database, meaning the very highly optimized column store and vectored execution in Virtuoso, with the search-style workload found in instance matching and other data integration tasks. The promise goes in the direction of very fast ETL and subsequent discovery of structural commonalities and enrichment possibilities. This is also not infinitely far from the schema discovery that one may do in order to adaptively optimize storage based on the data.

Volha Bryl gave a very good overview of the Mannheim work in the data integration domain. For example, learning data fusion rules from examples of successful conflict resolution seems very promising. Learning text extraction rules from examples is also interesting. The problem of data integration is that the tasks are very heterogeneous, and therefore data integration suites have very large numbers of distinct tools. This is labor intensive, but there is progress in automation. An error-free, or nearly error-free, data product remains a case-by-case effort with human curation, but automatic methods seem, based on Volha’s and Chris’ presentations, to be in the ballpark for statistical use.

Giovanni Tummarello of Insight/SindiceTech, always the life of the party, presented his Solr-based relational faceted browser. The idea is to show and drill down by facets over a set of related tables; in the demo, this was investments, investment targets, and investors. You can look at the data from any of the points and restrict the search based on attributes of any. Well, this is what a database does, right? That is so, but the Sindice tool is on top of Solr and actually materializes joins into a document. This blows up the data but has all the things colocated so it can even run from disk. We also talked about the Knowledge Graph package Sindice offers on the Google cloud, this time a Virtuoso application.

We hope that negotiations between SindiceTech and Insight (formerly DERI) around open sourcing the SPARQL editor and other items come to a successful conclusion. The SPARQL editor especially would be of general interest to the RDF community. It is noteworthy that there is no SPARQL query builder in common use out there (even OpenLink's own open source iSPARQL has been largely (but not entirely!) overlooked and misunderstood, though it's been available as part of the OpenLink Ajax Toolkit for several years). OK, a query builder is useful when there is schema. But if the schema is an SQL one, as will be the case if RDF is adaptively stored, then any SQL query builder can be applied to the regular portion of the data. 40 years of calendar time and millennia of person years have gone into making SQL front ends and these will become applicable overnight; Virtuoso does speak SQL, as you may know.

I had the breakout session about the database work in LOD2. What will be done is clear enough, the execution side is very good, and our coverage of the infinite space of query optimization continues to grow. One more revolution for storage may come about, as suggested above. There is not very much to discuss, just to execute. So I used the time to explain how you run

SELECT  SUM ( l_extendedprice ) 
  FROM  lineitem
     ,  part 
 WHERE  l_partkey = p_partkey 
   AND  p_name LIKE '%green%'

Simple query, right? Sure, but application guys or sem-heads generally have no clue about how these in fact need to be done. I have the probably foolish belief that a little understanding of database, especially in the RDF space which does get hit by every query optimization problem, would be helpful. At least one would know what goes wrong. So I explained to Giovanni, who is in fact a good geek, that this is a hash join, and with only a little prompting he suggested that you should also put a Bloom filter in front of the hash. Good. So in the bar after dinner I was told I ought to teach. Maybe. But the students would have to be very fast and motivated. Anyway, the take-home message is that the DBMS must figure it out. In the SQL space this is easier, and of course, if most of RDF reduces to this, then RDF too will be more predictable in this department.

I talked with Martin Kaltenböck of the Semantic Web Company about his brilliant networking accomplishments around organizing the European Data Forum and other activities. Martin is a great ambassador and lobbyist for linked data across Europe. Great work, also in generating visibility for LOD2.

The EU in general, thanks in great part to Stefano Bertolo’s long-term push in this direction, is putting increasing emphasis on measuring progress in the research it funds. This is one of the messages from the LOD2 review also. Database is the domain of the performance race par excellence; the matters on that side are well attended to by LDBC and, of course, the unimpeachably authoritative TPC, among others. In other domains, measurement is harder, as it involves a human-curated ground truth for any extraction, linking, or other integration. There is good work in both Mannheim and Leipzig in these areas, and I may at some point take a closer look, but for now it is appropriate to stick to core database.

02/26/2014 15:38 GMT-0500 Modified: 02/27/2014 10:57 GMT-0500
Initiating V7 Fast Track: "File Tables" feature now in Virtuoso Open Source!

The "file tables" feature first introduced in the TPC-H bulk load article is now available on the v7fasttrack clone of the Virtuoso repository on GitHub.

To check out —

$ git clone https://github.com/v7fasttrack/virtuoso-opensource.git v7fasttrack 

The v7fasttrack tree compiles just like the main Virtuoso tree. Its content is substantially the same today, except for the file tables feature. The remaining diff against the main tree consists mostly of whitespace, since the Fast Track tree is automatically indented with the Linux indent utility each time it is updated, and the Virtuoso tree is not.

Ongoing maintenance and previews of new features will be added to this tree as and when they become available.

Let's now look at the "file tables" feature.

You can use any CSV file like a table, as described in the documentation. The TPC-H data generator (complete source ZIP; ZIP of just the dbgen source) is a convenient place to start to try things out.

To generate the qualification database, run —

dbgen -s 1

This makes 8 CSV files called *.tbl. You can use these scripts to load them into Virtuoso —

To verify the load, do —

SELECT  COUNT (*) 
  FROM  lineitem_f
;

SELECT  COUNT (*) 
  FROM  lineitem
;

To try different combinations of tables and CSV files, you can, for example, do —

SELECT  COUNT (*) 
  FROM  lineitem, 
        part_f 
 WHERE  l_partkey = p_partkey 
   AND  p_name LIKE '%green%'
;

This counts shipments of green parts, using the file as the part table. You can then replace the part_f with part to join against the database. The database will be a little faster but the file is also pretty fast since the smaller table (part) is on the build side of a hash join and the scan of lineitem is the same in either case.

You can now replace lineitem with lineitem_f and you will see a larger difference. This is still reasonably fast since the lineitem file is scanned in parallel.
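
For instance, the variant that reads both sides from files is:

SELECT  COUNT (*) 
  FROM  lineitem_f, 
        part_f 
 WHERE  l_partkey = p_partkey 
   AND  p_name LIKE '%green%'
;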

You can try the different TPC-H queries against tables and files. To get the perfect plans, you will need the analytics branch, which will be made available shortly via this same GitHub channel.

You can also try RDFizing the files using the scripts in the Enterprise Linked Data article from earlier this year. The qualification database should go in about 15 minutes on a commodity server and make some 120M triples. In the article, the data came from another server, but it can just as well come from files. These two scripts from that article have been adapted for loading from files —

To try this, execute the following commands in iSQL —

LOAD sql_rdf.sql;

RDF_VIEW_SYNC_TO_PHYSICAL
  ( 'http://example.com/tpcd'
  , 1
  , 
  , "urn:example.com:tpcd"
  , 2
  , 0
  )
;

To verify the result —

sparql 
    SELECT  ?p 
            COUNT (*) 
     WHERE  { ?s  ?p  ?o } 
  GROUP BY  ?p 
  ORDER BY  DESC 2
;
02/20/2014 14:43 GMT-0500