Details
Virtuoso Data Space Bot
Burlington, United States
Subscribe
Post Categories
Recent Articles
Display Settings
|
Benchmarks, Redux (part 9): BSBM With Cluster
This post is dedicated to our brothers in horizontal partitioning (or sharding), Garlik and Bigdata.
At first sight, the BSBM Explore mix appears very cluster-unfriendly, as it contains short queries that access data at random. There is every opportunity for latency and few opportunities for parallelism.
For this reason we had not even run the BSBM mix with Virtuoso Cluster. We were not surprised to learn that Garlik hadn't run BSBM either. We have understood from Systap that their Bigdata BSBM experiments were on a single-process configuration.
But the 4Store results in the recent Berlin report were with a distributed setup, as 4Store always runs a multiprocess configuration, even on a single server, so it seemed interesting to us to compare how Virtuoso Cluster compares with Virtuoso Single with this workload. These tests were run on a different box than the recent BSBM tests, so those 4Store figures are not directly comparable.
The setup here consists of 8 partitions, each managed by its own process, all running on the same box. Any of these processes can have its HTTP and SQL listener and can provide the same service. Most access to data goes over the interconnect, except when the data is co-resident in the process which is coordinating the query. The interconnect is Unix domain sockets since all 8 processes are on the same box.
6 Cluster - Load Rates and Times |
Scale |
Rate (quads per second) |
Load time (seconds) |
Checkpoint time (seconds) |
100 Mt |
119,204 |
749 |
89 |
200 Mt |
121,607 |
1486 |
157 |
1000 Mt |
102,694 |
8737 |
979 |
6 Single - Load Rates and Times |
Scale |
Rate (quads per second) |
Load time (seconds) |
Checkpoint time (seconds) |
100 Mt |
74,713 |
1192 |
145 |
The load times are systematically better than for 6 Single. This is also not bad compared to the 7 Single vectored load rates of 220 Kt/s or so. We note that loading is a cluster friendly operation, going at a steady 1400+% CPU utilization with an aggregate message throughput of 40MB/s. 7 Single is faster because of vectoring at the index level, not because the clusters were hitting communication overheads. 6 Cluster is faster than 6 Single because scale-out in this case diminishes contention, even on a single box.
Throughput is as follows:
6 Cluster - Throughput (QMpH, query mixes per hour) |
Scale |
Single User |
16 User |
100 Mt |
7318 |
43120 |
200 Mt |
6222 |
29981 |
1000 Mt |
2526 |
11156 |
6 Single - Throughput (QMpH, query mixes per hour) |
Scale |
Single User |
16 User |
100 Mt |
7641 |
29433 |
200 Mt |
6017 |
13335 |
1000 Mt |
1770 |
2487 |
Below is a snapshot of status during the 6 Cluster 100 Mt run.
Cluster 8 nodes, 15 s.
25784 m/s 25682 KB/s 1160% cpu 0% read 740% clw threads 18r 0w 10i buffers 1133459 12 d 4 w 0 pfs
cl 1: 10851 m/s 3911 KB/s 597% cpu 0% read 668% clw threads 17r 0w 10i buffers 143992 4 d 0 w 0 pfs
cl 2: 2194 m/s 7959 KB/s 107% cpu 0% read 9% clw threads 1r 0w 0i buffers 143616 3 d 2 w 0 pfs
cl 3: 2186 m/s 7818 KB/s 107% cpu 0% read 9% clw threads 0r 0w 0i buffers 140787 0 d 0 w 0 pfs
cl 4: 2174 m/s 2804 KB/s 77% cpu 0% read 10% clw threads 0r 0w 0i buffers 140654 0 d 2 w 0 pfs
cl 5: 2127 m/s 1612 KB/s 71% cpu 0% read 9% clw threads 0r 0w 0i buffers 140949 1 d 0 w 0 pfs
cl 6: 2060 m/s 544 KB/s 66% cpu 0% read 10% clw threads 0r 0w 0i buffers 141295 2 d 0 w 0 pfs
cl 7: 2072 m/s 517 KB/s 65% cpu 0% read 11% clw threads 0r 0w 0i buffers 141111 1 d 0 w 0 pfs
cl 8: 2105 m/s 522 KB/s 66% cpu 0% read 10% clw threads 0r 0w 0i buffers 141055 1 d 0 w 0 pfs
The main meters for cluster execution are the messages-per-second (m/s), the message volume (KB/s), and the total CPU% of the processes.
We note that CPU utilization is highly uneven and messages are short, about 1K on the average, compared to about 100K during the load. CPU would be evenly divided between the nodes if each got a share of the HTTP requests. We changed the test driver to round-robin requests between multiple end points. The work does then get evenly divided, but the speed is not affected. Also, this does not improve the message sizes since the workload consists mostly of short lookups. However, with the processes spread over multiple servers, the round-robin would be essential for CPU and especially for interconnect throughput.
Then we try 6 Cluster at 1000 Mt. For Single User, we get 1180 m/s, 6955 KB/s, and 173% cpu. For 16 User, this is 6573 m/s, 44366 KB/s, 1470% cpu.
This is a lot better than the figures with 6 Single, due to lower contention on the index tree, as discussed in A Benchmarking Story. Also Single User throughput on 6 Cluster outperforms 6 Single, due to the natural parallelism of doing the Q5 joins in parallel in each partition. The larger the scale, the more weight this has in the metric. We see this also in the average message size, i.e., the KB/s throughput is almost double while the messages/s is a bit under a third.
The small-scale 6 Cluster run is about even with the 6 Single figure. Looking at the details, we see that the qps for Q1 in 6 Cluster is half of that on 6 Single, whereas the qps for Q5 on 6 Cluster is about double that of the 6 Single. This is as one might expect; longer queries are favored, and single row lookups are penalized.
Looking further at the 6 Cluster status we see the cluster wait (clw ) to be 740%. For 16 Users, this means that about half of the execution real time is spent waiting for responses from other partitions. A high figure means uneven distribution between partitions; a low figure means even. This is as expected, since many queries are concerned with just one S and its related objects.
We will update this section once 7 Cluster is ready. This will implement vectored execution and column store inside the cluster nodes.
Benchmarks, Redux Series
-
Benchmarks, Redux (part 1): On RDF Benchmarks
-
Benchmarks, Redux (part 2): A Benchmarking Story
-
Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
-
Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
-
Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
-
Benchmarks, Redux (part 6): BSBM and I/O, continued
-
Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
-
Benchmarks, Redux (part 8): BSBM Explore and Update
-
Benchmarks, Redux (part 9): BSBM With Cluster (this post)
-
Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
-
Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks
-
Benchmarks, Redux (part 12): Our Own BSBM Results Report
-
Benchmarks, Redux (part 13): BSBM BI Modifications
-
Benchmarks, Redux (part 14): BSBM BI Mix
-
Benchmarks, Redux (part 15): BSBM Test Driver Enhancements
03/09/2011 17:54 GMT-0500
|
Modified:
03/14/2011 19:36 GMT-0500
|
Benchmarks, Redux (part 8): BSBM Explore and Update
We will here look at the Explore and Update scenario of BSBM. This presents us with a novel problem as the specification does not address any aspect of ACID.
A transaction benchmark ought to have something to say about this. The SPARUL (also known as SPARQL/Update) language does not say anything about transactionality, but I suppose it is in the spirit of the SPARUL protocol to promise atomicity and durability.
We begin by running Virtuoso 7 Single, with Single User and 16 User, each at scales of 100 Mt, 200 Mt, and 1000 Mt. The transactionality is default, meaning SERIALIZABLE isolation between INSERTs and DELETEs , and READ COMMITTED isolation between READ and any UPDATE transaction. (Figures for Virtuoso 6 will also be presented here in the near future, as they are the currently shipping production versions.)
Virtuoso 7 Single, Full ACID (QMpH, query mixes per hour) |
Scale |
Single User |
16 User |
100 Mt |
9,969 |
65,537 |
200 Mt |
8,646 |
40,527 |
1000 Mt |
5,512 |
17,293 |
Virtuoso 6 Cluster, Full ACID (QMpH, query mixes per hour) |
Scale |
Single User |
16 User |
100 Mt |
5604.520 |
34079.019 |
1000 Mt |
2866.616 |
10028.325 |
Virtuoso 6 Single, Full ACID (QMpH, query mixes per hour) |
Scale |
Single User |
16 User |
100 Mt |
7,152 |
21,065 |
200 Mt |
5,862 |
16,895 |
1000 Mt |
1,542 |
4,548 |
Each run is preceded by a warm-up of 500 or 300 mixes (the exact number is not material), resulting in a warm cache; see previous post on read-ahead for details. All runs do 1000 Explore and Update mixes. The initial database is in the state following the Explore only runs.
The results are in line with the Explore results. There is a fair amount of variability between consecutive runs; the 16 User run at 1000 Mt varies between 14K and 19K QMpH depending on the measurement. The smaller runs exhibit less variability.
In the following we will look at transactions and at how the definition of the workload and reporting could be made complete.
Full ACID means serializable semantic of concurrent insert and delete of the same quad. Non-transactional means that on concurrent insert and delete of overlapping sets of quads the result is undefined. Further if one logged such "transactions," the replay would give serialization although the initial execution did not, hence further confusing the issue. Considering the hypothetical use case of an e-commerce information portal, there is little chance of deletes and inserts actually needing serialization. An insert-only workload does not need serializability because an insert cannot fail. If the data already exists the insert does nothing, if the quad does not previously exist it is created. The same applies to deletes alone. If a delete and insert overlap, serialization would be needed but the semantics implicit in the use case make this improbable.
Read-only transactions (i.e., the Explore mix in the Explore and Update scenario) will be run as READ COMMITTED . These do not see uncommitted data and never block for lock wait. The reads may not be repeatable.
Our first point of call is to determine the cost of ACID. We run 1000 mixes of Explore and Update at 1000 Mt. The throughput is 19214 after a warm-up of 500 mixes. This is pretty good in comparison with the diverse read-only results at this scale.
We look at the pertinent statistics:
SELECT TOP 5 * FROM sys_l_stat ORDER BY waits DESC;
KEY_TABLE INDEX_NAME LOCKS WAITS WAIT_PCT DEADLOCKS LOCK_ESC WAIT_MSECS
=============== ============= ====== ===== ======== ========= ======== ==========
DB.DBA.RDF_QUAD RDF_QUAD_POGS 179205 934 0 0 0 35164
DB.DBA.RDF_IRI RDF_IRI 20752 217 1 0 0 16445
DB.DBA.RDF_QUAD RDF_QUAD_SP 9244 3 0 0 0 235
We see 934 waits with a total duration of 35 seconds on the index with the most contention. The run was 187 seconds, real time. The lock wait time is not real time since this is the total elapsed wait time summed over all threads. The lock wait frequency is a little over one per query mix, meaning a little over one per five locking transactions.
We note that we do not get deadlocks since all inserts and deletes are in ascending key order due to vectoring. This guarantees the absence of deadlocks for single insert transactions, as long as the transaction stays within the vector size. This is always the case since the inserts are a few hundred triples at the maximum. The waits concentrate on POGS, because this is a bitmap index where the locking resolution is less than a row, and the values do not correlate with insert order. The locking behavior could be better with the column store, where we would have row level locking also for this index. This is to be seen. The column store would otherwise tend to have higher cost per random insert.
Considering these results it does not seem crucial to "drop ACID," though doing so would save some time. We will now run measurements for all scales with 16 Users and ACID.
Let us now see what the benchmark writes:
SELECT TOP 10 * FROM sys_d_stat ORDER BY n_dirty DESC;
KEY_TABLE INDEX_NAME TOUCHES READS READ_PCT N_DIRTY N_BUFFERS
=========================== ============================ ========= ======= ======== ======= =========
DB.DBA.RDF_QUAD RDF_QUAD_POGS 763846891 237436 0 58040 228606
DB.DBA.RDF_QUAD RDF_QUAD 213282706 1991836 0 30226 1940280
DB.DBA.RDF_OBJ RO_VAL 15474 17837 115 13438 17431
DB.DBA.RO_START RO_START 10573 11195 105 10228 11227
DB.DBA.RDF_IRI RDF_IRI 61902 125711 203 7705 121300
DB.DBA.RDF_OBJ RDF_OBJ 23809053 3205963 13 636 3072517
DB.DBA.RDF_IRI DB_DBA_RDF_IRI_UNQC_RI_ID 3237687 504486 15 340 488797
DB.DBA.RDF_QUAD RDF_QUAD_SP 89995 70446 78 99 68340
DB.DBA.RDF_QUAD RDF_QUAD_OP 19440 47541 244 66 45583
DB.DBA.VTLOG_DB_DBA_RDF_OBJ VTLOG_DB_DBA_RDF_OBJ 3014 1 0 11 11
DB.DBA.RDF_QUAD RDF_QUAD_GS 1261 801 63 10 751
DB.DBA.RDF_PREFIX RDF_PREFIX 14 168 1120 1 153
DB.DBA.RDF_PREFIX DB_DBA_RDF_PREFIX_UNQC_RP_ID 1807 200 11 1 200
The most dirty pages are on the POGS index, which is reasonable; values are spread out at random. After this we have the PSOG index, likely because of random deletes. New IRIs tend to get consecutive numbers and do not make many dirty pages. Literals come next, with the index from leading string or hash of the literal to id leading, as one would expect, again because of values being distributed at random. After this come IRIs. The distribution of updates is generally as one would expect.
* * *
Going back to BSBM, at least the following aspects of the benchmark have to be further specified:
-
Disclosure of ACID properties. If the benchmark required full ACID many would not run this at all. Besides full ACID is not necessarily an absolute requirement based on the hypothetical usage scenario of the benchmark. However, when publishing numbers the guarantees that go with the numbers must be made explicit. This includes logging, checkpoint frequency or equivalent etc.
-
Steady state. The working set of the Update mix is different from that of the Explore mixes. This touches more indices than Explore. The Explore warm-up is in part good but does not represent steady state.
-
Checkpoint and sustained throughput. Benchmarks involving update generally have rules for checkpointing the state and for sustained throughput. In specific, the throughput of an update benchmark cannot rely on never flushing to persistent storage. Even bulk load must be timed with a checkpoint guaranteeing durability at the end. A steady update stream should be timed with a test interval of sufficient length involving a few checkpoints; for example, a minimum duration of 30 minutes with no less than 3 completed checkpoints in the interval with at least 9 minutes between the end of one and the start of the next. Not all DBMSs work with logs and checkpoints, but if an alternate scheme is used then this needs to be described.
-
Memory and warm-up issues.We have seen the test data generator run out of memory when trying to generate update streams of meaningful length. Also the test driver should allow running updates in timed and non-timed mode (warm-up).
With an update benchmark, many more things need to be defined, and the set-up becomes more system specific, than with a read-only workload. We will address these shortcomings in the measurement rules proposal to come. Especially with update workloads, the vendors need to provide tuning expertise; however, this will not happen if the benchmark does not properly set the expectations. If benchmarks serve as a catalyst for clearly defining how things are to be set up, then they will have served the end user.
Benchmarks, Redux Series
-
Benchmarks, Redux (part 1): On RDF Benchmarks
-
Benchmarks, Redux (part 2): A Benchmarking Story
-
Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
-
Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
-
Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
-
Benchmarks, Redux (part 6): BSBM and I/O, continued
-
Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
-
Benchmarks, Redux (part 8): BSBM Explore and Update (this post)
-
Benchmarks, Redux (part 9): BSBM With Cluster
-
Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
-
Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks
-
Benchmarks, Redux (part 12): Our Own BSBM Results Report
-
Benchmarks, Redux (part 13): BSBM BI Modifications
-
Benchmarks, Redux (part 14): BSBM BI Mix
-
Benchmarks, Redux (part 15): BSBM Test Driver Enhancements
03/09/2011 12:32 GMT-0500
|
Modified:
03/15/2011 17:18 GMT-0500
|
|
|