<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
<channel>

<title>Benchmarks, Redux (part 9): BSBM With Cluster</title><link>http://virtuoso.openlinksw.com:443/blog/vdb/blog/?id=1676</link><description>This post is dedicated to our brothers in horizontal partitioning (or sharding), Garlik and Bigdata.

At first sight, the BSBM Explore mix appears very cluster-unfriendly, as it contains short queries that access data at random. There is every opportunity for latency and few opportunities for parallelism.

For this reason we had not even run the BSBM mix with Virtuoso Cluster. We were not surprised to learn that Garlik hadn&#39;t run BSBM either. We understand from Systap that their Bigdata BSBM experiments were on a single-process configuration.

However, the 4Store results in the recent Berlin report came from a distributed setup, since 4Store always runs a multiprocess configuration, even on a single server. It therefore seemed interesting to see how Virtuoso Cluster compares with Virtuoso Single on this workload. These tests were run on a different box than the recent BSBM tests, so those 4Store figures are not directly comparable.

The setup here consists of 8 partitions, each managed by its own process, all running on the same box. Any of these processes can have its own HTTP and SQL listener and can provide the same service. Most access to data goes over the interconnect, except when the data is co-resident in the process that is coordinating the query. The interconnect is Unix domain sockets, since all 8 processes are on the same box.


	
6 Cluster - Load Rates and Times

Scale     Rate (quads per second)   Load time (seconds)   Checkpoint time (seconds)
100 Mt                    119,204                   749                          89
200 Mt                    121,607                  1486                         157
1000 Mt                   102,694                  8737                         979
	



	
6 Single - Load Rates and Times

Scale     Rate (quads per second)   Load time (seconds)   Checkpoint time (seconds)
100 Mt                     74,713                  1192                         145
	




The 6 Cluster load rates and times are systematically better than for 6 Single. They are also not bad compared to the 7 Single vectored load rates of 220 Kt/s or so. Loading is a cluster-friendly operation, running at a steady 1400+% CPU utilization with an aggregate message throughput of 40 MB/s. 7 Single is faster because of vectoring at the index level, not because the cluster was hitting communication overheads. 6 Cluster is faster than 6 Single because scale-out in this case diminishes contention, even on a single box.
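The post does not say how the rate column in the load tables was derived. As a rough cross-check, the sketch below assumes the rate is the nominal scale divided by load time plus checkpoint time; that formula is an inference from the published numbers, not something stated in the text. Under that assumption the implied rates land within a fraction of a percent of the reported ones.

```python
# Figures copied from the load tables above.
# Tuple layout: (nominal triples, load seconds, checkpoint seconds, reported quads/s).
LOADS = {
    "6 Cluster 100 Mt": (100e6, 749, 89, 119_204),
    "6 Cluster 200 Mt": (200e6, 1486, 157, 121_607),
    "6 Cluster 1000 Mt": (1000e6, 8737, 979, 102_694),
    "6 Single 100 Mt": (100e6, 1192, 145, 74_713),
}


def implied_rate(triples, load_s, checkpoint_s):
    """Assumed formula: nominal scale over load plus checkpoint time."""
    return triples / (load_s + checkpoint_s)


def relative_gap(name):
    """Fractional difference between the implied and the reported rate."""
    triples, load_s, checkpoint_s, reported = LOADS[name]
    return (implied_rate(triples, load_s, checkpoint_s) - reported) / reported


for name in LOADS:
    print(f"{name}: implied rate differs from reported by {relative_gap(name):+.2%}")

# Cluster over Single at 100 Mt, using the reported rates.
speedup = LOADS["6 Cluster 100 Mt"][3] / LOADS["6 Single 100 Mt"][3]
print(f"6 Cluster loads 100 Mt {speedup:.2f}x as fast as 6 Single")
```

The small remaining gap would be explained if the generated datasets are slightly smaller than their nominal scale, but that too is a guess.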

Throughput is as follows:


	
6 Cluster - Throughput (QMpH, query mixes per hour)

Scale     Single User   16 User
100 Mt           7318     43120
200 Mt           6222     29981
1000 Mt          2526     11156
	



	
6 Single - Throughput (QMpH, query mixes per hour)

Scale     Single User   16 User
100 Mt           7641     29433
200 Mt           6017     13335
1000 Mt          1770      2487
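
To make the two throughput tables easier to compare, the following sketch computes the 6 Cluster to 6 Single ratio at each scale. The numbers are copied from the tables; the dictionary and function names are made up for illustration.

```python
# QMpH as (Single User, 16 User), copied from the two throughput tables.
CLUSTER = {"100 Mt": (7318, 43120), "200 Mt": (6222, 29981), "1000 Mt": (2526, 11156)}
SINGLE = {"100 Mt": (7641, 29433), "200 Mt": (6017, 13335), "1000 Mt": (1770, 2487)}


def cluster_over_single(scale):
    """Return (Single User ratio, 16 User ratio) of 6 Cluster to 6 Single."""
    c1, c16 = CLUSTER[scale]
    s1, s16 = SINGLE[scale]
    return c1 / s1, c16 / s16


for scale in CLUSTER:
    one_user, sixteen_users = cluster_over_single(scale)
    print(f"{scale}: Single User {one_user:.2f}x, 16 User {sixteen_users:.2f}x")
```

The ratios grow with scale, and grow faster for 16 User than for Single User.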
	



Below is a snapshot of status during the 6 Cluster 100 Mt run.


 
Cluster 8 nodes, 15 s.
       25784 m/s  25682 KB/s  1160% cpu  0% read  740% clw  threads 18r 0w 10i  buffers 1133459  12 d  4 w  0 pfs
cl 1:  10851 m/s   3911 KB/s   597% cpu  0% read  668% clw  threads 17r 0w 10i  buffers  143992   4 d  0 w  0 pfs
cl 2:   2194 m/s   7959 KB/s   107% cpu  0% read    9% clw  threads  1r 0w  0i  buffers  143616   3 d  2 w  0 pfs
cl 3:   2186 m/s   7818 KB/s   107% cpu  0% read    9% clw  threads  0r 0w  0i  buffers  140787   0 d  0 w  0 pfs
cl 4:   2174 m/s   2804 KB/s    77% cpu  0% read   10% clw  threads  0r 0w  0i  buffers  140654   0 d  2 w  0 pfs
cl 5:   2127 m/s   1612 KB/s    71% cpu  0% read    9% clw  threads  0r 0w  0i  buffers  140949   1 d  0 w  0 pfs
cl 6:   2060 m/s    544 KB/s    66% cpu  0% read   10% clw  threads  0r 0w  0i  buffers  141295   2 d  0 w  0 pfs
cl 7:   2072 m/s    517 KB/s    65% cpu  0% read   11% clw  threads  0r 0w  0i  buffers  141111   1 d  0 w  0 pfs
cl 8:   2105 m/s    522 KB/s    66% cpu  0% read   10% clw  threads  0r 0w  0i  buffers  141055   1 d  0 w  0 pfs

 



The main meters for cluster execution are the messages-per-second (m/s), the message volume (KB/s), and the total CPU% of the processes. 

CPU utilization is highly uneven: cl 1 runs at 597% while the other processes run at 65% to 107%. Messages are short, about 1 KB on average (25682 KB/s over 25784 m/s), compared to about 100 KB during the load. CPU would be evenly divided between the nodes if each got a share of the HTTP requests, so we changed the test driver to round-robin requests between multiple endpoints. The work then does get evenly divided, but throughput is not affected. Nor does this improve the message sizes, since the workload consists mostly of short lookups. With the processes spread over multiple servers, however, round-robin would be essential for balancing CPU and especially for interconnect throughput.
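
The round-robin change to the test driver is not shown in the post. A minimal sketch of the idea, assuming one HTTP listener per cluster process, might look like the following. The endpoint URLs, the ports, and the RoundRobin class are all hypothetical, and this is not the actual BSBM driver code.

```python
import threading
from itertools import cycle

# Hypothetical endpoints: one HTTP listener per cluster process on one box.
# The ports are made up for illustration.
ENDPOINTS = [f"http://localhost:{8001 + i}/sparql" for i in range(8)]


class RoundRobin:
    """Hand out endpoints in rotation; safe to share between client threads."""

    def __init__(self, endpoints):
        self._rotation = cycle(endpoints)
        self._lock = threading.Lock()

    def next_endpoint(self):
        # The lock keeps concurrent clients from racing on the shared iterator.
        with self._lock:
            return next(self._rotation)


chooser = RoundRobin(ENDPOINTS)
# Sixteen consecutive requests visit each of the 8 endpoints exactly twice.
assignments = [chooser.next_endpoint() for _ in range(16)]
for url in ENDPOINTS:
    print(url, assignments.count(url))
```

Each client thread would call next_endpoint() before issuing a query, so that query coordination is spread over all 8 processes rather than landing on one.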


Next, we tried 6 Cluster at 1000 Mt. For Single User, we get 1180 m/s, 6955 KB/s, and 173% CPU. For 16 User, we get 6573 m/s, 44366 KB/s, and 1470% CPU.

Throughput here is a lot better than with 6 Single, due to lower contention on the index tree, as discussed in A Benchmarking Story. Single User throughput on 6 Cluster also outperforms 6 Single, due to the natural parallelism of running the Q5 joins concurrently in each partition. The larger the scale, the more weight this has in the metric. We also see this in the average message size: compared with the 100 Mt snapshot above, the KB/s throughput is almost double, while the messages/s is under a third.
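
The message-size comparison can be worked out directly from the status figures quoted in the text. The sketch below divides KB/s by m/s for each run; the run labels are ours, and the post does not say which user count the 100 Mt snapshot was taken at.

```python
# (messages per second, KB per second) as quoted in the text.
RUNS = {
    "100 Mt snapshot": (25784, 25682),
    "1000 Mt Single User": (1180, 6955),
    "1000 Mt 16 User": (6573, 44366),
}


def avg_message_kb(run):
    """Average message size in KB: volume divided by message count."""
    messages_per_s, kb_per_s = RUNS[run]
    return kb_per_s / messages_per_s


for run in RUNS:
    print(f"{run}: {avg_message_kb(run):.2f} KB per message")

# 1000 Mt 16 User relative to the 100 Mt snapshot.
kb_ratio = RUNS["1000 Mt 16 User"][1] / RUNS["100 Mt snapshot"][1]
msg_ratio = RUNS["1000 Mt 16 User"][0] / RUNS["100 Mt snapshot"][0]
print(f"KB/s ratio {kb_ratio:.2f}, m/s ratio {msg_ratio:.2f}")
```

Messages at 1000 Mt come out several times larger than at 100 Mt, which fits the point that longer, more parallel Q5 work dominates at the larger scale.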


The small-scale 6 Cluster run is about even with the 6 Single figure. Looking at the details, we see that the queries per second (qps) for Q1 on 6 Cluster is half of that on 6 Single, whereas the qps for Q5 on 6 Cluster is about double that on 6 Single. This is as one might expect: longer queries are favored, and single-row lookups are penalized.

Looking further at the 6 Cluster status, we see the cluster wait (clw) at 740%. For 16 Users, this means that about half of the execution real time is spent waiting for responses from other partitions. A high figure means uneven distribution of work between partitions; a low figure means even distribution. This is as expected, since many queries are concerned with just one subject (S) and its related objects.
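
The "about half" figure follows from simple arithmetic, sketched here under the assumption that clw is summed over threads in the same way as CPU%, so that 100% corresponds to one thread waiting all of the time. The function name is illustrative.

```python
def waiting_share(clw_percent, concurrent_users):
    """Fraction of client real time spent in cluster wait.

    Assumes clw is summed over threads like CPU%, so that
    100% corresponds to one thread waiting all of the time.
    """
    return clw_percent / (100.0 * concurrent_users)


share = waiting_share(740, 16)
print(f"{share:.0%} of execution real time is cluster wait")
```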


We will update this section once 7 Cluster is ready. This will implement vectored execution and column store inside the cluster nodes.




Benchmarks, Redux Series

Benchmarks, Redux (part 1): On RDF Benchmarks
Benchmarks, Redux (part 2): A Benchmarking Story
Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
Benchmarks, Redux (part 6): BSBM and I/O, continued
Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
Benchmarks, Redux (part 8): BSBM Explore and Update
Benchmarks, Redux (part 9): BSBM With Cluster (this post)
Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks
Benchmarks, Redux (part 12): Our Own BSBM Results Report
Benchmarks, Redux (part 13): BSBM BI Modifications
Benchmarks, Redux (part 14): BSBM BI Mix
Benchmarks, Redux (part 15): BSBM Test Driver Enhancements

</description><pubDate>Wed, 09 Mar 2011 22:54:50 GMT</pubDate><generator>Virtuoso Universal Server 08.03.3334</generator><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Virtuoso Data Space Bot</dc:creator><image><title>Benchmarks, Redux (part 9): BSBM With Cluster</title><url>http://virtuoso.openlinksw.com:443/weblog/public/images/vbloglogo.gif</url><link>http://virtuoso.openlinksw.com:443/blog/vdb/blog/?id=1676</link><description>A great place to track Virtuoso&#39;s rapid evolution.</description><width>88</width><height>31</height></image>

</channel>
</rss>
