Before Christmas, I wrote about a rerun of BSBM to check how it works before doing it on 500 Giga-triples ("Gt") at CWI. Now we can talk about the experiment with CWI's Scilens cluster. The specs are in the previous post. This is a cautionary tale about large data on one hand, and about high load speed on the other.

The BSBM generator and driver are an ever-fresh source of new terrors. The nature of the gig is that you have a window to do your experiment in, and that involves first generating the test data, which comes to somewhere around 3 TB of gzipped files. It took a whole week to make the files. During that time, of course, you want to anticipate what is going to break with the queries. So while the generator was going, we loaded 50 renamed replicas of the 10 Gt dataset. At partial capacity, we may add, because 4 boxes had half their memory taken by the BSBM generator. We hate that program. Of course nobody gives a damn about it, so it has been maintained in the worst way possible; for example, the way its cluster version generates slices of data is by having every instance generate the full data set, but write only 1 out of so many items to the output (see the sketch below). So no amount of capacity will make it faster. As for BSBM itself, if you generate 10 Gt once and occasionally use this as test data, it does not inconvenience you so much. Then, of course, the test driver was patched to generate queries against renamed replicas of a dataset. But then the new driver would not read the dataset summary files made by the previous driver, because of Java class versions. 8 hours to regenerate 10 Gt. A real train wreck. This is by far not the end of it, but we are out of space here. So on with it; may that program be buried.
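To make the slicing problem concrete, here is a minimal sketch of the pattern, a hypothetical illustration rather than the actual BSBM code: every "parallel" instance still walks the complete item stream and merely skips the items that belong to other slices, so each worker does roughly the work of a single generator no matter how many workers there are.

    # Hypothetical illustration of modulo-slicing; not the actual BSBM generator code.
    def generate_slice(total_items, num_workers, worker_id):
        out = []
        for i in range(total_items):
            item = f"item-{i}"                # stands in for the full, expensive generation work
            if i % num_workers == worker_id:  # keep only this worker's 1/num_workers share
                out.append(item)
        return out

    # Every worker still iterates over all total_items, so adding workers does not
    # shorten the wall-clock time of any one of them.
    print(len(generate_slice(1_000_000, 10, 3)))   # prints 100000, after 1,000,000 iterations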

In the end, the 2000 gz files with the 500 Gt in them were complete. Then it turned out that each file has tens of millions of namespace prefix declarations at the beginning, so just starting to load a file grows the process by some 9 GB for the prefixes alone. Out of 256 GB of RAM per box, about 72 GB are taken by the prefixes if you load 8 files in parallel on each. Well, one could run a sed script to unzip, expand the prefixes, and rezip (the files would not get any bigger), but that would be a day to run; a sketch of the idea follows below.
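For illustration, the expand-and-rezip idea could be streamed in Python rather than sed. This is a rough sketch only, assuming simple Turtle-style "@prefix" declarations and prefixed names that never appear inside IRIs or string literals; the function name and regexes are just for illustration, not from any real tool.

    import gzip
    import re
    import sys

    # Matches "@prefix foo: <http://...> ." declarations, and foo:bar prefixed names.
    PREFIX_DECL = re.compile(r'@prefix\s+([\w-]*):\s*<([^>]*)>\s*\.')
    PNAME = re.compile(r'(?:^|(?<=\s))([A-Za-z][\w-]*):([A-Za-z0-9_][\w.-]*)')

    def expand_prefixes(in_path, out_path):
        prefixes = {}

        def expand(m):
            ns = prefixes.get(m.group(1))
            return '<' + ns + m.group(2) + '>' if ns else m.group(0)

        with gzip.open(in_path, "rt", encoding="utf-8") as src, \
             gzip.open(out_path, "wt", encoding="utf-8") as dst:
            for line in src:
                decl = PREFIX_DECL.match(line)
                if decl:
                    prefixes[decl.group(1)] = decl.group(2)   # remember it, do not re-emit
                    continue
                dst.write(PNAME.sub(expand, line))            # foo:bar -> <http://...bar>

    if __name__ == "__main__":
        expand_prefixes(sys.argv[1], sys.argv[2])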

So, anyway, with 12 boxes, 24 processes, and (in principle) 384 threads, the load rate is between 3 and 4 million triples per second ("Mt/s"). With 2 boxes it is 630 Kt/s, so you would say this is scalable, near enough to linear: the 2 boxes have 12 cores each at 2.3 GHz, the Scilens boxes have 16 at 2.0 GHz; close enough.
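As a rough sanity check on the linearity claim, using only the figures above: 6 times the boxes, times about 1.16x the per-box raw CPU (16 x 2.0 GHz versus 12 x 2.3 GHz), would predict roughly 0.63 Mt/s x 6 x 1.16, about 4.4 Mt/s, which is the same ballpark as the observed 3-4 Mt/s.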

For the 3-4 Mt/s rate, there is an average of 200 threads running. This is not the full platform, as the second thread of each core is idle for the most part. Adding the second thread usually adds some 30% of throughput. A high of 5 Mt/s could be had by going to full CPU, but doubling the files being loaded would run out of memory because of the namespace prefixes (at 9 GB a file, 16 files per box would put the prefix overhead alone at some 144 GB of the 256 GB). See, it is sheer luck that the BSBM thing, inept as it is, is still marginally usable, despite the prefixes and the horrible generator. A bit worse still, and it would have been a non-starter. It comes from the times when RDF just meant an inept database, so scalability clearly was not among its design objectives.

With 96 files being loaded across the cluster, we got the run stats below for a couple of 4-minute windows; in each, the data size at the time of the sample is between 50 Gt and 100 Gt. The long line is the cluster status summary; the tables below give the load rate in the windows between timestamps, i.e., the growth in triple count, expressed as triples per second (tps), since the previous sample.

Cluster 24 nodes, 240 s. 18866 m/s 692017 KB/s 21842% cpu 7% read 95% clw threads 356r 0w 114i buffers 99250961 97503789 d 2275 w 0 pfs
load rate (tps)        timestamp
3,853,915.028323892    2014-01-04 08:38:36 +0000
4,245,681.678456353    2014-01-04 08:38:33 +0000
3,680,757.080973009    2014-01-04 08:38:06 +0000
4,138,599.125958298    2014-01-04 08:38:03 +0000
4,887,272.575808064    2014-01-04 08:37:36 +0000
4,093,772.082515462    2014-01-04 08:37:33 +0000
4,399,343.552149284    2014-01-04 08:37:06 +0000
4,184,758.045998296    2014-01-04 08:37:03 +0000
3,884,665.444851716    2014-01-04 08:36:36 +0000
4,197,270.027036035    2014-01-04 08:36:33 +0000

Some hours later --

Cluster 24 nodes, 240 s. 14601 m/s 506784 KB/s 19721% cpu 61% read 1310% clw threads 374r 0w 126i buffers 189886490 107378792 d 1983 w 18 pfs
load rate (tps)        timestamp
3,273,757.708076397    2014-01-04 11:49:53 +0000
3,274,119.596013466    2014-01-04 11:49:53 +0000
3,318,539.715342822    2014-01-04 11:49:23 +0000
3,318,701.609946335    2014-01-04 11:49:23 +0000
3,127,730.142328589    2014-01-04 11:48:53 +0000
3,127,731.608946369    2014-01-04 11:48:53 +0000
3,273,572.647578414    2014-01-04 11:48:23 +0000
3,273,622.779240692    2014-01-04 11:48:23 +0000
2,872,466.21779274     2014-01-04 11:47:53 +0000
2,872,495.383487217    2014-01-04 11:47:53 +0000

Pretty good. I don't know of others coming even close.

Next we will look at query plans and scalability in query processing.