We have played around with LOD data sets and Virtuoso Column Store for the past several months. Here I will give a few numbers and comment on some platform comparisons we have made. What this leads up to is how to size a system for often-changing, web-style data. The conclusion is a data-to-RAM ratio that gives an acceptable working set without driving the price up by forcing 100% RAM residence.

The experiment is loading Sindice web crawls. The platform is 2 x Xeon 5520 with 144G RAM. The initial load rate is 180-200 Kt/s, dropping to 100 Kt/s at 5Gt because of I/O. The system is Virtuoso Column Store configured to run as 4 processes with 32 partitions, all on the same box. Past 5Gt we just see more I/O, and going further is not relevant; one either runs CPU-bound or not at all.
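As a rough indication of what these rates imply, here is the arithmetic; the assumption that the rate declines more or less linearly from 200 Kt/s to 100 Kt/s over the run is mine, not a measured profile:

```python
# Rough time to reach 5Gt at the rates above, assuming the load rate declines
# roughly linearly from 200 Kt/s to 100 Kt/s (a simplification, not a measured profile).
avg_rate = (200e3 + 100e3) / 2                  # triples per second
hours = 5e9 / avg_rate / 3600
print(f"about {hours:.0f} hours to load 5Gt")   # ~9 hours
```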

We use 4 Crucial SSDs in the setup. Hot structures like the RDF quad indices are on SSD, and cold ones are on hard disk. A cold structure is a write-only index such as the dictionary of literals (ID to literal).

For bulk load, SSDs turn out not to be particularly useful. For a cold start, on the other hand, SSDs cut the warm-up time of 144G of RAM from over half an hour to a couple of minutes. It is possible that Intel SSDs would also help with bulk load, but this has not been tried. The SSD problem during bulk load is that they do not write very fast, and while writes are queued, read latency goes up; so under a constant write load, the SSD's famous instantaneous random read no longer works.

The fragment considered in the example is 4.95Gt: 8.1M pages' worth of quads, 12.7M of literals and IRIs, and 4.71M of full-text index. A page is 8KB. The files on disk contain empty pages, but these do not matter since they do not take up RAM. The quad indices take 13.4 bytes/quad; the row-wise equivalent used to be around 38 bytes/quad with similar data. Two-thirds of the IRI and literal string data could benefit from column-wise stream compression. (This was not used, but if it were, we could count on a 50% drop in size for the data affected, so instead of 12.7M pages we could maybe get 8.5M on a good day. This could be worth doing but is not a priority.) The system was configured with 12M database pages in RAM, so a little under half the database pages of the set fit in RAM at one time; one cannot call this a memory-only setup. Thanks to the locality that exists even in this unusually non-local data, this is about as far as one can lean on secondary storage without it becoming an over-2x slowdown. In practice, under 1% of the rows accessed come from secondary storage, but that alone means half the throughput.
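The figures above can be rechecked with a few lines of arithmetic; nothing beyond the quoted page counts and triple count goes in:

```python
# Recomputing the storage figures quoted above (8KB pages, 4.95Gt).
PAGE = 8 * 1024                       # bytes per database page

quad_pages   = 8.1e6                  # quad indices
string_pages = 12.7e6                 # literals and IRIs
text_pages   = 4.71e6                 # full-text index
total_pages  = quad_pages + string_pages + text_pages

triples = 4.95e9
print(f"quad indices: {quad_pages * PAGE / triples:.1f} bytes/quad")            # 13.4
print(f"resident with 12M pages in RAM: {12e6 / total_pages:.0%} of the pages")  # ~47%, a little under half
```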

We note that this data set represents the worst that we have seen: 129M distinct graphs, i.e., about 38 triples per graph. Regular data like the synthetic benchmark sets takes half the space per quad. This is about a third of a Sindice crawl; the other two-thirds look the same, as far as we looked.

So if you are interested in hosting data like this, budget 144GB of RAM for every 5Gt. Do not try it with anything less. Budgeting double this is wise, so that you have room to cook the data; this matters because in order to do anything with it, one needs at the very least to copy things around when materializing transformations.

If you are budget-constrained and hosting very regular content like UniProt, you can budget maybe 144GB RAM for every 10Gt.
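Putting the two rules of thumb together, here is a minimal sizing sketch; the function name and the 2x "cooking space" factor are just my framing of the advice above, not anything in Virtuoso:

```python
# Rule-of-thumb RAM budget from the two preceding paragraphs.
def ram_budget_gb(gigatriples, regular=False, workspace_factor=2.0):
    gt_per_144gb = 10 if regular else 5     # regular data packs about twice as densely
    return 144 * gigatriples / gt_per_144gb * workspace_factor

print(ram_budget_gb(5))                  # 288.0 GB for 5Gt of web crawl, with room to cook the data
print(ram_budget_gb(10, regular=True))   # 288.0 GB for 10Gt of UniProt-like data
```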

As for CPU, this matters less, as long as you do not go to disk. Just for load speed: DBpedia loads in 300s on a cluster of eight (8) dual AMD 2378 boxes at 2.6GHz (8 cores per host, so 64 cores in the cluster), and in 945s on one (1) dual Xeon 5520 box at 2.26GHz (8 cores in the host). Intel makes much better CPUs, as we see. Both scenarios are 100% in RAM. For even more regular data, the load rates are a bit higher: 1.3 Mt/s for the AMD cluster, and 300 Kt/s for the Xeon host.
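Per core, the quoted figures work out as follows; this is plain arithmetic and ignores clock and interconnect differences:

```python
# Per-core comparison, using only the figures quoted above.
amd_core_seconds  = 8 * 8 * 300      # 8 boxes x 8 cores x 300 s for DBpedia
xeon_core_seconds = 1 * 8 * 945      # 1 box  x 8 cores x 945 s for DBpedia
print(f"Xeon does the same load in {amd_core_seconds / xeon_core_seconds:.1f}x fewer core-seconds")  # ~2.5x

print(f"AMD cluster: {1.3e6 / 64:.0f} triples/s per core")   # ~20K
print(f"Xeon host:   {300e3 / 8:.0f} triples/s per core")    # ~38K
```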

The interconnect for the AMD cluster is 1 x gigE but this does not matter for load. For CPU-bound cross-partition JOINs, 1 or 2 x gigE is insufficient; 4 x gigE might barely make it; InfiniBand should be safe. When running cross-partition JOINs, a single 8-core Xeon box generates about 300MB/s of interconnect traffic; a gigE connection can maybe take 50MB/s with some luck.
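For the record, the arithmetic behind this, using the figures just quoted plus the theoretical gigE line rate:

```python
# Interconnect arithmetic from the figures above.
traffic_per_box = 300.0     # MB/s of cross-partition JOIN traffic per 8-core Xeon box
gige_lucky      = 50.0      # MB/s usable per gigE link "with some luck"
gige_line_rate  = 125.0     # MB/s theoretical gigE maximum

print(f"links needed at 50 MB/s usable: {traffic_per_box / gige_lucky:.0f}")       # 6
print(f"links needed at full line rate: {traffic_per_box / gige_line_rate:.1f}")   # 2.4
# So 4 x gigE sits right at the edge, while InfiniBand has clear headroom.
```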

Intel E5 is not dramatically better than Nehalem, but this is something we will see in a while, when we make measurements with real equipment. Prior to the E5 release, we tried Amazon EC2 CC2 ("Cluster Compute Eight Extra Large Instance" -- 2 x 8-core E5, 2.66GHz). The results were inconclusive; it never did more than 1.9x better than the Xeon 5520, even when running an empty loop (i.e., a recursive Fibonacci function in SQL, no cache misses, no I/O). With a database JOIN, 1.3x better is the best we saw. But this is presumably the fault of Amazon and not of the E5.

We also tried the AMD "Magny-Cours", but with 32 cores against 8 it never did more than 2x better, more often something like 1.4x, and single-thread speed was 50% worse, so not a good buy. We did not find a Bulldozer to try, and did not feel like buying one, since the reviews did not promise more per-core speed than the Magny-Cours.

It seems that, especially with the Column Store, we are truly CPU-bound and not memory-latency- or bandwidth-bound. This is based on the observation that a Xeon 5620 with 2 of 3 memory channels populated loads BSBM data only 10% faster than the same box with 1 of 3 channels populated, with CPU affinity set on a dual-socket system.

So, if you have a choice between a $2K processor (E5-2690) and a $600 processor (E5-2630), buy the cheaper one and spend the savings on RAM. $1440 buys 128G in $90 8G DIMMs. Then buy E5 boards with 24 DIMM slots -- one board for every 7Gt of web-crawl data. If your software licenses are priced per core, getting higher-clock 4-core E5s might make sense.
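The DIMM and board arithmetic, spelled out; the prices and the 144GB-per-5Gt rule are as above:

```python
# DIMM and board arithmetic behind the advice above.
dimm_price, dimm_size = 90, 8                   # USD, GB
dimms = 1440 // dimm_price
print(f"{dimms} DIMMs = {dimms * dimm_size} GB")                 # 16 DIMMs = 128 GB

board_ram = 24 * dimm_size                      # a 24-slot E5 board fully populated: 192 GB
gb_per_gt = 144 / 5                             # from the 144GB-per-5Gt web-crawl rule
print(f"one board covers about {board_ram / gb_per_gt:.1f} Gt of web crawl")   # ~6.7, i.e. ~7Gt
```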

While on the subject of bytes and quads/triples, we note that Bigdata®'s recent announcement claims up to 50 billion triples per single server. Franz loaded up to a trillion triples at a good 800+ Kt/s rate. One is led to think from the spec that this was with less than full CPU but still with highly local data, since at 1.5 bytes a triple it would otherwise hit very heavy I/O. Their statement to the effect that the data is LUBM-like corroborates this, so we are not talking about exactly the same thing.

So if you compare the claims, I am talking about running CPU-bound on the worst data there is. Franz and Bigdata® do not specify, so it is hard to compare. LOD2 should in principle publish actual metrics with at least Bigdata®; Franz is not participating in these races.

We may publish more detailed measurements with more varied configurations later. The thing to remember is a minimum of 144GB of RAM for every 5Gt of web crawls, if you want to load and refresh in RAM.