Virtuoso Cluster Benchmark AMI

This document contains instructions for deploying the Virtuoso Cluster Benchmarks AMI. Note that these instructions do not apply to the open source v7fasttrack benchmark AMI.

The AMI comes with TPC-H preconfigured at 100G and at 1000G scale. The 100G scale is intended to be deployed on 2 machines and the 1000G scale on 4 machines. In each case the instances should be of the type r3.8xlarge. The instances must be launched in a virtual private cloud (AWS VPC), and the VPC should have a security setting that allows unrestricted connections. The setups can be used for running TPC-H or any other workload. Follow the steps below to get started.

Cluster Setup

Launch 2 or 4 r3.8xlarge instances. The instances must all be in the same placement group and the same virtual private cloud (VPC), and the VPC must have unrestricted security settings.

After launching the instances, put their private IP addresses in a file called hosts.conf. The instances will be named a1, a2, a3 and a4. The hosts.conf file is of the form:

    Number_of_machines = 4
    172.30.3.89
    172.30.3.88
    172.30.3.87
    172.30.3.86

Copy this file to the ec2-user's home directory on the first machine. Log in to the first machine and do:

    $ sudo setup_one.sh
    $ setup_all.sh

The second command creates file systems on all machines and may take a few minutes. You can verify that there is a password-free ssh connection between any two of the hosts a1, a2, and so forth.

The Virtuoso cluster server bundled with the image needs a license file. You may obtain the requisite number of license files, one per machine instance, from OpenLink sales/support. You should specify that the license files are for use with the cluster benchmarks AMI. Once you have the license files, copy a different file to the /etc/oplmgr directory of each machine:

    $ scp virtuoso.lic.2 a2:/etc/oplmgr/

Then start the license manager on that machine:

    $ ssh a2 "cd /etc/oplmgr; /usr/sbin/oplmgr +start"

Running a Virtuoso Cluster

There are two ready-made benchmark configurations, one with 2 machines in /home/tpch100c and one with 4 machines in /home/tpch1000c. In addition to these, there are all the single server benchmark directories of the open source benchmarks AMI. You can use these setups for running TPC-H at the indicated scale or any other cluster workload.

The setup is configured to run two server processes per machine, each scoped to its own NUMA node. The SQL listener of the first server process is 1201 on the first machine, the second process is at 1202 on the first machine, the third is at 1203 on the second machine, and so forth.

On the first host (a1):

    $ cd /home/tpch1000c    # or /home/tpch100c for the 2 machine configuration
    $ ./init_all.sh

This command starts Virtuoso to run the cluster configurator utility, which generates configuration files for all constituent processes based on the template files *.tpl in the same directory. The init_all.sh script then runs the setup files, creating data and running directories on each machine in the cluster.

After init_all.sh completes, the Virtuoso cluster is ready to be started. To simply run Virtuoso without TPC-H, do ./start.sh in the directory where init_all.sh was run. When done for the first time, this creates an empty database split across the nodes of the cluster. You can follow the progress of the start in /home/tpch1000r/1/virtuoso.log. Initialization takes about one minute. When the cluster is online, you can connect to it with:

    $ isql 1201
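If you want to script the startup, you can poll the first SQL listener until it accepts connections. The sketch below is one way to do this; it assumes the default dba/dba credentials of a freshly initialized database, and the poll interval and retry count are arbitrary choices, not part of the AMI.

    # Sketch: wait until the SQL listener on port 1201 accepts connections.
    # Assumes the default dba/dba credentials of a newly created database.
    for i in $(seq 1 60); do
        if isql 1201 dba dba EXEC="status ('cluster');" > /dev/null 2>&1; then
            echo "cluster is online"
            break
        fi
        sleep 5    # poll every 5 seconds, up to about 5 minutes
    done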
To run TPC-H, you must first generate the dataset. This is done with:

    $ ./dbgen.sh

in the appropriate directory. This starts a parallel, distributed set of dbgen processes that generate a fraction of the data on each of the machines. For 100G this takes a few minutes; for 1000G it takes about half an hour.

Once the data generation is complete, start the database servers and load the TPC-H dataset with:

    $ ./load.sh

This starts the servers, defines the schema and bulk loads the data. For 100G this takes about 4 minutes, for 1000G about 34 minutes. You can follow the progress of the data loading by connecting to the servers and doing interactive SQL:

    $ isql 1201

    -- Show the list of presently loading files. You can do this at intervals.
    SQL> select ll_file from load_list where ll_state = 1;

    -- To see cluster traffic and CPU utilization:
    SQL> status ('cluster');

    -- Or, for per-process statistics:
    SQL> status ('cluster_d');

    -- From time to time you can check the table counts, e.g.:
    SQL> select count (*) from lineitem;

The load.sh script returns when the data loading is complete.

After the data loading is complete, you can run the benchmark with ./run.sh in the appropriate directory. This expects the servers to be up, in the post-load state. For 100G this runs 2 sets of power + throughput tests; for 1000G it runs 3 sets. The 100G setup runs 5 streams in the throughput test, the 1000G setup runs 7. The 100G run takes about 3 minutes, the 1000G run about 33 minutes. At the end of the run, the files report1.txt to report3.txt appear in the running directory. These contain the numerical quantities summary for each power + throughput test pair.

After the run is completed, you may do another run by stopping the servers without a checkpoint, so that they come up in the post bulk load state. Connect with isql:

    $ isql 1201
    SQL> cl_exec ('raw_exit ()');

Then delete the transaction logs with

    $ ssh a1 "rm /1s1/dbs/*.trx"

for each machine, a1, a2, and so forth.

To start the Virtuoso cluster again, do

    $ ./start.sh

in the directory of the configuration, /home/tpch100c or /home/tpch1000c, on the first machine, named a1 in /etc/hosts.
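The reset to the post bulk load state can also be scripted. The following is a rough sketch of the sequence just described, assuming the 4-machine configuration with hosts a1 through a4, the default dba/dba credentials, and that it is run from the configuration directory on a1; adjust the host list for the 2-machine setup.

    # Sketch: return the cluster to the post bulk load state and restart it.
    # Assumes default dba/dba credentials and the host names a1..a4.

    # Stop all server processes without a checkpoint. The connection drop
    # after raw_exit is expected, so ignore the isql exit status.
    isql 1201 dba dba EXEC="cl_exec ('raw_exit ()');" || true

    # Remove the transaction logs on every machine.
    for h in a1 a2 a3 a4; do
        ssh $h "rm /1s1/dbs/*.trx"
    done

    # Start the cluster again from the configuration directory,
    # e.g. /home/tpch1000c.
    ./start.sh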