This document details how large RDF data set files can be bulk loaded into Virtuoso. The data sets may consist of multiple files, which may be loaded into one or several graphs.
- If your Virtuoso release is prior to the commercial 06.02.3129 or open source 6.1.3 releases, then the Virtuoso Bulk Loader functions need to be loaded manually.
- The directory containing the data set files must be included in the DirsAllowed parameter defined in the Virtuoso INI file, after which the Virtuoso server must be restarted.
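A quick pre-flight check can catch a missing DirsAllowed entry before a load is attempted. The sketch below is a hypothetical helper: the function name dir_is_allowed and the INI path in the usage line are illustrative and not part of Virtuoso. It assumes DirsAllowed is written as a single comma-separated line, and it only tests for an exact match, so a directory that is permitted through a parent entry will be reported as missing.

```shell
#!/bin/sh
# Hypothetical helper, not a Virtuoso tool.
# Succeeds if <dir> appears verbatim in the DirsAllowed line of <ini-file>.
# Assumption: DirsAllowed is a single comma-separated line.
dir_is_allowed() {
    ini_file=$1
    wanted=$2
    # Take the first DirsAllowed line, drop the key, split on commas,
    # trim surrounding blanks, then look for an exact match.
    grep -i '^[[:space:]]*DirsAllowed[[:space:]]*=' "$ini_file" | head -n 1 |
        sed 's/^[^=]*=//' | tr ',' '\n' |
        sed 's/^[[:space:]]*//; s/[[:space:]]*$//' |
        grep -Fxq -- "$wanted"
}

# Usage (paths are placeholders):
#   dir_is_allowed /path/to/virtuoso.ini /path/to/files \
#       || echo "add /path/to/files to DirsAllowed, then restart the server"
```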
- The Virtuoso server should be configured to use sufficient memory and other system resources, as detailed in the Virtuoso RDF Performance Tuning Guide; otherwise the load may take an unacceptably long time.
- The files being loaded must have one of the file extensions the rdf_loader_run() function knows of: .rdf, .nt, .ttl, .xml, .owl, .trig, .grdf and .nq. Files can also be gzipped to save space, i.e. have an additional .gz extension, in which case the loader will automatically unzip them at runtime.
- The name of the RDF graph into which the data set(s) should be loaded can be specified through a text file placed in the same source directory as the source data files. This will override the graph name specified in the ld_dir_all() function call. The content of a file with the same name as a data file plus the .graph filename extension will be used for that data file (e.g., my_data.n3.graph will be used with my_data.n3). The content of a file named global.graph will be used for any and all other data files in that directory. Note: if the third parameter (the graph IRI) is NULL, any data files that do not have a corresponding .graph file will not be loaded.
<source-file>.<ext>
<source-file>.<ext>.graph
global.graph
For example:
myfile.n3          ;; RDF data
myfile.n3.graph    ;; Contains the graph IRI into which RDF data from myfile.n3 will be loaded
global.graph       ;; Contains the graph IRI into which RDF data from any files that do not have a specific graph name file will be loaded
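The graph name files described above are plain text, so they can be created from a script. This is a minimal sketch: the two function names and the example IRIs are hypothetical, and the IRI is written without a trailing newline on the assumption that the loader uses the file content as-is.

```shell
#!/bin/sh
# Hypothetical helpers for creating graph name files next to the data files.

# Write <data-file>.graph containing the graph IRI for that one data file.
write_graph_file() {
    data_file=$1
    graph_iri=$2
    # No trailing newline (assumption: file content is used verbatim).
    printf '%s' "$graph_iri" > "${data_file}.graph"
}

# Write global.graph, the fallback for data files in <dir> that have
# no .graph file of their own.
write_global_graph() {
    data_dir=$1
    graph_iri=$2
    printf '%s' "$graph_iri" > "${data_dir}/global.graph"
}

# Usage (paths and IRIs are placeholders):
#   write_graph_file   /path/to/files/myfile.n3 http://example.org/graph/myfile
#   write_global_graph /path/to/files           http://example.org/graph/default
```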
- Place the graph IRI, e.g., http://dbpedia.org, in the corresponding .graph file.
- Use isql to register the file(s) to be loaded by running the appropriate function, e.g.:
SQL> ld_dir ('/path/to/files', '*.n3', 'http://dbpedia.org');
- Use ld_dir() to load only from the specified directory, excluding any subdirectories:
SQL> ld_dir ('<source-filename-or-directory>', '<file name pattern>', 'graph iri');
- Use ld_dir_all() to load from the specified directory, including any and all subdirectories:
SQL> ld_dir_all ('<source-filename-or-directory>', '<file name pattern>', 'graph iri');
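Registration can also be scripted instead of typed at the SQL> prompt. The sketch below wraps the ld_dir() call shown above in a shell function and passes it to isql through exec=. The function names, the environment variables (ISQL, PORT, DB_USER, DB_PASS) and their defaults are assumptions chosen for illustration; only the ld_dir() call itself comes from the text.

```shell
#!/bin/sh
# Hypothetical connection settings; adjust to the local installation.
ISQL=${ISQL:-isql}
PORT=${PORT:-1111}
DB_USER=${DB_USER:-dba}
DB_PASS=${DB_PASS:-dba}

# Double any single quotes so the value is safe inside an SQL string literal.
sql_quote() {
    printf '%s' "$1" | sed "s/'/''/g"
}

# Register the files matching <pattern> in <dir> for loading into <graph-iri>.
register_dir() {
    dir=$(sql_quote "$1")
    pattern=$(sql_quote "$2")
    graph=$(sql_quote "$3")
    "$ISQL" "$PORT" "$DB_USER" "$DB_PASS" \
        exec="ld_dir ('$dir', '$pattern', '$graph');"
}

# Usage:
#   register_dir /path/to/files '*.n3' http://dbpedia.org
```

Swapping ld_dir for ld_dir_all in the exec string would register subdirectories as well.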
- The table DB.DBA.load_list can be used to check the list of data sets registered for loading, and the graph IRIs into which they will be or have been loaded. The ll_state field can have three values: 0, indicating the data set is to be loaded; 1, the data set load is in progress; or 2, the data set load is complete:
SQL> select * from DB.DBA.load_list;
ll_file             ll_graph      ll_state  ll_started            ll_done               ll_host  ll_work_time  ll_error
VARCHAR NOT NULL    VARCHAR       INTEGER   TIMESTAMP             TIMESTAMP             INTEGER  INTEGER       VARCHAR
_____________________________________________________________________________________________________________________________________

./dump/d1/file1.n3  http://file1  2         2010.10.20 9:21.18 0  2010.10.20 9:21.18 0  0        NULL          NULL
./dump/d2/file2.n3  http://file2  2         2010.10.20 9:21.18 0  2010.10.20 9:21.18 0  0        NULL          NULL
./dump/file.n3      http://file   2         2010.10.20 9:21.18 0  2010.10.20 9:21.18 0  0        NULL          NULL

3 Rows. -- 1 msec.
SQL>
- Finally, perform the bulk load of all data by executing the rdf_loader_run() function. The function prototype is:
rdf_loader_run (in max_files integer := null, in log_enable int := 2)
One side effect of the default log_enable = 2 setting is that triggers are disabled to speed up the loading of data. If triggers are required, e.g. for RDF graph replication between nodes, then log_enable should be set to 3 when calling the rdf_loader_run() function, as follows:
SQL> rdf_loader_run (log_enable=>3);
- Note: on a multi-core server machine, it is recommended that datasets be split into multiple files and registered in the DB.DBA.load_list table with the ld_dir() or ld_dir_all() function. Once the files are registered, multiple rdf_loader_run() functions can be run, one per available core, for parallel loading of data and hence maximum load speed.
A typical script that can be run from the command line is of the form:
$ more bulk_load.sh
isql 1111 dba dba exec="rdf_loader_run();" &
isql 1111 dba dba exec="rdf_loader_run();" &
isql 1111 dba dba exec="rdf_loader_run();" &
isql 1111 dba dba exec="rdf_loader_run();" &
isql 1111 dba dba exec="rdf_loader_run();" &
isql 1111 dba dba exec="rdf_loader_run();" &
isql 1111 dba dba exec="rdf_loader_run();" &
wait
isql 1111 dba dba exec="checkpoint;"
$
and run from the shell, for example with sh bulk_load.sh.
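The fixed list of loader invocations in bulk_load.sh can also be generated in a loop, so that the number of loaders follows the number of cores. This is a sketch under stated assumptions: the function name run_parallel_load and the environment variables are hypothetical, and the defaults simply mirror the port and credentials used in the script above.

```shell
#!/bin/sh
# Hypothetical connection settings; defaults mirror bulk_load.sh.
ISQL=${ISQL:-isql}
PORT=${PORT:-1111}
DB_USER=${DB_USER:-dba}
DB_PASS=${DB_PASS:-dba}

# Start <n> rdf_loader_run() sessions in the background, wait for all of
# them to finish, then issue a checkpoint.
run_parallel_load() {
    loaders=$1
    i=0
    while [ "$i" -lt "$loaders" ]; do
        "$ISQL" "$PORT" "$DB_USER" "$DB_PASS" exec="rdf_loader_run();" &
        i=$((i + 1))
    done
    wait    # block until every background loader has exited
    "$ISQL" "$PORT" "$DB_USER" "$DB_PASS" exec="checkpoint;"
}

# Usage, one loader per core (getconf _NPROCESSORS_ONLN is an assumption
# that holds on Linux and macOS but is not guaranteed everywhere):
#   run_parallel_load "$(getconf _NPROCESSORS_ONLN)"
```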
- All RDF loader threads can be stopped using the command rdf_load_stop(), at which point all currently running threads will be allowed to complete and then exit.
- Once rdf_loader_run() is complete, you can check the DB.DBA.load_list table to confirm that all data sets were loaded successfully. This is indicated by an ll_state value of 2 and an ll_error value of NULL.
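Rather than reading the whole table, the check can be narrowed to the rows that still need attention. The sketch below queries DB.DBA.load_list for entries whose ll_state is not 2 (the "complete" value described earlier) or whose ll_error is set. It assumes that ll_error stays NULL for a successful load; the function name and environment variables are hypothetical.

```shell
#!/bin/sh
# Hypothetical connection settings; adjust to the local installation.
ISQL=${ISQL:-isql}
PORT=${PORT:-1111}
DB_USER=${DB_USER:-dba}
DB_PASS=${DB_PASS:-dba}

# List registered files that are not yet complete (ll_state <> 2) or that
# recorded an error. Assumption: ll_error is NULL after a successful load.
list_unfinished_loads() {
    "$ISQL" "$PORT" "$DB_USER" "$DB_PASS" \
        exec="select ll_file, ll_state, ll_error from DB.DBA.load_list where ll_state <> 2 or ll_error is not null;"
}

# Usage:
#   list_unfinished_loads
```

An empty result set would suggest that every registered file finished loading without a recorded error.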
- On a Virtuoso Clustered Server, the "cl_exec('rdf_ld_srv(log_enable)')" command (where log_enable is 2 or 3, as with the rdf_loader_run() function) can be used to invoke a single "rdf_loader_run()" on each node of the cluster:
SQL> cl_exec('rdf_ld_srv()');

Done. -- 265956 msec.
SQL>
- Example of single file load
- Example of multiple file load
- Example of Dbpedia datasets load
- Virtuoso RDF Bulk Update "with_delete" option
- How can I determine the time taken to load datasets with RDF Bulk Loader