Extending SPARQL IRI Dereferencing with RDF Mappers
The Virtuoso SPARQL engine (called simply SPARQL below) supports IRI dereferencing, but it understands only RDF data: it can retrieve only files containing RDF serialized as RDF/XML, Turtle or N3. If the format is unknown, it tries mapping with the built-in WebDAV metadata extractor. To extend this feature to web or file resources that do not naturally contain RDF data (PDF or JPEG files, for example), the SPARQL engine provides a special mechanism called RDF mappers, which translate non-RDF data files to RDF.
To make SPARQL call an RDF mapper, the mapper must be registered; it is then called for a given URL or MIME type pattern. In other words, when SPARQL receives a format it does not recognize while dereferencing a URL, it looks in a special registry (a table) and matches either the MIME type or the IRI against a regular expression. If a match is found, the mapper function is called.
Registry
The table DB.DBA.SYS_RDF_MAPPERS serves as the registry for RDF mappers.
create table DB.DBA.SYS_RDF_MAPPERS (
RM_ID integer identity, -- mapper ID, designates the order of execution
RM_PATTERN varchar, -- a REGEX pattern to match the URL or MIME type
RM_TYPE varchar default 'MIME', -- which property of the current resource to match: MIME or URL are supported at present
RM_HOOK varchar, -- fully qualified PL function name, e.g. DB.DBA.MY_MAPPER_FUNCTION
RM_KEY long varchar, -- API-specific key to use
RM_DESCRIPTION long varchar, -- mapper description, free text
RM_ENABLED integer default 1, -- 0 or 1 flag: 1 includes the mapper in the processing chain, 0 excludes it
primary key (RM_TYPE, RM_PATTERN))
;
A mapper is currently registered, updated or unregistered with an ordinary DML statement, i.e. INSERT, UPDATE or DELETE.
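As a minimal sketch of such DML, the statements below register, disable and remove a mapper. The pattern 'text/x-example', the description text and the hook name DB.DBA.MY_MAPPER_FUNCTION are placeholders chosen for illustration, not mappers shipped with Virtuoso.

```sql
-- Register a hypothetical mapper for a MIME type pattern.
-- RM_ID is an identity column, so it is not listed here.
insert into DB.DBA.SYS_RDF_MAPPERS (RM_PATTERN, RM_TYPE, RM_HOOK, RM_KEY, RM_DESCRIPTION)
  values ('text/x-example', 'MIME', 'DB.DBA.MY_MAPPER_FUNCTION', null, 'Example mapper');

-- Take the mapper out of the processing chain without deleting it.
update DB.DBA.SYS_RDF_MAPPERS
  set RM_ENABLED = 0
  where RM_TYPE = 'MIME' and RM_PATTERN = 'text/x-example';

-- Unregister the mapper.
delete from DB.DBA.SYS_RDF_MAPPERS
  where RM_TYPE = 'MIME' and RM_PATTERN = 'text/x-example';
```

The rows are addressed through the primary key (RM_TYPE, RM_PATTERN) declared in the table definition above.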
Execution order and processing
As described above, when SPARQL retrieves a resource with unknown content, it looks in the mapper registry and loops over every record whose RM_ENABLED flag is set. Records are visited in RM_ID order. For each record it matches either the MIME type or the URL against the RM_PATTERN value and, if there is a match, calls the function named in the RM_HOOK column. If the function does not exist or signals an error, SPARQL moves on to the next record.
When does it stop looking? It stops when the mapper function returns a positive or negative number. A negative value stops processing and means that no RDF was supplied; a positive value means that RDF data was extracted. If the function returns zero, SPARQL looks for the next mapper. A mapper function can also return zero when the next mapper in the chain is expected to extract more RDF data.
If none of the mappers matches the resource (by MIME type or by URL), the built-in WebDAV metadata extractor is called.
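To see the chain that this lookup would walk on a given server, the registry can be queried directly. This is only an inspection query built from the columns described above; it does not reproduce the pattern matching itself.

```sql
-- List the enabled mappers in the order in which they are tried.
select RM_ID, RM_TYPE, RM_PATTERN, RM_HOOK
  from DB.DBA.SYS_RDF_MAPPERS
  where RM_ENABLED = 1
  order by RM_ID;
```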
Extension function
The mapper function is a PL stored procedure with the following signature:
THE_MAPPER_FUNCTION_NAME (
in graph_iri varchar,
in origin_uri varchar,
in destination_uri varchar,
inout content varchar,
inout async_notification_queue any,
inout ping_service any,
inout keys any
)
{
-- do processing here
-- return -1, 0 or 1 (as explained above in Execution order and processing section)
}
;
Parameters
- graph_iri - the target graph IRI
- origin_uri - the URI currently being processed
- destination_uri - the get:destination value
- content - the resource content
- async_notification_queue - if the INI parameter PingService is specified in the SPARQL section of the INI file, this is a pre-allocated asynchronous queue to be used for calling the ping service
- ping_service - the URL of the ping service configured in the PingService parameter of the SPARQL section of the INI file
- keys - the string value stored in the RM_KEY column for the given mapper; it can be a single string or a serialized array, and generally serves as mapper-specific data
Return value
- 0 - no data was retrieved or some next matching mapper must extract more data
- 1 - data is retrieved, stop looking for other mappers
- -1 - no data is retrieved, stop looking for more data
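The signature and return codes can be illustrated with a small mapper. This is a sketch under stated assumptions rather than a reference implementation: the procedure name is a placeholder, the single triple it writes (the content length as dcterms:extent) is invented for illustration, and loading into coalesce (destination_uri, graph_iri) is an assumed convention that the text above does not prescribe. DB.DBA.TTLP is Virtuoso's Turtle loader.

```sql
-- Hypothetical mapper: stores the content length of a resource as one triple.
create procedure DB.DBA.MY_MAPPER_FUNCTION (
  in graph_iri varchar,
  in origin_uri varchar,
  in destination_uri varchar,
  inout content varchar,
  inout async_notification_queue any,
  inout ping_service any,
  inout keys any
)
{
  declare ttl varchar;

  -- On any error return 0, so that the next matching mapper is tried.
  declare exit handler for sqlstate '*' { return 0; };

  -- Nothing to extract: stop the chain and report that no RDF was supplied.
  if (content is null or length (content) = 0)
    return -1;

  -- Build one Turtle statement about the resource (illustrative only).
  ttl := sprintf ('<%s> <http://purl.org/dc/terms/extent> "%d" .',
                  origin_uri, length (content));

  -- Assumption: write to the destination graph if given, else the target graph.
  DB.DBA.TTLP (ttl, origin_uri, coalesce (destination_uri, graph_iri), 0);

  -- RDF was extracted: stop looking for other mappers.
  return 1;
}
;
```

Returning 0 instead of 1 on the last line would let later mappers in the chain add more data for the same resource.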
RDF Mappers package content
Virtuoso supplies, as the rdf_mappers_dav VAD package, a cartridge for extracting RDF data from certain popular Web resources and file types. If it is not already installed, it can be installed with the VAD_INSTALL function; see the VAD chapter in the documentation for details.
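A minimal sketch of the installation call from ISQL follows. The file name and location are assumptions; use the path of the package on your server, which must be readable by the Virtuoso process.

```sql
-- Install the package from the server's file system.
-- The second argument 0 means "file system path"; 1 would mean a DAV path.
-- 'rdf_mappers_dav.vad' is an assumed file name in the server working directory.
DB.DBA.VAD_INSTALL ('rdf_mappers_dav.vad', 0);
```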
HTTP-in-RDF
Maps the HTTP request response to the HTTP Vocabulary in RDF; see http://www.w3.org/2006/http#.
This mapper is disabled by default. If it is enabled, it must be first in the order of execution.
It also always returns 0, which means that other mappers should supply more data.
HTML
This mapper is composite; it looks for metadata that can be specified in an HTML page as follows:
- Embedded/linked RDF
  - scan for meta in RDF <link rel="meta" type="application/rdf+xml"
  - RDF embedded in xHTML (as markup or inside XML comments)
- Micro-formats
  - GRDDL - GRDDL Data Views: RDF expressed in XHTML and XML: http://www.w3.org/2003/g/data-view#
  - eRDF - http://purl.org/NET/erdf/profile
  - RDFa
  - hCard - http://www.w3.org/2006/03/hcard
  - hCalendar - http://dannyayers.com/microformats/hcalendar-profile
  - hReview - http://dannyayers.com/micromodels/profiles/hreview
  - relLicense - CC license: http://web.resource.org/cc/schema.rdf
  - Dublin Core (DCMI) - http://purl.org/dc/elements/1.1/
  - geoURL - http://www.w3.org/2003/01/geo/wgs84_pos#
  - Google Base - OpenLink Virtuoso specific mapping
  - Ning Metadata - OpenLink Virtuoso specific mapping
- Feeds extraction
  - RSS/RDF: http://purl.org/rss/1.0/
  - SIOC: http://rdfs.org/sioc/ns#
  - AtomOWL: http://atomowl.org/ontologies/atomrdf#
- xHTML metadata transformation using FOAF (foaf:Document) and Dublin Core properties (dc:title, dc:subject etc.)
  - FOAF: http://xmlns.com/foaf/0.1/
The HTML page mapper looks for RDF data in the order listed above. It tries to extract metadata at each step and returns a positive flag if any step yields RDF data. If the page URL matches some other RDF mapper listed in the registry, it returns 0 so that the next mapper can extract more data. To function properly, this mapper must be executed before any other specific mappers.
Flickr URLs
This mapper extracts metadata of Flickr images using the Flickr REST API. To function properly it must have a key configured. The Flickr mapper expresses the metadata using the CC license, Dublin Core, Dublin Core Metadata Terms, GeoURL, FOAF and EXIF (http://www.w3.org/2003/12/exif/ns/) ontologies.
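Since mapper keys live in the RM_KEY column of the registry, configuring a key amounts to updating the mapper's row. The sketch below assumes that the Flickr mapper's hook name contains the word FLICKR and uses placeholder values for the row ID and the key; check the result of the first query before running the update.

```sql
-- Locate the Flickr mapper's row (assumes 'FLICKR' appears in the hook name).
select RM_ID, RM_TYPE, RM_PATTERN, RM_HOOK
  from DB.DBA.SYS_RDF_MAPPERS
  where upper (RM_HOOK) like '%FLICKR%';

-- Store the API key on that row. 42 and the key value are placeholders:
-- substitute the RM_ID returned above and your own key.
update DB.DBA.SYS_RDF_MAPPERS
  set RM_KEY = 'YOUR_FLICKR_API_KEY'
  where RM_ID = 42;
```

The same pattern would apply to the other key-based mappers, with whatever key format each of them expects.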
Amazon URLs
This mapper extracts metadata for Amazon articles using the Amazon REST API. It needs an Amazon API key in order to function.
eBay URLs
This mapper uses the eBay REST API to extract metadata for eBay articles. It needs a key and a user name to be configured in order to work.
Open Office (OO) documents
OO documents contain metadata that can be extracted using UNZIP, so this extractor needs the Virtuoso unzip plugin to be configured on the server.
Yahoo traffic data URLs
Transforms the result of Yahoo traffic data to RDF.
iCal files
Transforms iCal files to RDF as per http://www.w3.org/2002/12/cal/ical# .
Binary content, PDF, PowerPoint
Unknown binary content, PDF and MS PowerPoint files can be transformed to RDF using the Aperture framework (http://aperture.sourceforge.net/). This mapper needs Virtuoso with Java hosting support, the Aperture framework and MetaExtractor.class installed on the host system in order to work.
The Aperture framework and MetaExtractor.class must be installed on the system before the RDF mappers package is installed. If the package is already installed, re-installing the VAD activates this mapper.
Setting-up Virtuoso with Java hosting to run Aperture framework
- Install Virtuoso with Java hosting
- Download the Aperture framework from http://aperture.sourceforge.net
- Unpack it in the Virtuoso working directory, e.g. where the database file is, and make a symbolic link 'lib' to it
- In the Parameters section of the INI file, configure JavaClasspath = lib/sesame-2.0-alpha-3.jar:lib/openrdf-util-crazy-debug.jar:lib/htmlparser-1.6.jar:lib/activation-1.0.2-upd2.jar:lib/bcmail-jdk14-132.jar:lib/poi-scratchpad-3.0-alpha2-20060616.jar:lib/openrdf-model-2.0-alpha-3.jar:lib/jacob-1.10-pre4.jar:lib/bcprov-jdk14-132.jar:lib/demork-2.0.jar:lib/commons-codec.jar:lib/fontbox-0.1.0-dev.jar:lib/pdfbox-0.7.3.jar:lib/applewrapper-0.1.jar:lib/junit-3.8.1.jar:lib/winlaf-0.5.1.jar:lib/aperture-test-2006.1-alpha-3.jar:lib/openrdf-util-fixed-locking.jar:lib/commons-logging-1.1.jar:lib/mail-1.4.jar:lib/aperture-2006.1-alpha-3.jar:lib/poi-3.0-alpha2-20060616.jar:lib/ical4j-cvs20061019.jar:lib/openrdf-util-2.0-alpha-3.jar:lib/rio-2.0-alpha-3.jar:lib/poi-contrib-3.0-alpha2-20060616.jar:lib/aperture-examples-2006.1-alpha-3.jar:.
- Make sure MetaExtractor.class is in the Virtuoso working directory
- Start the Virtuoso server with Java hosting support
- Connect with the ISQL tool and check that the installation is complete:
SQL> DB.DBA.import_jar (NULL, 'MetaExtractor', 1);
Done. -- 466 msec.
SQL> select "MetaExtractor"().getMetaFromFile ('some_pdf_in_server_working_dir.pdf', 5);
... some RDF must be returned ...
Important: the above is verified to work with aperture-2006.1-alpha-3 on a Linux system. For a different version of Aperture or a different operating system, some adjustments may be needed, e.g. re-building MetaExtractor.class and changing the CLASSPATH.
Examples & tutorials
How do you write your own RDF mapper? See the Virtuoso tutorial on this subject: http://demo.openlinksw.com/tutorial/rdf/rd_s_1/rd_s_1.vsp .
References