Virtuoso Sponger
Abstract
Many past commentators on the Semantic Web have argued that its maturation has been slowed by the classic "chicken-and-egg" problem: stimulating the development of Semantic Web applications requires a critical mass of RDF data, yet without these applications that body of RDF data will not be created. In response to this need, a new class of tools emerged, so-called /RDFizers/, for transforming existing data into RDF.
Whether or not these concerns remain valid (indeed, many would argue that the Semantic Web is growing rapidly), RDFizers are crucial enablers for driving the transition from the traditional Document Web to the emerging Semantic Data Web.
One such RDFizer is the "Sponger". Introduced in Virtuoso Universal Server 5.0, the Sponger provides an as yet unrivaled set of tools for converting non-RDF data into RDF, packaged in an easily extensible framework, with tight integration to the Virtuoso RDF Quad Store. This white paper provides an in-depth description of these facilities.
Other facets of Virtuoso's Semantic Web related feature set are explored in the accompanying white papers "RDF Views of SQL Data" and "Deploying RDF Linked Data via Virtuoso Universal Server".
Table of Contents
- Virtuoso Sponger
- Abstract
- Table of Contents
- What Is The Sponger?
- Sponger Benefits
- Using The Sponger
- SPARQL Query Processor
- SPARQL Extensions for IRI Dereferencing of FROM Clauses
- SPARQL Extensions for IRI Dereferencing of Variables
- RDF Proxy Service
- RDF Client Applications
- ODS-Briefcase (Virtuoso WebDAV)
- Sponger and ODS-Briefcase Structured Data Extractor
- Directly via Virtuoso PL
- Consuming the Generated RDF Structured Data
- Data Sources Supported by the Sponger
- How Does It Work?
- Metadata Extraction
- Extraction Pipeline
- Mapping to Ontologies
- SIOC as a Data Space Glue Ontology
- Proxy Service Caching
- Sponger Architecture
- Metadata Extractors
- Ontology Mappers
- Cartridge Registry
- Cartridge Invocation
- Sponger Configuration Using Conductor
- Cartridge Packaging & Deployment
- XSLT Templates
- GRDDL Mappings
- Custom Cartridges
- Cartridge Hook Prototype
- Example Cartridge Implementations
- Basic Sponger Cartridge
- Flickr Cartridge
- Sponger Permissions
- Custom Resolvers
- Sponger Usage Examples
- RDF Proxy Service
- SPARQL Processor
- Custom Cartridge
- Appendix A: Ontologies Supported by ODS-Briefcase
- Appendix B: RDF Cartridges VAD Package
- HTTP in RDF
- XHTML and Feeds
- Flickr Images / URLs
- Amazon Articles / URLs
- eBay Articles / URLs
- Documents
- Yahoo Traffic Data URLs
- iCalendar Files
- Binary Content, PDF & Powerpoint Files
- Appendix C: Configuring the Aperture Framework
- Appendix D: Deprecated Naming Conventions
- Glossary
What Is The Sponger?
Virtuoso 5.0 introduced the /Sponger/, built-in RDF middleware for transforming non-RDF data into RDF "on the fly". Its goal is to use non-RDF Web data sources as input, e.g. (X)HTML Web Pages, (X)HTML Web pages hosting microformats, and even Web services such as those from Google, Del.icio.us, Flickr, etc., and create RDF as output. The implication of this facility is that you can use non-RDF data sources as Semantic Web data sources. Architecturally, it is comprised of a number of Sponger Cartridges which are themselves comprised of a Metadata Extractor and RDF Schema/Ontology Mapper components. Metadata extracted from non-RDF resources is used as the basis for generating structured data by mapping it to a suitable ontology.
The Sponger is highly customizable. Custom cartridges can be developed using any language supported by the Virtuoso Server Extensions API enabling RDF instance data generation from resource types not available in the default Sponger Cartridge collection bundled as a Virtuoso VAD package (rdf_cartridges_dav.vad).
Figure 1: Virtuoso metadata extraction & RDF structured data generation
Sponger Benefits
The Sponger delivers middleware that accelerates the bootstrapping of the Semantic Data Web by generating RDF Linked Data from non-RDF data sources, unobtrusively. This "Swiss army knife" for on-the-fly Linked Data generation provides a bridge between the traditional Document Web and the Semantic Data Web ("Data Web").
Sponging data from non-RDF Web sources and converting it to RDF exposes the data in a canonical form for querying and inference, and enables fast and easy linked data Mesh-ups as an enhancement of current Web 2.0 oriented Mash-ups. The key difference is that Mesh-ups are constructed from structured data, while Mash-ups are constructed from semi-structured or unstructured data sources.
The RDF extraction and instance data generation products that offer functionality demonstrated by the Sponger are also commonly referred to as "RDFizers".
Using The Sponger
The Sponger can be invoked via the following mechanisms:
- Virtuoso SPARQL query processor
- RDF Proxy Service exposed at the "/proxy/rdf/" endpoint of any Virtuoso installation (e.g. http://localhost:8890/proxy/rdf )
- OpenLink RDF client applications
- ODS-Briefcase (Virtuoso WebDAV)
- Directly via Virtuoso PL
SPARQL Query Processor
Virtuoso extends the SPARQL Query Language such that it is possible to download RDF resources from a given IRI, parse them, and then store the resulting triples in a graph, with all three operations performed during the SPARQL query-execution process. The IRI/URI of the graph used to store the triples is usually equal to the URL from which the resources are downloaded; consequently, the feature is known as "IRI/URI dereferencing". If a SPARQL query instructs the SPARQL processor to retrieve the target graph into local storage, then the SPARQL sponger will be invoked.
The SPARQL extensions for IRI dereferencing are described below. Essentially, these enable downloading and local storage of selected triples either from one or more named graphs, or based on a proximity search from a starting URI for entities matching the select criteria and also related by the specified predicates, up to a given depth. For full details please refer to the OpenLink Virtuoso Reference Manual, section "IRI Dereferencing".
SPARQL Extensions for IRI Dereferencing of FROM Clauses
Virtuoso extends the syntax of the SPARQL "FROM" and "FROM NAMED" clauses. It allows an additional list of options at the end of both clauses: option ( get:param1 value1, get:param2 value2, ... ), where the names of the allowed parameters are:
- *get:soft* is the retrieval mode. Supported values are "soft" and "replace" or "replacing". If the value is "soft" then the SPARQL processor will not try to retrieve triples if the destination graph is already populated, i.e., isn't empty. The get:soft option must be present in order for the other get: options to be recognized. Values "replace" or "replacing" clear the local graph cache.
- *get:uri* is the IRI to retrieve if it is not equal to the IRI of the FROM clause. This option can be used if the data should be retrieved from a mirror, not from the original resource location, or in any other case when the destination graph IRI differs from the location of the resource.
- *get:method* is the HTTP method which should be used to retrieve the resource. Supported methods are "GET" for plain HTTP and "MGET" for a URIQA web service endpoint. By default, "MGET" is used for IRIs that end with "/" and "GET" for everything else.
- *get:refresh* is the maximum allowed age of the locally cached resource, irrespective of what is specified by the server where the resource resides. The option value is a positive integer specifying the maximum age in seconds. Virtuoso reads HTTP headers and uses the "Date", "ETag", "Expires", "Last-Modified", "Cache-Control" and "Pragma: no-cache" fields to calculate when the resource should be reloaded. The get:refresh option value can override and reduce this calculated value, but cannot increase it.
- *get:proxy* is the address (in the form of a "host:port" string) of the proxy server to use if direct download is impossible.
*Example:*
SELECT ?id
FROM NAMED <http://myhost/user1.ttl>
OPTION (get:soft "soft", get:method "GET")
FROM NAMED <http://myhost/user2.ttl>
OPTION (get:soft "soft", get:method "GET")
WHERE { GRAPH ?g { ?id a ?o } };
If a get:... parameter repeats for every FROM clause, it can be written as a global pragma; so the above query can be rewritten as:
DEFINE get:method "GET"
DEFINE get:soft "soft"
SELECT ?id
FROM NAMED <http://myhost/user1.ttl>
FROM NAMED <http://myhost/user2.ttl>
WHERE { GRAPH ?g { ?id a ?o } };
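The rewrite from per-clause OPTION lists to global pragmas can be done mechanically. The following Python sketch is illustrative only: `build_query` and its arguments are hypothetical helper names, not part of Virtuoso. It hoists any get: option that has the same value on every FROM NAMED clause into a DEFINE line and leaves the remaining options on their clauses.

```python
def _literal(value):
    # Integers (e.g. for get:refresh) are written bare, everything else quoted.
    return str(value) if isinstance(value, int) else f'"{value}"'

def build_query(select, graphs, where):
    """Build a query string; graphs maps each FROM NAMED IRI to its get: options."""
    option_sets = list(graphs.values())
    common = {}
    if option_sets:
        for name, value in option_sets[0].items():
            # Hoist an option only if every clause carries it with the same value.
            if all(opts.get(name) == value for opts in option_sets):
                common[name] = value

    lines = [f"DEFINE get:{n} {_literal(v)}" for n, v in sorted(common.items())]
    lines.append(f"SELECT {select}")
    for iri, opts in graphs.items():
        lines.append(f"FROM NAMED <{iri}>")
        rest = {k: v for k, v in opts.items() if k not in common}
        if rest:
            pairs = ", ".join(f"get:{k} {_literal(v)}" for k, v in sorted(rest.items()))
            lines.append(f"OPTION ({pairs})")
    lines.append(f"WHERE {where}")
    return "\n".join(lines)

query = build_query(
    "?id",
    {
        "http://myhost/user1.ttl": {"soft": "soft", "method": "GET"},
        "http://myhost/user2.ttl": {"soft": "soft", "method": "GET"},
    },
    "{ GRAPH ?g { ?id a ?o } }",
)
print(query)
```

With the two graphs above, both options are shared, so the printed query uses only DEFINE lines, as in the rewritten example.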
SPARQL Extensions for IRI Dereferencing of Variables
In addition to the "define get:..." SPARQL extensions for IRI dereferencing in FROM clauses, Virtuoso supports dereferencing SPARQL IRIs used in the WHERE clause (graph patterns) of a SPARQL query via a set of "define input:grab-..." pragmas.
Consider an RDF resource which describes a member of a contact list, /user1/, and also contains statements about other users known to him, /user2/ and /user3/. Resource /user3/ in turn contains statements about /user4/, and so on. If all the data relating to these users were loaded into Virtuoso's RDF database, the query to retrieve the details of all the users could be quite simple, e.g.:
SELECT ?id ?fullname ?email
WHERE { GRAPH ?g { ?id a <Person> ; <FullName> ?fullname ; <Email> ?email . }}
But what if some or all of these resources were not present in Virtuoso's quad store? The highly distributed nature of the Semantic Data Web makes it likely that these interlinked resources would be spread across several data spaces. Virtuoso's 'input:grab-...' extensions to SPARQL enable IRI dereferencing in such a way that all appropriate resources are loaded, i.e. "sponged", during query execution, even if some of the resources are not known beforehand. For any particular resource matched (and, if necessary, downloaded) by the query, it is possible to download related resources via one or more designated predicate paths to a specifiable depth, i.e. number of 'hops', distance, or degrees of separation (in other words, to compute transitive closures in SPARQL).
Using Virtuoso's 'input:grab-' pragmas to enable sponging, the above query might be recast to:
DEFINE input:grab-var "?more"
DEFINE input:grab-depth 10
DEFINE input:grab-limit 100
DEFINE input:grab-base "http://myhost/"
SELECT ?id ?fullname ?email
WHERE {
GRAPH ?g {
?id a <Person> ;
<FullName> ?fullname ;
<Email> ?email .
OPTIONAL { ?id <SeeAlso> ?more }
}
};
A more advanced example, showing a designated predicate traversal path via the input:grab-seealso pragma, is:
DEFINE input:grab-iri <http://dbpedia.org/resource/Munich>
DEFINE input:grab-depth 10
DEFINE input:grab-seealso <http://dbpedia.org/property/hasPhotoCollection>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT *
WHERE {<http://dbpedia.org/resource/Munich> foaf:depiction ?o}
A summary of the input:grab pragmas is given below. Again, for full details please refer to the Virtuoso Reference Manual.
- *input:grab-var* specifies the name of the SPARQL variable whose values should be used as IRIs of resources that should be downloaded.
- *input:grab-iri* specifies an IRI that should be retrieved before executing the rest of the query, if it is not in the quad store already. (This pragma can be included multiple times).
- *input:grab-seealso* specifies a predicate IRI to be used when traversing a graph. (This pragma can be included multiple times).
- *input:grab-limit* sets the maximum number of resources (graph subject or object nodes) to be retrieved from a target graph.
- *input:grab-depth* sets the maximum 'degrees of separation' or links (predicates) between nodes in the target graph.
- *input:grab-all "yes"* instructs the SPARQL processor to dereference everything related to the query. All variables and literal IRIs in the query become values for input:grab-var and input:grab-iri. Expect this to be slow, since every IRI in the query may trigger a retrieval.
- *input:grab-base* specifies the base IRI to use when converting relative IRIs to absolute. (Default: empty string).
- *input:grab-destination* overrides the default IRI dereferencing and forces all retrieved triples to be stored in the specified graph.
- *input:grab-loader* identifies the procedure used to retrieve each resource via HTTP, parse and store it. (Default: DB.DBA.RDF_SPONGE_UP)
- *input:grab-resolver* identifies the procedure that resolves IRIs and determines the HTTP method of retrieval. (Default: DB.DBA.RDF_GRAB_RESOLVER_DEFAULT)
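The interplay of input:grab-depth and input:grab-limit can be pictured as a bounded breadth-first crawl. The Python sketch below is a toy model, not Virtuoso's implementation: the `grab` function is hypothetical, the `links` dictionary stands in for remote resources connected by a followed predicate, and the exact way Virtuoso counts hops and resources may differ.

```python
from collections import deque

def grab(start_iris, links, depth, limit):
    """Return the IRIs a crawl bounded by depth and limit would retrieve, in order."""
    fetched = []
    seen = set()
    queue = deque((iri, 0) for iri in start_iris)
    while queue and len(fetched) < limit:
        iri, hops = queue.popleft()
        if iri in seen:
            continue
        seen.add(iri)
        fetched.append(iri)              # stands in for downloading the resource
        if hops < depth:                 # follow links only within the depth bound
            for target in links.get(iri, []):
                if target not in seen:
                    queue.append((target, hops + 1))
    return fetched

# user1 mentions user2 and user3; user3 mentions user4.
links = {
    "http://myhost/user1.ttl": ["http://myhost/user2.ttl", "http://myhost/user3.ttl"],
    "http://myhost/user3.ttl": ["http://myhost/user4.ttl"],
}
print(grab(["http://myhost/user1.ttl"], links, depth=10, limit=100))
```

In this model a depth of 1 stops at user2 and user3, while a limit of 2 stops the crawl after two resources regardless of depth.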
RDF Proxy Service
Sponger functionality is also exposed via Virtuoso's "/proxy/rdf/" endpoint, as an in-built REST style Web service available in any Virtuoso standard installation. This web service takes a target URL and either returns the content "as is" or tries to transform (by sponging) to RDF. Thus, the proxy service can be used as a 'pipe' for RDF browsers to browse non-RDF sources.
The RDF proxy service takes the following URL parameters:
- url: the URL of the target
- force: if 'rdf' is specified, the service will try to extract RDF data from the target and return it
- header: HTTP headers to be sent to the target
- output-format: if 'force=rdf' is specified, the output-format parameter specifies the output MIME type of the RDF data. The default is 'rdf+xml'. It can also be 'n3', 'turtle' or 'ttl'. When no 'output-format' is given and RDF data is asked for, the result will be serialized with a MIME type determined by the 'Accept' header, i.e. the proxy service will do content negotiation.
Example:
The URL below can be pasted into a traditional (X)HTML oriented document-web browser:
http://demo.openlinksw.com/proxy/rdf/http://www.w3c.org/People/Connolly
Notice that the URL of the data source (http://www.w3c.org/People/Connolly) is passed to the proxy, here appended to the endpoint path. Sponger options such as force=rdf are supplied as query-string parameters.
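A request URL for the proxy can also be assembled programmatically. The Python sketch below is a minimal illustration under stated assumptions: `proxy_url` is a hypothetical helper, the parameters listed above (url, force, output-format) are assumed to be accepted as an ordinary query string, and percent-encoding the target is this sketch's choice rather than a documented requirement.

```python
from urllib.parse import urlencode

def proxy_url(proxy_base, target, force_rdf=True, output_format=None):
    """Compose a proxy request URL from the documented parameter names."""
    params = {"url": target}
    if force_rdf:
        params["force"] = "rdf"
    if output_format is not None:
        # Only meaningful together with force=rdf, e.g. 'n3', 'turtle' or 'ttl'.
        params["output-format"] = output_format
    return proxy_base + "?" + urlencode(params)

print(proxy_url("http://localhost:8890/proxy/rdf/",
                "http://www.w3c.org/People/Connolly",
                output_format="n3"))
```

Leaving `output_format` unset corresponds to letting the service negotiate the serialization from the request's Accept header.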
OpenLink RDF Client Applications
OpenLink currently provides two RDF client applications bundled as part of the OpenLink AJAX Toolkit, a Data Explorer and an interactive SPARQL query builder, iSPARQL. Both utilise sponging.
The OpenLink RDF Browser uses the /proxy/rdf/ service by default, running in 'soft' sponge mode.
iSPARQL uses the /sparql service and allows the user more control over sponging through five possible settings:
- Get Local Data Only
- Get Remote Data When Missing Locally
- Get All Remote Data
- Get All Remote & Related Data
- Get Everything
These settings are translated to IRI dereferencing pragmas on the server as follows:
| iSPARQL sponging setting | /sparql endpoint: "should sponge" query parameter value | SPARQL processor directives |
| Get Local Data Only | N/A | N/A |
| Get Remote Data When Missing Locally | soft | define get:soft "soft" |
| Get All Remote Data | grab-all | define input:grab-all "yes" define input:grab-depth 5 define input:grab-limit 100 |
| Get All Remote & Related Data | grab-seealso | define input:grab-all "yes" define input:grab-depth 5 define input:grab-limit 200 define input:grab-seealso <http://www.w3.org/2000/01/rdf-schema#seeAlso> define input:grab-seealso <http://xmlns.com/foaf/0.1/seeAlso> define input:grab-seealso <http://www.w3.org/2000/01/rdf-schema#isDefinedBy> define input:grab-seealso <http://rdfs.org/sioc/ns#links_to> define input:grab-seealso <Other-Transitive-Predicates> |
| Get Everything | grab-everything | define input:grab-all "yes" define input:grab-intermediate "yes" define input:grab-depth 5 define input:grab-limit 500 define input:grab-seealso <http://www.w3.org/2000/01/rdf-schema#seeAlso> define input:grab-seealso <http://xmlns.com/foaf/0.1/seeAlso> |
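The table amounts to a lookup from a sponging setting to a block of pragmas placed in front of the query. The Python sketch below mirrors three of its rows; `SPONGE_PRAGMAS` and `with_sponging` are hypothetical names used for illustration, and the grab-seealso row is omitted because its last entry in the table is a placeholder.

```python
RDFS_SEE_ALSO = "<http://www.w3.org/2000/01/rdf-schema#seeAlso>"
FOAF_SEE_ALSO = "<http://xmlns.com/foaf/0.1/seeAlso>"

# Keys are the "should sponge" values from the table; None means local data only.
SPONGE_PRAGMAS = {
    None: [],
    "soft": ['define get:soft "soft"'],
    "grab-all": [
        'define input:grab-all "yes"',
        "define input:grab-depth 5",
        "define input:grab-limit 100",
    ],
    "grab-everything": [
        'define input:grab-all "yes"',
        'define input:grab-intermediate "yes"',
        "define input:grab-depth 5",
        "define input:grab-limit 500",
        f"define input:grab-seealso {RDFS_SEE_ALSO}",
        f"define input:grab-seealso {FOAF_SEE_ALSO}",
    ],
}

def with_sponging(query, should_sponge=None):
    """Prefix a SPARQL query with the pragmas for the chosen setting."""
    if should_sponge not in SPONGE_PRAGMAS:
        raise ValueError(f"unknown sponging setting: {should_sponge!r}")
    return "\n".join(SPONGE_PRAGMAS[should_sponge] + [query])

print(with_sponging("SELECT * WHERE { ?s ?p ?o }", "grab-all"))
```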
ODS-Briefcase (Virtuoso WebDAV)
ODS-Briefcase is a component of OpenLink Data Spaces (ODS), a new generation distributed collaborative application platform for creating Semantic Web presence via Data Spaces derived from weblogs, wikis, feed aggregators, photo galleries, shared bookmarks, discussion forums and more. It is also a high level interface to the Virtuoso WebDAV repository.
ODS-Briefcase offers file-sharing functionality that includes the following features:
- Web browser-based interactions
- Web Services (direct use of the HTTP based WebDAV protocol)
- SPARQL query language support - all WebDAV resources are exposed as SIOC ontology instance data (RDF data sets)
When resources or documents are put into the ODS Briefcase and are made publicly readable (via a Unix-style +r permission or ACL setting) and the resource in question is of a supported content type, metadata is automatically extracted at file upload time.
*Note:* /ODS-Briefcase extracts metadata from a wide array of file formats, automatically./
The extracted metadata is available in two forms, pure WebDAV and RDF (with RDF/XML or N3/Turtle serialization options), and is optionally synchronized with the underlying Virtuoso Quad Store.
All publicly readable resources in WebDAV have their owner, creation time, update time, size and tags published, plus associated content type dependent metadata. This WebDAV metadata is also available in RDF form as a SPARQL query-able graph accessible via the SPARQL protocol endpoint using the WebDAV location as the RDF data set URI (graph or data source URI).
You can also use a special RDF_Sink folder to automate the process of uploading RDF resource files into the Virtuoso Quad Store via WebDAV or raw HTTP. The properties of the special folder control whether sponging (RDFization) occurs. By default, this feature is enabled across all Virtuoso and ODS installations (with an ODS-Briefcase Data Space instance enabled).
Raw HTTP Example using CURL:
Username: demo
Password: demo
Source File: wine.rdf
Destination Folder: http://demo.openlinksw.com/DAV/home/demo/rdf_sink/
Content Type: application/rdf+xml
$ curl -v -T wine.rdf -H content-type:application/rdf+xml http://demo.openlinksw.com/DAV/home/demo/rdf_sink/ -u demo:demo
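The same upload can be expressed with Python's standard library. The sketch below only builds the request and does not send it; `build_upload_request` is a hypothetical helper. It reproduces two things curl does for the command above: appending the local file name to a destination URL that ends in "/", and sending the -u credentials as HTTP Basic authentication.

```python
import base64
import urllib.request

def build_upload_request(folder_url, filename, body, user, password,
                         content_type="application/rdf+xml"):
    """Build an HTTP PUT request equivalent to the curl upload."""
    # curl -T appends the file name when the URL ends with a slash.
    url = folder_url + filename if folder_url.endswith("/") else folder_url
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, data=body, method="PUT")
    request.add_header("Content-Type", content_type)
    request.add_header("Authorization", "Basic " + token)
    return request

request = build_upload_request(
    "http://demo.openlinksw.com/DAV/home/demo/rdf_sink/",
    "wine.rdf",
    b"<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'/>",
    "demo",
    "demo",
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the actual upload.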
Finally, you can also get RDF data into Virtuoso's Quad Store via WebDAV using the Virtuoso Web Crawler utility (configurable via the Virtuoso Conductor UI). This feature also provides the ability to enable or disable Sponging as depicted below in Figure 2.
Sponger and ODS-Briefcase Structured Data Extractor Interrelationship
As the Sponger and ODS-Briefcase both extract structured data, what is the relationship between these two facilities?
The principal difference between the two is that the Sponger is an /RDF data crawler & generator/, whereas Briefcase's structured data extractor is a WebDAV resource filter. The Briefcase structured data extractor is aimed at providing RDF data from WebDAV resources. Thus, if none of the available Sponger cartridges are able to extract metadata and produce RDF structured data, the Sponger calls upon the Briefcase extractor as the last resort in the RDF structured data generation pipeline.
Directly via Virtuoso PL
Sponger cartridges are invoked through a cartridge hook which provides a Virtuoso PL entry point to the packaged functionality. Should you wish to utilize the Sponger from your own Virtuoso PL procedures, you can do so by calling these hook routines directly. Full details of the hook function prototype and how to define your own cartridges are presented later in this document.
Consuming the Generated RDF Structured Data
The generated RDF-based structured data (RDF) can be consumed in a number of ways, depending on whether or not the data is persisted in Virtuoso's RDF Quad Store.
If the data is persisted, it can be queried through the Virtuoso SPARQL endpoint associated with any Virtuoso instance: /sparql. The RDF is exposed in a graph typically identified using a URL matching the source resource URL from which the RDF data was generated. Naturally, any SQL query can also access this, since SPARQL can be freely intermixed with SQL via Virtuoso's SPASQL (SPARQL inside SQL) functionality. RDF data is also accessible through Virtuoso's implementation of the URIQA protocol.
If not persisted, as is the case with the RDF Proxy Service, the data can be consumed by an RDF aware Web client, e.g. an RDF browser such as the OpenLink RDF Browser.
Data Sources Supported by the Sponger
- RDF, including N3 or Turtle: automatically recognized ontologies include:
- SIOC, SKOS, FOAF, AtomOWL, Annotea, Music Ontology, Bibliographic Ontology, EXIF, vCard, and others
- (X)HTML pages
- HTML header metadata tags: Dublin Core
- Embedded microformats: eRDF, RDFa, hCard, hCalendar, XFN, and xFolk
- Syndication Formats
- RSS 2.0
- Atom
- OPML
- OCS
- XBEL (for bookmarks)
- GRDDL
- REST-style Web Service APIs: Google Base, Flickr, Del.icio.us, Ning, Amazon, eBay, Freebase, Facebook, raw HTTP, etc.
- Files: A multitude of built-in extractors are available for a variety of file formats and MIME types including:
- Binary files: MS Office, OpenOffice, Open Document Format, images, audio, video, etc.
- Web services contract files: (BPEL, WSDL), XBRL, XBEL
- Data exchange formats: iCalendar, vCard
- Virtuoso VADs
- OpenLink license files
- Third party metadata extraction frameworks: Aperture, Spotlight and SIMILE RDFizers
How Does It Work?
Metadata Extraction
When an RDF aware client requests data from a network accessible resource via the Sponger the following events occur:
- A request is made for data in RDF form (explicitly via Content Negotiation using HTTP Accept Headers), and if RDF is returned nothing further happens.
- If RDF isn't returned, the Sponger passes the data through a *Metadata Extraction Pipeline* (using Metadata Extractors).
- The extracted data is transformed into RDF via a Mapping Pipeline. RDF entities (instance data) are generated by way of ontology matching and mapping.
- RDF instance data (a.k.a. RDF Structured Linked Data) are returned to the client.
Extraction Pipeline
Depending on the file or format type detected at ingest, the Sponger applies the appropriate metadata extractor. Detection occurs at the time of content negotiation instigated by the retrieval user agent. The normal metadata extraction pipeline processing is as follows:
- The Sponger tries to get RDF data (including N3 or Turtle) directly from the dereferenced URL. If it finds some, it returns it, otherwise, it continues.
- If the URL refers to an HTML file, the Sponger tries to find "link" elements referring to RDF documents. If it finds one or more of them, it adds their triples into a temporary RDF graph and continues its processing.
- The Sponger then scans for microformats markup or GRDDL profile URIs. If either is found, RDF triples are generated and added to a temporary RDF graph before continuing.
- If the Sponger finds eRDF or RDFa data in the HTML file, it extracts it from the HTML file and inserts it into the RDF graph before continuing.
- If the Sponger finds it is talking to a web service such as Google Base, it maps the API of the web service with an ontology, creates triples from that mapping and inserts the triples into the temporary RDF graph.
- The next fallback is scanning of the HTML header for different Web 2.0 types or RSS 1.1, RSS 2.0, Atom, etc.
- Failing those tests, the scan then uses standard Web 1.0 rules to search in the header tags for metadata (typically Dublin Core) and transforms them to RDF and again adds them to the temporary graph. Other HTTP response header data may also be transformed to RDF.
- If nothing has been retrieved at this point, the Briefcase metadata extractor is tried.
- Finally, if nothing is found, the Sponger will return an empty graph (should the HTTP cartridge be disabled).
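The control flow of this pipeline can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the Sponger's code: `run_pipeline`, the stage functions and the dictionary-based resource are all hypothetical, and each real stage is reduced to a function that returns the triples it could extract.

```python
DC = "http://purl.org/dc/elements/1.1/"

def header_metadata_stage(resource):
    """Toy stage: turn Dublin Core style header metadata into triples."""
    return [(resource["url"], DC + name, value)
            for name, value in resource.get("meta", {}).items()]

def briefcase_stage(resource):
    """Toy last-resort extractor: describe the resource by its MIME type."""
    if resource.get("mime"):
        return [(resource["url"], DC + "format", resource["mime"])]
    return []

def run_pipeline(resource, stages, last_resort):
    """Return triples for a resource, trying stages in order."""
    if resource.get("rdf"):
        return list(resource["rdf"])         # RDF was served directly: return it
    graph = []                               # the temporary RDF graph
    for extract in stages:
        graph.extend(extract(resource))      # every stage may contribute triples
    if not graph:
        graph.extend(last_resort(resource))  # nothing so far: try the fallback
    return graph                             # may still be empty

page = {"url": "http://example.org/", "mime": "text/html", "meta": {"title": "Home"}}
print(run_pipeline(page, [header_metadata_stage], briefcase_stage))
```

The point of the model is the shape of the flow: an early return when RDF is served directly, accumulation into one temporary graph otherwise, and a fallback that runs only when that graph is still empty.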
Mapping to Ontologies
RDF generation is done on the fly either using built-in XSLT processors, or in the case of GRDDL, the associated XSLT (exposed via Profile URIs) and local or remote XSLT processors. The RDF generation performed by the Mapping Pipeline is based on an internal mapping table which associates the source data's type with schemas and ontologies. This mapping will vary depending on whether you are using Virtuoso with or without the ODS layer. If the ODS application layer (meaning the ODS-Framework and the ODS-Briefcase Data Space application at the very least) is present, the Sponger performs additional mapping using SIOC, SKOS, FOAF, AtomOWL, Annotea bookmarks, Annotea annotations, EXIF, and other ontologies depending on the source data.
The number of ontologies handled by the Sponger is being increased constantly. To identify which ontologies are supported, view the Conductor's RDF Cartridges configuration panel as described later. For details of how to determine the full ontology set supported by Briefcase, refer to Appendix A.
SIOC as a Data Space Glue Ontology
ODS has its own built-in cartridges for the SIOC ontology which it uses as a data space "glue" ontology. SIOC provides a generic data model of containers, items, item types, and associations between items. The actual classes defined by SIOC include: User, UserGroup, Role, Site, Forum and Post. A separate SIOC types module (sioc-t) extends the SIOC Core ontology by defining additional super-classes, sub-classes and sub-properties to the original SIOC terms. Subclasses include: AddressBook, BookmarkFolder, Briefcase, EventCalendar, ImageGallery, Wiki, Weblog, BlogPost, plus many others. Within this generic model, SIOC permits the use of other ontologies (FOAF etc.) in describing attributes of SIOC entities that provide sound conceptual partitioning of data spaces that expose RDF Linked Data. Super-classes include: Container (a generic container of Items) and Space (Data Spaces). Thus, it's safe to say that SIOC delivers a generic wrapper, or "glue", ontology for integrating structured RDF data from a myriad of heterogeneous web accessible data sources.
All the data containers (briefcases, blogs, wikis, discussions etc.) maintained by the various ODS application realms (Data Spaces) describe and expose their data as SIOC instance data. The ODS SIOC Reference Guide details the SIOC mappings for each ODS application component (ODS-Framework, ODS-Weblog, ODS-Briefcase, ODS-Feed-Manager, ODS-Wiki, ODS-Mail, ODS-Calendar, ODS-Bookmark-Manager, ODS-Gallery, ODS-Polls, ODS-Addressbook, ODS-Discussion and ODS-Community). Example SPARQL queries for interacting with the SIOC instance data are also shown. In the context of the Sponger, the SIOC mappings used by ODS-Briefcase are some of the most powerful aspects of ODS as a whole (i.e. delivering a platform independent and web architecture based variant of Mac OS X's Spotlight functionality).
Proxy Service Caching
When the Proxy Service is invoked by a user agent, the Sponger caches the imported data in temporary Virtuoso storage. The cache's invalidation rules conform to those of traditional Web browsers. The data expiration time is determined based on subsequent data fetches of the same resource. The first data retrieval records the 'expires' header. On subsequent fetches, the current time is compared to the expiration time stored in the local cache. If HTTP 'expires' header data isn't returned by the source data server, the Sponger will derive its own expiration time by evaluating the 'date' and 'last-modified' HTTP headers. The cache can be forcefully cleared using the SPARQL extensions get:soft "replace" or get:soft "replacing", as described earlier in the section "SPARQL Extensions for IRI Dereferencing of FROM Clauses".
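The expiry decision can be sketched as follows. This Python model is an approximation under explicit assumptions: `expiry_time` is a hypothetical function, and the source does not state how the 'date' and 'last-modified' headers are combined, so the sketch borrows the common HTTP caching heuristic of one tenth of the time since last modification. The optional cap reflects the earlier statement that get:refresh can reduce, but never increase, the calculated lifetime.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

def expiry_time(headers, fetched_at, refresh_seconds=None, heuristic_divisor=10):
    """Estimate when a cached resource should be considered stale."""
    expires = None
    if "Expires" in headers:
        expires = parsedate_to_datetime(headers["Expires"])
    elif "Date" in headers and "Last-Modified" in headers:
        # Assumed heuristic: fresh for a fraction of the resource's current age.
        date = parsedate_to_datetime(headers["Date"])
        modified = parsedate_to_datetime(headers["Last-Modified"])
        expires = date + (date - modified) / heuristic_divisor
    if refresh_seconds is not None:
        cap = fetched_at + timedelta(seconds=refresh_seconds)
        expires = cap if expires is None else min(expires, cap)
    return expires

fetched = datetime(2015, 10, 21, 7, 0, tzinfo=timezone.utc)
print(expiry_time({"Expires": "Wed, 21 Oct 2015 08:00:00 GMT"}, fetched))
print(expiry_time({"Expires": "Wed, 21 Oct 2015 08:00:00 GMT"}, fetched, refresh_seconds=600))
```

A stored entry would then be reloaded once the current time passes the returned value.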
Sponger Architecture
As described earlier and illustrated below, the Sponger is comprised of cartridges which are themselves comprised of metadata extractors and ontology mappers.
A cartridge is invoked through its cartridge hook, a Virtuoso PL procedure entry point and binding to the cartridge's metadata extractor and ontology mapper.
Metadata Extractors
Metadata extractors perform the initial data extraction operations against data sources that include: (X)HTML documents, XML based syndication formats (RSS, Atom, OPML, OCS etc.), binary files, REST style Web services and Microformats (non GRDDL, GRDDL, eRDF, and RDFa). Each metadata extractor is aligned to at least one ontology mapper.
Metadata extractors are built using Virtuoso PL, C/C++, Java or any other external language supported by Virtuoso's Server Extension API. Of course, Virtuoso's own metadata extractors are written in Virtuoso PL. Third party extractors can be harnessed through the external language support, examples being XMP and Spotlight (both C/C++ based), Aperture (Java based), and SIMILE RDFizers (also Java based).
Ontology Mappers
Sponger ontology mappers perform the task of generating RDF instance data from extracted metadata (non-RDF) using ontologies associated with a given data source type. They are typically XSLT (using GRDDL or an in-built Virtuoso mapping scheme) or Virtuoso PL based. Virtuoso comes pre-configured with a large range of ontology mappers contained in one or more Sponger cartridges. Nevertheless, you are free to create and add your own cartridges, ontology mappers, or metadata extractors.
Figure 3: Sponger architecture
Below is an extract from the stylesheet /DAV/VAD/rdf_cartridges/xslt/flickr2rdf.xsl, used for extracting metadata from Flickr images. Here, the template combines RDF metadata extraction and ontology mapping based on the FOAF and Dublin Core ontologies.
<xsl:template match="owner">
<rdf:Description rdf:nodeID="person">
<rdf:type rdf:resource="http://xmlns.com/foaf/0.1/#Person" />
<xsl:if test="@realname != ''">
<foaf:name><xsl:value-of select="@realname"/></foaf:name>
</xsl:if>
<foaf:nick><xsl:value-of select="@username"/></foaf:nick>
</rdf:Description>
</xsl:template>
<xsl:template match="photo">
<rdf:Description rdf:about="{$baseUri}">
<rdf:type rdf:resource="http://www.w3.org/2003/12/exif/ns/IFD"/>
<xsl:variable name="lic" select="@license"/>
<dc:creator rdf:nodeID="person" />
...
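To make the mapping step concrete outside XSLT, the logic of the "owner" template can be restated in Python. This is an illustrative restatement, not code shipped with Virtuoso: `map_owner` is a hypothetical function, the owner is assumed to arrive as a dictionary of the attributes the template reads, and the sketch builds all three terms from the FOAF namespace IRI.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def map_owner(owner, node="_:person"):
    """Map extracted owner attributes to FOAF triples about a blank node."""
    triples = [(node, RDF_TYPE, FOAF + "Person")]
    if owner.get("realname"):                      # mirrors test="@realname != ''"
        triples.append((node, FOAF + "name", owner["realname"]))
    triples.append((node, FOAF + "nick", owner.get("username", "")))
    return triples

for triple in map_owner({"realname": "Jane Doe", "username": "jdoe"}):
    print(triple)
```

The extraction half (reading the attributes from the Flickr response) and the mapping half (choosing FOAF terms for them) are separated here, whereas the stylesheet combines them in one template.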
Cartridge Registry
Once a Sponger cartridge has been developed it must be plugged into the SPARQL engine by registering it in the Cartridge Registry, i.e. by adding a record in the table DB.DBA.SYS_RDF_CARTRIDGES, either manually via DML, or more easily through Conductor (Virtuoso's browser-based administration console), which provides a UI for adding your own cartridges. Sponger configuration using Conductor is described in detail later. For the moment, we'll focus on outlining the broad architecture of the Sponger.
The SYS_RDF_CARTRIDGES table definition is as follows:
create table DB.DBA.SYS_RDF_CARTRIDGES (
RC_ID integer identity, -- cartridge ID, designates order of execution
RC_PATTERN varchar, -- a REGEX pattern to match URL or MIME type
RC_TYPE varchar default 'MIME', -- what property of the current resource to match: MIME or URL are supported at present
RC_HOOK varchar, -- fully qualified PL function name, e.g. DB.DBA.MY_CARTRIDGE_FUNCTION
RC_KEY long varchar, -- API specific key to use
RC_DESCRIPTION long varchar, -- cartridge description, free text
RC_ENABLED integer default 1, -- flag (1 or 0) to include or exclude the given cartridge from the processing chain
primary key (RC_TYPE, RC_PATTERN)
);
Cartridge Invocation
The Virtuoso SPARQL processor supports IRI dereferencing via the Sponger. Thus, if the SPARQL query contains references to non-default graph URIs the Sponger goes out (via HTTP) to grab the RDF data sources exposed by the data source URIs and then places them into local storage (as Default or Named Graphs depending on the SPARQL query). Since SPARQL is RDF based, it can only process RDF-based structured data, serialized using RDF/XML, Turtle or N3 formats. As a result, when the SPARQL processor encounters a non-RDF data source, a call to the Sponger is triggered. The Sponger then locates the appropriate cartridge for the data source type in question, resulting in the production of SPARQL-palatable RDF instance data. If none of the registered cartridges are capable of handling the received content type, the Sponger will attempt to obtain RDF instance data via the in-built WebDAV metadata extractor.
Sponger cartridges are invoked during the aforementioned pipeline as follows:
When the SPARQL processor dereferences a URI, it plays the role of an HTTP user agent (client) that makes a content type specific request to an HTTP server via the HTTP request's Accept headers. The following then occurs:
- If the content type returned is RDF then no further transformation is needed and the process stops. For instance, when consuming an (X)HTML document with a GRDDL profile, the profile URI points to a data provider that simply returns RDF instance data.
- If the content type is not RDF (i.e. neither application/rdf+xml nor text/rdf+n3), for instance 'text/plain', the Sponger looks in the Cartridge Registry, iterating over every record for which the RC_ENABLED flag is true, with the look-up sequence ordered on the RC_ID column values. For each record, the processor tries matching the content type or URL against the RC_PATTERN value and, if there is a match, the function specified in the RC_HOOK column is called. If the function doesn't exist, or signals an error, the SPARQL processor looks at the next record.
- If the hook returns zero, the next cartridge is tried. (A cartridge function can return zero if it believes a subsequent cartridge in the chain is capable of extracting more RDF data.)
- If the result returned by the hook is negative, the Sponger is instructed that no RDF was generated and the process stops.
- If the hook result is positive, the Sponger is informed that structured data was retrieved and the process stops.
- If none of the cartridges match the source data signature (content type or URL), the built-in WebDAV metadata extractor's RDF generator is called.
Figure 4: Sponger cartridge invocation flowchart
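The invocation steps above can be modelled in a few lines of Python. This is only an illustrative sketch, not Virtuoso code: the real Sponger is implemented inside the server, and the names Cartridge, run_chain and fallback are hypothetical. It also assumes that the built-in extractor is reached both when no pattern matches and when every matching hook returns zero; the text above only states the first case explicitly.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable

# A hook receives (url, content) and returns an int, mirroring the
# zero / negative / positive convention described in the text.
Hook = Callable[[str, bytes], int]


@dataclass
class Cartridge:
    rc_id: int          # order of execution
    rc_pattern: str     # regex matched against the MIME type or the URL
    rc_type: str        # 'MIME' or 'URL'
    rc_hook: Hook       # stand-in for the PL function named in RC_HOOK
    rc_enabled: int = 1


def run_chain(registry: Iterable[Cartridge], url: str, mime: str,
              content: bytes, fallback: Callable[[str, bytes], str]) -> str:
    """Walk the registry in RC_ID order and apply the hook return rules."""
    for cart in sorted(registry, key=lambda c: c.rc_id):
        if not cart.rc_enabled:
            continue                      # disabled cartridges are skipped
        target = mime if cart.rc_type == "MIME" else url
        if re.search(cart.rc_pattern, target) is None:
            continue                      # pattern does not match this resource
        try:
            result = cart.rc_hook(url, content)
        except Exception:
            continue                      # hook signalled an error: try next record
        if result == 0:
            continue                      # hook defers to later cartridges
        # positive: structured data retrieved; negative: no RDF, stop anyway
        return "rdf" if result > 0 else "no-rdf"
    # Assumption: nothing matched or every hook deferred, so use the fallback
    return fallback(url, content)
```

Because ordering comes only from rc_id, changing a cartridge's position in the chain in this model amounts to changing that one value, which corresponds to what the RC_ID column does in the registry.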
Sponger Configuration Using Conductor
The Virtuoso Conductor provides a graphical UI for most Virtuoso administration tasks, including interfaces for managing Sponger Cartridges.
Cartridge Packaging & Deployment
The VAD (Virtuoso Application Distribution) package rdf_cartridges_dav bundles a variety of pre-built cartridges for generating RDF instance data from a large range of popular Web resources and file types. Appendix B provides full details of the VAD's contents. The cartridges installed by the VAD can be viewed and configured through Conductor's RDF Cartridges pane, shown below.
Figure 5: Conductor's RDF Cartridges pane
Earlier we outlined the structured data generation pipeline in which the search sequence for possible sources of metadata is controlled by the RDF cartridge ordering. This ordering can be configured through the Conductor UI, as shown. The order in which cartridges are tried is reflected in the 'Seq#' values.
Among the various entry fields are fields for the cartridge hook function and the URL/MIME-type pattern, corresponding to the RC_HOOK and RC_PATTERN columns of the SYS_RDF_CARTRIDGES table.
Figure 6: Flickr cartridge configuration settings
XSLT Templates
The RDF Cartridges VAD package includes a number of XSLT templates, all located in the folder /DAV/VAD/rdf_cartridges/xslt/. All the available templates can be viewed through Virtuoso's WebDAV browser, as illustrated below.
Figure 7: RDF Cartridges VAD package - XSLT templates
GRDDL Mappings
Some of the XSLT templates contained in /DAV/VAD/rdf_cartridges/xslt/ are GRDDL filters. The GRDDL filters can be configured through the GRDDL Mappings panel in Conductor, shown below. The URI for stylesheets stored in a Virtuoso WebDAV repository takes the form
virt://WS.WS.SYS_DAV_RES.RES_FULL_PATH.RES_CONTENT:<WebDAV path>.
Figure 8: RDF Cartridges VAD package - GRDDL filters
Custom Cartridges
The Sponger is fully extensible by virtue of its Cartridge plug-in architecture. New data formats can be sponged by creating new cartridges. While OpenLink is actively adding cartridges for new data sources, you are obviously free to develop your own custom cartridges. To this end, details of the cartridge hook and example cartridge implementations are presented below.
Cartridge Hook Prototype
Every Virtuoso PL hook function used to plug a custom Sponger cartridge into the Virtuoso SPARQL engine must have a parameter list with the following parameters (the names of the parameters are not important, but their order and presence are) :
- in graph_iri varchar: the graph IRI which is currently retrieved
- in new_origin_uri varchar: the URL of the document retrieved
- in destination varchar: the destination graph IRI
- inout content any: the content of the document retrieved by Sponger
- inout async_queue any: if the PingService initialization parameter has been configured in the [SPARQL] section of the virtuoso.ini file, this is a pre-allocated asynchronous queue to be used to call the PingTheSemanticWeb notification service
- inout ping_service any: the URL of a ping service, as assigned to the PingService parameter in the [SPARQL] section of the virtuoso.ini configuration file. This argument could be used to notify the PingTheSemanticWeb notification service
- inout api_key any: a string value specific to a given cartridge, contained in the RC_KEY column of the DB.DBA.SYS_RDF_CARTRIDGES table. The value can be a single string or a serialized array of strings providing cartridge specific data.
Example Cartridge Implementations
Basic Sponger Cartridge
In our first example (which is available in the form of an on-line tutorial) we implement a basic cartridge, which maps the MIME type text/plain to an imaginary ontology which extends the class Document from FOAF with properties 'txt:UniqueWords' and 'txt:Chars', where the prefix 'txt:' is specified as 'urn:txt:v0.0:'.
use DB;
create procedure DB.DBA.RDF_LOAD_TXT_META
(
in graph_iri varchar,
in new_origin_uri varchar,
in dest varchar,
inout ret_body any,
inout aq any,
inout ps any,
inout ser_key any
)
{
declare words, chars int;
declare vtb, arr, subj, ses, str any;
-- if any error we just say nothing can be done
declare exit handler for sqlstate '*'
{
return 0;
};
subj := coalesce (dest, new_origin_uri);
vtb := vt_batch ();
chars := length (ret_body);
-- using the text index procedures we get a list of words
vt_batch_feed (vtb, ret_body, 1);
arr := vt_batch_strings_array (vtb);
-- the list has 'word' and positions array, so we must divide by 2
words := length (arr) / 2;
ses := string_output ();
-- we compose a N3 literal
http (sprintf ('<%s> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Document> .\n', subj), ses);
http (sprintf ('<%s> <urn:txt:v0.0:UniqueWords> "%d" .\n', subj, words), ses);
http (sprintf ('<%s> <urn:txt:v0.0:Chars> "%d" .\n', subj, chars), ses);
str := string_output_string (ses);
-- we push the N3 text into the local store
DB.DBA.TTLP (str, new_origin_uri, subj);
return 1;
};
delete from DB.DBA.SYS_RDF_CARTRIDGES where RC_HOOK = 'DB.DBA.RDF_LOAD_TXT_META';
insert soft DB.DBA.SYS_RDF_CARTRIDGES (RC_PATTERN, RC_TYPE, RC_HOOK, RC_KEY, RC_DESCRIPTION) values ('(text/plain)', 'MIME', 'DB.DBA.RDF_LOAD_TXT_META', null, 'Text Files (demo)');
-- here we set the order to some large number so we don't break existing cartridges
update DB.DBA.SYS_RDF_CARTRIDGES set RC_ID = 2000 where RC_HOOK = 'DB.DBA.RDF_LOAD_TXT_META';
To test the cartridge you can use the /sparql endpoint, with the option 'Retrieve remote RDF data for all missing source graphs' set, to execute:
select * from <URL-of-a-txt-file> where { ?s ?p ?o }
Notice in this example the use of DB.DBA.TTLP() to load the extracted structured data into the Virtuoso Quad Store. This RDF data import function parses TTL (Turtle or N3) and inserts the triples into the table DB.DBA.RDF_QUAD, one of the key tables underpinning the Quad Store. For further details of Virtuoso's RDF and SPARQL API, please refer to the OpenLink Virtuoso Reference Manual.
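To make the shape of the generated data concrete, the following Python sketch mirrors what the cartridge above emits for a given document body. It is a model only: the function name text_to_n3 is hypothetical, and the word splitting (a simple regular expression, case-insensitive) is an assumption that will not match the tokenization performed by Virtuoso's vt_batch text index procedures exactly.

```python
import re

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_DOCUMENT = "http://xmlns.com/foaf/0.1/Document"
TXT_NS = "urn:txt:v0.0:"  # the 'txt:' prefix used by the demo ontology


def text_to_n3(subject: str, body: str) -> str:
    """Return the three N3 statements the demo cartridge would produce."""
    # Assumption: a "word" is a run of word characters, compared case-insensitively.
    unique_words = {w.lower() for w in re.findall(r"\w+", body)}
    lines = [
        f"<{subject}> <{RDF_TYPE}> <{FOAF_DOCUMENT}> .",
        f'<{subject}> <{TXT_NS}UniqueWords> "{len(unique_words)}" .',
        f'<{subject}> <{TXT_NS}Chars> "{len(body)}" .',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical subject URL, used for illustration only.
    print(text_to_n3("http://example.org/sample.txt", "the cat and the hat"))
```

As in the Virtuoso/PL version, the counts are emitted as plain literals; a production cartridge might prefer typed literals, which is outside the scope of this sketch.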
Flickr Cartridge
The next example shows the Virtuoso/PL procedure RDF_LOAD_FLICKR_IMG at the heart of Virtuoso's Flickr Sponger cartridge:
--no_c_escapes-
create procedure DB.DBA.RDF_LOAD_FLICKR_IMG (
in graph_iri varchar, in new_origin_uri varchar, in dest varchar,
inout _ret_body any, inout aq any, inout ps any, inout _key any,
inout opts any)
{
declare xd, xt, url, tmp, api_key, img_id, hdr, exif any;
declare exit handler for sqlstate '*'
{
return 0;
};
tmp := sprintf_inverse (new_origin_uri,
'http://farm%s.static.flickr.com/%s/%s_%s.%s', 0);
img_id := tmp[2];
api_key := _key;
--cfg_item_value (virtuoso_ini_path (), 'SPARQL', 'FlickrAPIkey');
if (tmp is null or length (tmp) <> 5 or not isstring (api_key))
return 0;
url := sprintf
('http://api.flickr.com/services/rest/?method=flickr.photos.getInfo&photo_id=%s&api_key=%s', img_id, api_key);
tmp := http_get (url, hdr);
if (hdr[0] not like 'HTTP/1._ 200 %')
signal ('22023', trim(hdr[0], '\r\n'), 'RDFXX');
xd := xtree_doc (tmp);
exif := xtree_doc ('<rsp/>');
{
declare exit handler for sqlstate '*' { goto ende; };
url := sprintf ('http://api.flickr.com/services/rest/?method=flickr.photos.getExif&photo_id=%s&api_key=%s', img_id, api_key);
tmp := http_get (url, hdr);
if (hdr[0] like 'HTTP/1._ 200 %')
exif := xtree_doc (tmp);
ende:;
}
xt := xslt (registry_get ('_rdf_cartridges_path_') || 'xslt/flickr2rdf.xsl', xd,
vector ('baseUri', coalesce (dest, graph_iri), 'exif', exif));
xd := serialize_to_UTF8_xml (xt);
DB.DBA.RDF_LOAD_RDFXML (xd, new_origin_uri, coalesce (dest, graph_iri));
return 1;
}
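Here the image ID is first extracted from the image URL using sprintf_inverse(). The http_get() function then calls the Flickr REST API (method flickr.photos.getInfo) to retrieve the XML description associated with the specified image, which is parsed into an in-memory XML parse tree by xtree_doc(). A second call (method flickr.photos.getExif) fetches EXIF data where available; if it fails, an empty <rsp/> document is used instead. Using the xslt() function with the stylesheet flickr2rdf.xsl, the XML entity is transformed into RDF/XML, which is in turn parsed by RDF_LOAD_RDFXML() and the extracted triples loaded into the Virtuoso Quad Store.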
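The URL handling at the start of the procedure can be illustrated outside Virtuoso. The Python sketch below models the sprintf_inverse() step with a regular expression and rebuilds the getInfo request URL used by the cartridge. The function names are hypothetical, and the regular expression is an approximation of sprintf_inverse() semantics rather than an exact equivalent.

```python
import re
from typing import Optional
from urllib.parse import urlencode

# Same shape as the 'http://farm%s.static.flickr.com/%s/%s_%s.%s' format string:
# farm number, server, photo id, secret, file extension.
FLICKR_IMG = re.compile(
    r"^http://farm([^.]+)\.static\.flickr\.com/([^/]+)/([^_]+)_([^.]+)\.(.+)$"
)


def flickr_photo_id(url: str) -> Optional[str]:
    """Return the photo id (third field, tmp[2] in the PL code) or None."""
    match = FLICKR_IMG.match(url)
    return match.group(3) if match else None


def get_info_url(photo_id: str, api_key: str) -> str:
    """Build the flickr.photos.getInfo REST URL used by the cartridge."""
    query = urlencode({
        "method": "flickr.photos.getInfo",
        "photo_id": photo_id,
        "api_key": api_key,
    })
    return "http://api.flickr.com/services/rest/?" + query
```

A URL that does not fit the pattern yields None, which corresponds to the point where the PL procedure returns 0 and lets the next cartridge in the chain try.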
Sponger Permissions
In order to allow the Sponger to update the local RDF quad store with triples constituting the sponged structured data, the role "SPARQL_UPDATE" must be granted to the account "SPARQL". This should normally be the case. If not, you must manually grant this permission. As with most Virtuoso DBA tasks, the Conductor provides the simplest means of doing this.
Custom Resolvers
The Sponger supports plug-in "Custom Resolver" cartridges in order to support the dereferencing of other forms of URIs besides HTTP URLs, such as URN schemes. Examples include custom resolvers for the handle-based DOI naming scheme, the URN naming scheme, and the URN-based LSID scheme.
By supporting alternate resolvers the range of data sources which can be linked into the Semantic Data-Web is extended enormously. The LSID resolver enables URN-based resources to be accessible as linked data. Similarly, the DOI resolver permits the huge collection of DOI-based data sources to be linked into the Web of Linked Data (Data Web).
An example SPARQL query dereferencing a URN-based URI is shown below:
http://demo.openlinksw.com/sparql?default-graph-uri=urn:lsid:ubio.org:namebank:11815&should-sponge=soft&query=SELECT+*+WHERE+{?s+?p+?o}&format=text/html&debug=on
As one would expect, the RDF Proxy Service also recognizes URNs, e.g.:
http://demo.openlinksw.com/proxy/rdf/urn:lsid:ubio.org:namebank:11815
or, equivalently:
http://demo.openlinksw.com/proxy/?url=urn:lsid:ubio.org:namebank:11815&force=rdf
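For client code that needs to construct such request URLs programmatically, a minimal Python sketch follows. The function names are hypothetical; the parameter names (default-graph-uri, should-sponge, query, format) and the /proxy/rdf/ path are taken from the example URLs above. Note that urlencode percent-encodes reserved characters, so the result looks different from the hand-written URL while carrying the same parameter values.

```python
from urllib.parse import urlencode


def sparql_sponge_url(endpoint: str, graph_uri: str,
                      query: str = "SELECT * WHERE {?s ?p ?o}") -> str:
    """Build a SPARQL endpoint URL that asks for the graph to be sponged."""
    params = {
        "default-graph-uri": graph_uri,
        "should-sponge": "soft",
        "query": query,
        "format": "text/html",
    }
    return endpoint + "?" + urlencode(params)


def proxy_url(host: str, resource_uri: str) -> str:
    """Build an RDF Proxy Service URL of the /proxy/rdf/<URI> form."""
    return f"http://{host}/proxy/rdf/{resource_uri}"


if __name__ == "__main__":
    lsid = "urn:lsid:ubio.org:namebank:11815"
    print(sparql_sponge_url("http://demo.openlinksw.com/sparql", lsid))
    print(proxy_url("demo.openlinksw.com", lsid))
```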
Sponger Usage Examples
RDF Proxy Service
The file http://ode.openlinksw.com/example.html contains examples for the OpenLink Data Explorer extension. This XHTML file contains RDF embedded as RDFa. Running the file through the Sponger via Virtuoso's RDF proxy service extracts the embedded RDFa data as pure RDF, as can be seen by pasting the URL
http://demo.openlinksw.com/proxy/rdf/http://ode.openlinksw.com/example.html
into a Web browser and then viewing the resulting page source. Though this example demonstrates the action of the /proxy/rdf/ service quite transparently, it is a basic and unwieldy way to view sponged RDF data. OpenLink Data Explorer provides a more polished means to the same end. Indeed, the OpenLink Data Explorer makes use of the same proxy service.
SPARQL Processor
As an alternative to using the RDF proxy service, we can sponge directly from within the SPARQL processor. After logging into Virtuoso's Conductor interface, the following query can be issued from the Interactive SQL (iSQL) panel:
sparql
define get:uri "http://www.ivan-herman.net/foaf.html"
define get:soft "soft"
select * from <http://mygraph> where {?s ?p ?o}
Here the sparql keyword invokes the SPARQL processor from the SQL interface and the RDF data sponged from page http://www.ivan-herman.net/foaf.html is loaded into the local RDF quad store as graph http://mygraph.
The new graph can then be queried using the basic SPARQL client normally available in a default Virtuoso installation at http://localhost:8890/sparql/. e.g.:
select * from <http://mygraph> where {?s ?p ?o}
(A much richer interactive SPARQL query builder, iSPARQL, is available as part of the OpenLink AJAX Toolkit (OAT), together with the OpenLink RDF Browser.)
Custom Cartridge
The Virtuoso/PL code for a simple custom cartridge, DB.DBA.RDF_LOAD_TXT_META, was presented earlier. Included in the code was the SQL required to register the cartridge in the Cartridge Registry. Paste the whole of this code into Conductor's iSQL interface and execute it to define and register the cartridge.
Create a simple text document with a .txt extension. This must now be made Web accessible. A simple way to do this is to expose it as a WebDAV resource using Virtuoso's built-in WebDAV support. Log in to Virtuoso's ODS Briefcase application, navigate to your Public folder and upload your text document, ensuring that the file extension is .txt, the MIME type is set to text/plain and the file permissions are rw-r--r--. If, for the purposes of this example, you logged into a local default Virtuoso instance as user 'dba' and uploaded a file named 'ODS_sponger_test.txt', the file would be Web accessible via the URL http://localhost:8890/DAV/home/dba/Public/ODS_sponger_test.txt.
To sponge the document using the RDF_LOAD_TXT_META cartridge, use the basic SPARQL client available at http://localhost:8890/sparql to execute the query
select * from <http://localhost:8890/DAV/home/dba/Public/ODS_sponger_test.txt> where {?s ?p ?o}
with the option 'Retrieve remote RDF data for all missing source graphs' set. The returned result set should look something like:
| s | p | o |
| http://localhost:8890/DAV/home/dba/Public/ODS_sponger_test.txt | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://xmlns.com/foaf/0.1/Document |
| http://localhost:8890/DAV/home/dba/Public/ODS_sponger_test.txt | urn:txt:v0.0:UniqueWords | 7 |
| http://localhost:8890/DAV/home/dba/Public/ODS_sponger_test.txt | urn:txt:v0.0:Chars | 44 |
Appendix A: Ontologies Supported by ODS-Briefcase
The full range of ontologies and mappings supported by the ODS-Briefcase metadata extractor is reflected in the contents of the Virtuoso directory DAV/VAD/oDrive/schemas/ (e.g. for a local Virtuoso instance, this would be http://localhost:8890/DAV/VAD/oDrive/schemas/).
The schema directory is browsed easily using the Conductor WebDAV Browser.
The schema files packaged in Briefcase cover both standard and custom ontologies. The standard ontologies include FOAF, OpenDocument, RSS, XBEL, Apple Spotlight and vCard. Others are proprietary OpenLink ontologies for describing file types and content.
Below is a partial listing of one of these files, Office.rdf, which defines the proprietary Office ontology used by Virtuoso for mapping Microsoft Office documents to RDF structured data.
<?xml version="1.0" encoding="UTF-8"?>
<!--
-
- $Id: Office.rdf,v 1.4 2007/05/10 08:51:53 ddimitrov Exp $
-
- This file is part of the OpenLink Software Virtuoso Open-Source (VOS)
- project.
-
- Copyright (C) 1998-2007 OpenLink Software
-
...
-
-->
<rdf:RDF xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:owl ="http://www.w3.org/2002/07/owl#"
xmlns:virtrdf="http://www.openlinksw.com/schemas/virtrdf#"
xml:base="http://www.openlinksw.com/schemas/Office#">
<owl:Ontology rdf:about="http://www.openlinksw.com/schemas/Office#">
<rdfs:label>Microsoft Office document</rdfs:label>
<rdfs:comment>The Microsoft Office format general attributes.</rdfs:comment>
<virtrdf:catName>Office Documents (Microsoft)</virtrdf:catName>
<virtrdf:version>1.00</virtrdf:version>
</owl:Ontology>
<rdf:Property rdf:about="http://www.openlinksw.com/schemas/Office#TypeDescr">
<rdfs:Range rdf:resource="http://www.w3.org/2001/XMLSchema#string"/>
<virtrdf:cardinality>single</virtrdf:cardinality>
<virtrdf:label>Document Type</virtrdf:label>
<virtrdf:catName>Document Type</virtrdf:catName>
</rdf:Property>
<rdf:Property rdf:about="http://www.openlinksw.com/schemas/Office#Author">
<rdfs:Range rdf:resource="http://www.w3.org/2001/XMLSchema#string"/>
<virtrdf:cardinality>single</virtrdf:cardinality>
<virtrdf:label>Author</virtrdf:label>
<virtrdf:defaultValue>No name</virtrdf:defaultValue>
<virtrdf:catName>Author</virtrdf:catName>
</rdf:Property>
<rdf:Property rdf:about="http://www.openlinksw.com/schemas/Office#LastAuthor">
<rdfs:Range rdf:resource="http://www.w3.org/2001/XMLSchema#string"/>
<virtrdf:cardinality>single</virtrdf:cardinality>
<virtrdf:label>Last Author</virtrdf:label>
<virtrdf:defaultValue>No name</virtrdf:defaultValue>
<virtrdf:catName>Last Author</virtrdf:catName>
</rdf:Property>
<rdf:Property rdf:about="http://www.openlinksw.com/schemas/Office#Company">
<rdfs:Range rdf:resource="http://www.w3.org/2001/XMLSchema#string"/>
<virtrdf:cardinality>single</virtrdf:cardinality>
<virtrdf:label>Company</virtrdf:label>
<virtrdf:defaultValue>No name</virtrdf:defaultValue>
<virtrdf:catName>Company</virtrdf:catName>
</rdf:Property>
<rdf:Property rdf:about="http://www.openlinksw.com/schemas/Office#Words">
<rdfs:Range rdf:resource="http://www.w3.org/2001/XMLSchema#integer"/>
<virtrdf:cardinality>single</virtrdf:cardinality>
<virtrdf:label>Word Count</virtrdf:label>
</rdf:Property>
...
<rdf:Property rdf:about="http://www.openlinksw.com/schemas/Office#Created">
<rdfs:Range rdf:resource="http://www.w3.org/2001/XMLSchema#dateTime"/>
<virtrdf:cardinality>single</virtrdf:cardinality>
<virtrdf:label>Date Created</virtrdf:label>
<virtrdf:catName>Created</virtrdf:catName>
</rdf:Property>
</rdf:RDF>
Appendix B: RDF Cartridges VAD Package
Virtuoso supplies cartridges for extracting RDF data from certain popular Web resources and file types in the form of the VAD package rdf_cartridges_dav. If not already present, it can be installed using Conductor or the VAD_INSTALL function. Please refer to the Virtuoso Reference Manual for detailed information on VADs and VAD management.
Details of each of the cartridges contained in the RDF Cartridges VAD are given below.
HTTP in RDF
A cartridge for mapping HTTP request and response messages to the HTTP vocabulary expressed in RDF, as defined by http://www.w3.org/2006/http.rdfs .
This cartridge is disabled by default. If it is enabled, it must be first in the order of execution. The cartridge hook function always returns 0, allowing other cartridges to return additional RDF instance data.
XHTML and Feeds
This is a composite cartridge for discovering metadata embedded in HTML pages in a variety of forms. The cartridge looks for RDF data in the order listed below:
- Embedded/linked RDF
- Scans for metadata in a linked RDF document identified by a link element, e.g. <link rel="meta" type="application/rdf+xml" href="..."> (See Using RDF/XML with HTML and XHTML )
- RDF embedded in xHTML (as markup or inside XML comments)
- Micro-formats
- GRDDL / GRDDL Data Views - http://www.w3.org/2003/g/data-view
- RDFa
- hCard - XMDP profile: http://www.w3.org/2006/03/hcard
- hCalendar - XMDP profile: http://dannyayers.com/microformats/hcalendar-profile
- hReview - XMDP profile: http://dannyayers.com/micromodels/profiles/hreview
- relLicense - Creative Commons (CC) license schema: http://web.resource.org/cc/schema.rdf
- Dublin Core (DCMI) - Vocabulary definition: http://purl.org/dc/elements/1.1/
- geoURL - Vocabulary definition: http://www.w3.org/2003/01/geo/wgs84_pos#
- Google Base - OpenLink Virtuoso specific mapping
- Ning Metadata
- Feeds extraction
- RSS/RDF - SIOC & AtomOWL
- RSS 1.0 - RSS/RDF, SIOC & AtomOWL
- Atom 1.0 - RSS/RDF, SIOC & AtomOWL
- RSS/RDF: http://purl.org/rss/1.0/
- SIOC - Vocabulary definition: http://rdfs.org/sioc/ns#
- AtomOWL - Vocabulary definition: http://atomowl.org/ontologies/atomrdf.rdf
- xHTML metadata transformation using FOAF (foaf:Document) and Dublin Core properties (dc:title, dc:subject etc.)
- FOAF - Vocabulary definition: http://xmlns.com/foaf/0.1/
Flickr Images / URLs
A Sponger cartridge for Flickr images, using the Flickr REST API. To function properly it must have a configured key. The Flickr cartridge generates RDF instance data using: CC license, Dublin Core, Dublin Core Metadata Terms, GeoURL, FOAF and EXIF (ontology definition: http://www.w3.org/2003/12/exif/ns/).
Amazon Articles / URLs
A cartridge for Amazon articles using the Amazon REST API. It needs an Amazon API key in order to be functional.
eBay Articles / URLs
Implements eBay's REST API in order to generate RDF from eBay articles. It needs a key and user name to be configured in order to work.
OpenOffice Documents
OpenOffice documents contain metadata which can be extracted using UNZIP, so this cartridge needs the Virtuoso UNZIP plugin to be configured on the server. (Each OpenOffice file is actually a collection of XML documents stored in a ZIP archive.)
Yahoo Traffic Data URLs
Transforms Yahoo traffic data to RDF.
iCalendar Files
Transforms iCalendar files to RDF as per http://www.w3.org/2002/12/cal/ical# .
Binary Content, PDF & Powerpoint Files
Unknown binary content, PDF files and MS PowerPoint files can be transformed to RDF using the Aperture framework. This cartridge needs Virtuoso with Java hosting support, the Aperture framework and the MetaExtractor.class installed on the host system in order to work. For details of how to configure the Aperture framework see Appendix C.
Appendix C: Configuring the Aperture Framework
To set up Virtuoso to host and run the Aperture framework, follow these steps:
- Install Virtuoso with Java hosting.
- Download the Aperture framework from http://aperture.sourceforge.net .
- Unpack the framework in the Virtuoso working directory, i.e. the directory containing the database file. Make a symbolic link, 'lib', to the framework.
- Make sure the MetaExtractor.class is in the Virtuoso working directory.
- In the [Parameters] section of the virtuoso.ini configuration file, add the line:
JavaClasspath = lib/sesame-2.0-alpha-3.jar:lib/openrdf-util-crazy-debug.jar:lib/htmlparser-1.6.jar:lib/activation-1.0.2-upd2.jar:lib/bcmail-jdk14-132.jar:lib/poi-scratchpad-3.0-alpha2-20060616.jar:lib/openrdf-model-2.0-alpha-3.jar:lib/jacob-1.10-pre4.jar:lib/bcprov-jdk14-132.jar:lib/demork-2.0.jar:lib/commons-codec.jar:lib/fontbox-0.1.0-dev.jar:lib/pdfbox-0.7.3.jar:lib/applewrapper-0.1.jar:lib/junit-3.8.1.jar:lib/winlaf-0.5.1.jar:lib/aperture-test-2006.1-alpha-3.jar:lib/openrdf-util-fixed-locking.jar:lib/commons-logging-1.1.jar:lib/mail-1.4.jar:lib/aperture-2006.1-alpha-3.jar:lib/poi-3.0-alpha2-20060616.jar:lib/ical4j-cvs20061019.jar:lib/openrdf-util-2.0-alpha-3.jar:lib/rio-2.0-alpha-3.jar:lib/poi-contrib-3.0-alpha2-20060616.jar:lib/aperture-examples-2006.1-alpha-3.jar:.
- Start the Virtuoso server with Java hosting support.
- Connect with Virtuoso's ISQL tool and check the installation is complete by issuing the commands below:
SQL> DB.DBA.import_jar (NULL, 'MetaExtractor', 1);
Done. -- 466 msec.
SQL> select "MetaExtractor"().getMetaFromFile ('some_pdf_in_server_working_dir.pdf', 5);
... some RDF data should be returned ...
Important: The installation guidelines presented above have been verified on Linux with aperture-2006.1-alpha-3. Some adjustment may be needed for different operating systems or versions of Aperture.
Appendix D: Deprecated Naming Conventions
Throughout this document and in the latest Virtuoso releases, the term "cartridge" is used to identify the pluggable Sponger components through which non-RDF data is transformed to RDF, by way of metadata-extraction and ontology-mapping. Earlier releases of Virtuoso used the term "mapper" in place of "cartridge". The table below lists the components affected by this change in nomenclature, indicating the new and old component names.
| Component New Name | Component Old Name | Component Type |
| RDF Cartridges VAD | RDF Mappers VAD | VAD label |
| /DAV/VAD/rdf_cartridges | /DAV/VAD/rdf_mappers | WebDAV path |
| RDF Cartridges | RDF Mappers | Conductor configuration panel |
| rdf_cartridges_dav.vad | rdf_mappers_dav.vad | VAD package |
| _rdf_cartridges_path_ | _rdf_mappers_path_ | Registry entry |
| SYS_RDF_CARTRIDGES | SYS_RDF_MAPPERS | DBMS table |
| RC_xxx | RM_xxx | Column prefix in SYS_RDF_CARTRIDGES / SYS_RDF_MAPPERS table |
Glossary
Aperture - a Java framework for extracting and querying full-text content and metadata from various information systems (file systems, web sites, mail boxes etc) and file formats (documents, images etc).
CNRI Handle System - provides unique persistent identifiers (handles) for Internet resources. It is a general purpose distributed information system providing identifier and resolution services through a namespace and an open set of protocols which allow handles to be resolved into the information necessary to locate, access and use the resources they identify.
Data Spaces - points of presence on the web for accessing structured data gleaned from a variety of heterogeneous data sources.
DOI - a digital object identifier. A location-independent, permanent document or digital resource identifier, based on the CNRI Handle System, which does not change, even if the resource is relocated. DOIs are resolved through the DOI resolver.
eRDF - HTML Embeddable RDF. A technique for embedding a subset of RDF into (X)HTML.
hCard - a microformat for publishing the contact details of people, companies, organizations, and places, in (X)HTML, Atom, RSS, or arbitrary XML.
geoURL - a location-to-URL reverse directory allowing you to find URLs by their proximity to a given location.
GRDDL - Gleaning Resource Descriptions from Dialects of Languages - a mechanism for extracting RDF data from XML and XHTML documents using transformation algorithms typically represented in XSLT. The transformation algorithms may be explicitly associated using a link element in the head of the document, or held in an associated metadata profile document or namespace document.
LSID - a life science identifier. A URN-based identifier for a piece of Web-based biological information. LSIDs occupy one namespace (urn:lsid) in the URN naming scheme.
microformats - markup that allows the expression of semantics in an HTML (or XHTML) web page. Programs can extract meaning from a standard web page that is marked up with microformats.
Ning - an online platform for creating social networks and websites.
ODS - OpenLink Data Spaces. A new generation distributed collaborative application platform for creating Semantic Web presence via Data Spaces derived from: weblogs, wikis, feed aggregators, photo galleries, shared bookmarks, discussion forums, and more.
PingTheSemanticWeb - a repository for RDF documents. You can notify this service that you have created or updated an RDF document on your web site, or you can import a list of recently created or updated RDF documents.
RDF browser - a piece of technology that enables you to browse RDF data sources by traversing data links. The key difference between this approach and traditional browsing is that RDF data links are typed (they possess inherent meaning and context) whereas traditional HTML links are untyped. There are a number of RDF Browsers currently available, including OpenLink's RDF Browser, which is a component of OAT (OpenLink Ajax Toolkit), Tabulator and DISCO.
Spotlight - a file system metadata extraction and search facility in Mac OS X.
structured data - data organized into semantic chunks or entities, with similar entities grouped together in relations or classes. (Michael Bergman provides an in-depth discussion of current Semantic Web terminology, and proposals for bringing more clarity to this area, in his post More Structure, More Terminology and (hopefully) More Clarity.)
URIQA - The URI Query Agent Model. A model for interacting with Semantic Web enabled web servers. It introduces new HTTP methods to indicate to a web server that, for a given resource URI, it should return a concise bounded description of that resource rather than a representation of it.
URN - Uniform Resource Name. A form of Uniform Resource Identifier (URI) which uniquely identifies a resource but which, unlike a Uniform Resource Locator (URL), does not specify its location.
VAD - Virtuoso Application Distribution. A packaging and distribution system for extending Virtuoso. A VAD encapsulates the components of a self-contained Virtuoso application, including table creation, default data, stored procedures, web services, and content. VADs are easily installed through Virtuoso's Conductor browser interface.