Virtuoso Sponger
Extracting RDF Structured Data from Non-RDF Sources
Growing the Semantic Web
- Classic "chicken 'n' egg" problem has impeded the growth of the Semantic Web
- Development of applications for the Semantic Web will remain small-scale without a critical mass of RDF data.
- A critical mass of RDF data won't be achieved without adequate Semantic Web applications and tools.
- A new class of tools is emerging in response to this need: "RDFizers"
- Transform non-RDF data into RDF
- Virtuoso Sponger is one such RDFizer
Virtuoso Sponger
- An RDFizer introduced in Virtuoso 5.0
- Provides built-in RDF middleware for transforming non-RDF data into RDF "on the fly".
- You can use non-RDF data sources as Semantic Web data sources.
- Inputs: Wide variety of non-RDF Web data sources, e.g.:
- (X)HTML Web pages (including hosted microformats)
- Web services (Google, Del.icio.us, Flickr etc.)
- Binary files (MS Office, PDF, OpenDocument etc.)
- Output: RDF structured data
Inputs: Supported Data Sources
- RDF (inc. N3, Turtle)
- SIOC, SKOS, FOAF, AtomOWL, Annotea ...
- (X)HTML pages
- HTML header metadata: Dublin Core
- Microformats: eRDF, RDFa, hCard, hCalendar, XFN, xFolk ...
- Syndication formats
- RSS 2.0, Atom, OPML, OCS, XBEL
- GRDDL
- Web service APIs: Google Base, Flickr, Del.icio.us, Ning ...
- Files:
- Binary files: MS Office, OpenOffice, images, audio, video ...
- Data exchange formats: iCalendar, vCard
- 3rd party metadata extractors: Aperture, Spotlight, SIMILE RDFizers
- or add your own!
Output: Structured Data
In the context of the Semantic Data Web:
"Data organized into semantic chunks or entities, with similar entities grouped together into relations or classes"
Michael Bergman (http://www.mkbergman.com)
Article: "More Structure, More Terminology and (hopefully) More Clarity"
Sponger Benefits
- Majority of the world's data resides in non-RDF form at the current time
- Sponger provides a "Swiss army knife" for RDF structured data generation from non-RDF sources
- Extracting data from non-RDF Web sources and converting it to RDF
- helps "bootstrap" the Semantic Web
- helps drive the transition of the traditional Document-Web into the emerging Semantic Data-Web
- exposes the data in a canonical form for querying and inference
Sponger Architecture
- Sponger is comprised of Sponger Cartridges
- Default cartridge collection is bundled as a Virtuoso VAD
- Cartridge = Metadata Extractor + Ontology Mapper
- Metadata extracted from non-RDF resources is mapped to a suitable ontology by the Ontology Mapper to produce Structured Data
- Sponger is highly customizable
- Custom cartridges can be developed
- Using any language (e.g. Virtuoso PL, C/C++, Java) supported by Virtuoso Server Extensions API
Using The Sponger
Can be invoked in several ways, via:
- Virtuoso SPARQL query processor
- REST API (/about/html/<URI> or /about/rdf/<URI> )
- E.g. http://dbpedia.org/resource/DBpedia
- OpenLink RDF client applications
- ODS-Briefcase (Virtuoso WebDAV)
- Directly through Virtuoso PL
Using the Sponger: SPARQL Query Processor
- Virtuoso extends SPARQL with IRI/URI dereferencing
- Highly distributed nature of Semantic Web makes it highly unlikely all the referenced resources/IRIs will be in the local quad store
- During query execution:
- From a given IRI, remote RDF resources can be downloaded, parsed & the resulting triples stored in local quad store
- IRI dereferencing of FROM clauses
- Downloads & stores triples from named graphs
- IRI dereferencing of SPARQL variables
- Downloads & stores triples based on proximity search from a starting IRI to a given depth (# of hops) via specified predicates
SPARQL Extensions: IRI Dereferencing of FROM Clauses
Enabled through 'define get:...' pragmas
DEFINE get:method "GET"
DEFINE get:soft "soft"
SELECT ?id
FROM NAMED <http://myhost/user1.ttl>
FROM NAMED <http://myhost/user2.ttl>
WHERE { GRAPH ?g { ?id a ?o } };
- get:soft - retrieval mode: "soft" / "replace"
- get:uri - IRI to retrieve if not equal to IRI of FROM clause
- get:method - HTTP "GET" or URIQA "MGET"
- get:refresh - max allowed age (seconds) of cached resource
- can reduce expiry time specified in HTTP headers
- get:proxy - proxy server address if direct download not possible
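A minimal sketch of how a client might prepend these pragmas to a query before submitting it. The helper names (define_line, with_pragmas) are hypothetical, and the quoting rule (strings quoted, integers bare) simply mirrors the slide examples rather than any documented grammar.

```python
def define_line(name, value):
    # Assumption: integers are written bare and strings quoted,
    # mirroring the DEFINE lines shown on the slides.
    if isinstance(value, int):
        return f"DEFINE {name} {value}"
    return f'DEFINE {name} "{value}"'


def with_pragmas(query, pragmas):
    """Return `query` prefixed with one DEFINE line per non-None pragma."""
    lines = [define_line(n, v) for n, v in pragmas.items() if v is not None]
    return "\n".join(lines + [query])


QUERY = """SELECT ?id
FROM NAMED <http://myhost/user1.ttl>
WHERE { GRAPH ?g { ?id a ?o } }"""

sponged = with_pragmas(
    QUERY,
    {"get:soft": "soft", "get:method": "GET", "get:refresh": 3600, "get:proxy": None},
)
print(sponged)
```

Pragmas set to None are simply left out, so the same helper covers queries that only need get:soft.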
SPARQL Extensions: IRI Dereferencing of Variables
Enabled through 'define input:grab-...' pragmas
DEFINE input:grab-var "?more"
DEFINE input:grab-depth 10
DEFINE input:grab-limit 100
DEFINE input:grab-base "http://myhost/"
SELECT ?id ?fullname ?email
WHERE { GRAPH ?g {
?id a <Person> ; <FullName> ?fullname ; <Email> ?email .
OPTIONAL { ?id <SeeAlso> ?more }
} } ;
- input:grab-var - SPARQL variable identifying IRIs to be downloaded
- input:grab-depth - max # of links (predicates) between nodes in graph
- input:grab-limit - max # of resources (subject/object nodes) to retrieve
- input:grab-base - base IRI for converting relative IRIs to absolute
- plus others (grab-seealso, grab-destination ...): see Reference Manual
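To illustrate how a depth bound and a resource limit interact, here is a toy breadth-first walk over an in-memory link table. It is not Virtuoso's actual algorithm. The grab function, the LINKS table and the example IRIs are all hypothetical, and appending to a list stands in for "download, parse and store".

```python
from collections import deque


def grab(start, links, depth, limit):
    """Walk outward from `start`, at most `depth` hops and `limit` resources."""
    fetched = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue and len(fetched) < limit:
        iri, hops = queue.popleft()
        fetched.append(iri)  # stands in for download + parse + store
        if hops == depth:
            continue  # depth bound reached: do not follow links from here
        for nxt in links.get(iri, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return fetched


# Hypothetical link table: IRI -> IRIs bound to the grab variable.
LINKS = {
    "http://myhost/a": ["http://myhost/b", "http://myhost/c"],
    "http://myhost/b": ["http://myhost/d"],
    "http://myhost/d": ["http://myhost/e"],
}

print(grab("http://myhost/a", LINKS, depth=2, limit=100))
print(grab("http://myhost/a", LINKS, depth=2, limit=2))
```

With depth=2 the walk stops before reaching http://myhost/e, and with limit=2 it stops after two resources regardless of depth.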
Using the Sponger: RDF Proxy Service
- Sponger functionality is also exposed by Virtuoso "/proxy/rdf" endpoint
- An in-built REST style Web service
- Takes a target URL & returns its content "as is" or tries to transform it (by sponging) to RDF
http://demo.openlinksw.com/proxy/rdf/http://www.w3.org/People/Berners-Lee/card
- Provides a "pipe" for RDF browsers to browse non-RDF sources
- Caches to temporary Virtuoso storage
- Cache invalidation similar to traditional Web Browser, based on HTTP 'expires' header
RDF Proxy Service
Parameters:
- url: the URL of the target
- force: if 'rdf' is specified, will try to extract RDF data from the target and return it
- header: HTTP headers to be sent to the target
- output-format: output MIME type of the RDF data
- 'rdf+xml' (default) / 'n3' / 'turtle' / 'ttl'
- if not specified, proxy service uses content negotiation
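A sketch of how a client could assemble a proxy request from these parameters. It assumes the query-string form (/proxy?url=...) that appears elsewhere in these slides, and it assumes the output parameter is spelled output-format on the wire. The proxy_url helper is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def proxy_url(host, target, force=None, output_format=None):
    """Build a /proxy request URL; optional parameters are omitted when unset."""
    params = {"url": target}
    if force:
        params["force"] = force
    if output_format:
        params["output-format"] = output_format  # assumed parameter spelling
    return f"http://{host}/proxy?{urlencode(params)}"


TARGET = "http://www.w3.org/People/Berners-Lee/card"
request_url = proxy_url("demo.openlinksw.com", TARGET, force="rdf", output_format="n3")
print(request_url)

# Decoding the query string recovers the original parameter values.
print(parse_qs(urlsplit(request_url).query))
```

urlencode percent-encodes the target URL, so characters such as ":" and "/" inside it cannot be confused with the proxy URL's own structure.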
Using the Sponger: OpenLink RDF Client Applications
Bundled as part of OpenLink AJAX Toolkit (OAT)
RDF Browser
- Uses /proxy service by default
iSPARQL - Interactive SPARQL query builder
- Uses /sparql service & 5 sponging settings (translated to IRI dereferencing pragmas on server)
- Get Local Data Only
- Get Remote Data When Missing Locally
- Get All Remote Data
- Get All Remote Data & Related Data
- Get Everything
Using the Sponger: ODS-Briefcase (Virtuoso WebDAV)
Briefcase = A component of OpenLink Data Spaces
Includes high level interface to Virtuoso WebDAV repository
- Web browser based interaction
- Web Services support (direct use of WebDAV protocol)
- SPARQL queryable (WebDAV location acts as RDF graph URI)
- Metadata automatically extracted at file upload time
- Wide variety of file formats supported
- All WebDAV resources are exposed as SIOC instance data
- Extracted metadata available in two forms
- Pure WebDAV
- RDF (RDF/XML, FOAF, Turtle) optionally synchronized with Quad Store
- Virtuoso Content Crawler / RDF_Sink folder help automate uploading
SIOC as a Data Space "Glue" Ontology
- ODS has its own built-in cartridges for mapping to SIOC
- All ODS data containers (ODS-Briefcase, ODS-Weblog, ODS-Wiki, ODS-FeedManager etc) expose their data as SIOC instance data
- SIOC
- provides a generic data model of containers, items and associations between items
- Classes include: User, UserGroup, Role, Site, Forum, Post
- SIOC Types Module (sioc-t) defines further types.
- Classes include: AddressBook, BookmarkFolder, ImageGallery etc.
- permits the use of other ontologies (e.g. FOAF) when describing attributes of SIOC entities
- provides a generic wrapper ("glue" ontology) for describing RDF structured data derived from OpenLink Data Spaces
- All ODS-related SIOC data can be queried through SPARQL
Using the Sponger: Directly via Virtuoso PL
- Sponger cartridges are invoked through a cartridge hook
- Provides a Virtuoso PL entry point to the packaged functionality
- Can be called directly from your own Virtuoso PL procedures
Sponger Architecture
- Sponger is comprised of cartridges
- Cartridge = metadata extractor + ontology mapper
- Cartridge is invoked through cartridge hook (Virtuoso PL entry point)
- Metadata extractor
- Performs initial data extraction
- Ontology mapper
- Generates RDF instance data from extracted (non-RDF) metadata
- Extracted metadata is mapped to an ontology associated (via an internal mapping table) with the data source type
- Typically uses XSLT (GRDDL or in-built Virtuoso mapping scheme) or Virtuoso PL
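The extractor-plus-mapper split can be sketched in a few lines. This is only an illustration of the division of labour, not how Virtuoso implements cartridges. The Cartridge class, the toy vCard extractor and the mapping table are hypothetical; the two FOAF property IRIs are real.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]


@dataclass
class Cartridge:
    extract: Callable[[str], Dict[str, str]]  # metadata extractor
    mapping: Dict[str, str]                   # metadata field -> ontology predicate IRI

    def sponge(self, subject_iri: str, content: str) -> List[Triple]:
        """Extract raw metadata, then map known fields onto ontology predicates."""
        metadata = self.extract(content)
        return [(subject_iri, self.mapping[k], v)
                for k, v in metadata.items() if k in self.mapping]


def extract_vcard(content: str) -> Dict[str, str]:
    """Toy extractor: treat each 'KEY:value' line as one metadata field."""
    fields = {}
    for line in content.splitlines():
        key, sep, value = line.partition(":")
        if sep and value:
            fields[key.strip().upper()] = value.strip()
    return fields


VCARD_TO_FOAF = {
    "FN": "http://xmlns.com/foaf/0.1/name",
    "NICKNAME": "http://xmlns.com/foaf/0.1/nick",
}

cartridge = Cartridge(extract=extract_vcard, mapping=VCARD_TO_FOAF)
card = "BEGIN:VCARD\nFN:Alice Example\nNICKNAME:ali\nEND:VCARD"
triples = cartridge.sponge("http://example.org/alice#me", card)
for t in triples:
    print(t)
```

Fields the mapping table does not know about (BEGIN and END here) are extracted but produce no triples, which is the role the mapping table plays in this sketch.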
Sponger Cartridge Invocation
Sponger Configuration using Conductor UI
- Virtuoso Conductor provides a browser-based graphical UI for most Virtuoso administration tasks
- including managing Sponger Cartridges and VADs
- VAD = Virtuoso Application Distribution
- Packaging & distribution system for Virtuoso extensions
- RDF Cartridges VAD
- Bundles a variety of pre-built cartridges for popular Web resources and file types
- Installed as part of default Virtuoso installation
Sponger Configuration using Conductor UI: RDF Cartridges Pane
Sponger Configuration using Conductor UI: GRDDL Filters
Sponger Configuration using Conductor UI: XSLT Templates
Sponger Configuration using Conductor UI: Schema Files / Supported Ontologies
Custom Cartridges
- Sponger is extensible via pluggable cartridge architecture
- Sponge new data formats by creating your own cartridges
- Use Virtuoso PL or any language supported by Virtuoso Server Extensions API (incl: C/C++, Java)
- Before use, register your cartridge in the Cartridge Registry (SYS_RDF_CARTRIDGES table) using Conductor or DML
Custom Cartridges
Cartridge Hook - Virtuoso PL Prototype
in graph_iri varchar: IRI of graph being retrieved
in new_origin_uri varchar: URI of the document being retrieved
in destination varchar: destination graph IRI
inout content any: the document content
inout async_queue any: preallocated asynchronous queue used to call the configured ping service
inout ping_service any: URL of the ping service, as assigned to the PingService parameter in the [SPARQL] section of the virtuoso.ini file. This argument could be used to notify the PingTheSemanticWeb RDF document repository & notification service
inout api_key any: unique string providing cartridge specific data taken from the RC_KEY column of the DB.DBA.SYS_RDF_CARTRIDGES table
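As a rough picture of how a registered hook might be selected and called, the sketch below walks a toy registry and invokes the first hook whose pattern matches the document URI. Matching by regular expression, the registry layout, treating a non-zero return value as "handled", and all names are assumptions made for illustration. The async_queue and ping_service arguments are left out for brevity.

```python
import re

LOADED = []  # (target graph, origin URI, key) tuples recorded by the toy hook


def toy_flickr_hook(graph_iri, new_origin_uri, destination, content, api_key):
    """Hypothetical hook: record what a real hook would load, then report success."""
    target_graph = destination or graph_iri  # plays the role of coalesce (dest, graph_iri)
    LOADED.append((target_graph, new_origin_uri, api_key))
    return 1


def dispatch(registry, graph_iri, new_origin_uri, destination, content):
    """Call the first hook whose pattern matches; return its name, or None."""
    for pattern, hook, key in registry:
        if re.search(pattern, new_origin_uri) and hook(
                graph_iri, new_origin_uri, destination, content, key):
            return hook.__name__
    return None


# Hypothetical registry rows: (URI pattern, hook, cartridge-specific key).
REGISTRY = [(r"^http://(www\.)?flickr\.com/photos/", toy_flickr_hook, "hypothetical-key")]

handled_by = dispatch(REGISTRY, "http://example.org/g",
                      "http://www.flickr.com/photos/someone/123/", None, b"")
print(handled_by, LOADED)
```

The key travels with the registry row rather than with the caller, which matches the description of api_key as cartridge-specific data taken from the registry.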
Flickr Cartridge Extracts
create procedure DB.DBA.RDF_LOAD_FLICKR_IMG (
  in graph_iri varchar, in new_origin_uri varchar, in dest varchar,
  inout _ret_body any, inout aq any, inout ps any, inout _key any)
{
  declare xd, xt, url, tmp, api_key, img_id, hdr, exif any;
  ...
  -- Ask the Flickr REST API for the photo's metadata
  url := sprintf ('http://api.flickr.com/services/rest/?method=flickr.photos.getInfo&photo_id=%s&api_key=%s', img_id, api_key);
  tmp := http_get (url, hdr);
  ...
  xd := xtree_doc (tmp);
  ...
  -- Map the XML response to RDF/XML with the cartridge's XSLT stylesheet
  xt := xslt (registry_get ('_rdf_cartridges_path_') || 'xslt/flickr2rdf.xsl', xd, vector ('baseUri', coalesce (dest, graph_iri), 'exif', exif));
  xd := serialize_to_UTF8_xml (xt);
  -- Load the resulting triples into the destination graph
  DB.DBA.RDF_LOAD_RDFXML (xd, new_origin_uri, coalesce (dest, graph_iri));
  return 1;
}
Custom Resolvers
- Sponger supports pluggable "Custom Resolver" cartridges
- Support dereferencing of other forms of URIs besides HTTP URLs, e.g.:
- URN schemes (LSIDs) and handle schemes (DOIs)
- Greatly extends range of data sources which can be linked into the Semantic Web
http://demo.openlinksw.com/sparql?default-graph-uri=urn:lsid:ubio.org:namebank:11815&should-sponge=soft&query=SELECT+*+WHERE+{?s+?p+?o}&format=text/html
Proxy service also recognizes URNs
http://demo.openlinksw.com/proxy?url=urn:lsid:ubio.org:namebank:11815&force=rdf
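URLs like the SPARQL one above are easier to get right when built programmatically, since the query text needs escaping. A minimal sketch using the parameter names from the example URL; the sponging_sparql_url helper is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def sponging_sparql_url(endpoint, graph_uri, query, sponge="soft", fmt="text/html"):
    """Build a /sparql request that asks the endpoint to sponge `graph_uri`."""
    params = {
        "default-graph-uri": graph_uri,
        "should-sponge": sponge,
        "query": query,
        "format": fmt,
    }
    return endpoint + "?" + urlencode(params)


LSID = "urn:lsid:ubio.org:namebank:11815"
sparql_url = sponging_sparql_url(
    "http://demo.openlinksw.com/sparql", LSID, "SELECT * WHERE {?s ?p ?o}")
print(sparql_url)

# Decoding the query string recovers the original, unescaped values.
decoded = parse_qs(urlsplit(sparql_url).query)
print(decoded["query"][0])
```

urlencode escapes the braces, spaces and colons, so the result differs textually from the hand-written URL but decodes to the same parameters.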