OpenLink Virtuoso (Product Blog)



500 Billion Triples Bulk Load Experiment

Before Christmas, I wrote about a rerun of BSBM to check how it works before running it on 500 giga-triples ("Gt") at CWI. Now we can talk about that experiment on CWI's SciLens cluster. The specs are in the previous post. This is a cautionary tale about large data on the one hand, and about high load speed on the other.

The BSBM generator and driver are an ever-fresh source of new terrors. The nature of the gig is that you have a window to do your experiment in, and that involves first generating the test data. It is somewhere around 3 TB of gzipped files. It took a whole week to make the files. During that time, of course, you want to anticipate what's going to break with the queries. So while the generator was going, we loaded 50 renamed replicas of the 10 Gt dataset. At partial capacity, we may add, because 4 boxes had half their memory taken by the BSBM generator. We hate that program. Of course nobody gives a damn about it, so it has been maintained in the worst way possible; for example, the way its cluster version generates slices of data is by having every instance actually generate the full data set, but only write 1 out of so many items to the output. So no amount of capacity will make it faster.

As for BSBM itself, if you generate 10 Gt once and occasionally use this as test data, it does not inconvenience you so much. Then, of course, the test driver was patched to generate queries against renamed replicas of a dataset. But then the new driver would not read the dataset summary files made by the previous driver, because of Java class versions. 8 hours to regenerate 10 Gt. A real train wreck. This is far from the end of it, but we are out of space. So on with it; may that program be buried.

In the end, the 2000 gz files with the 500 Gt in them were complete. Then it turns out each file has tens of millions of namespace prefixes at the beginning. So, starting to load a file grows the process by some 9 GB just for the prefixes. So, out of 256 GB of RAM per box, there are about 72 GB taken by the prefixes, if you load 8 files in parallel on each. Well, one could do a sed script to unzip, expand the prefixes, and rezip, and the file would not be any bigger; but it would be a day to run.
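The prefix overhead is simple arithmetic, sketched below. The 9 GB per file, 8 files per box, and 256 GB per box come from the text; the helper name is made up for illustration, and the 16-file case is only a what-if for doubling the parallelism.

```python
# Back-of-envelope memory budget for namespace prefixes during the load.
# Figures are taken from the text; the helper name is hypothetical.

PREFIX_GB_PER_FILE = 9   # process growth per file being loaded
RAM_GB_PER_BOX = 256     # RAM per box


def prefix_overhead_gb(files_in_parallel: int) -> int:
    """GB of RAM per box held by namespace prefixes alone."""
    return PREFIX_GB_PER_FILE * files_in_parallel


for files in (8, 16):
    used = prefix_overhead_gb(files)
    share = used / RAM_GB_PER_BOX
    print(f"{files} files/box: {used} GB for prefixes ({share:.0%} of RAM)")
```

At 8 files per box this reproduces the 72 GB quoted above; at 16 files, prefixes alone would take more than half of each box's RAM before any data is buffered.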

So, anyway, with 12 boxes, 24 processes, and (in principle) 384 threads, the load rate is between 3 and 4 million triples per second ("Mt/s"). With 2 boxes, it is 630 Kt/s, so you would say this is scalable. Near enough to linear; the 2 boxes have 12 cores at 2.3 GHz, SciLens has 16 at 2.0 GHz; close enough.
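As a rough check of the "near enough to linear" claim, here is a small sketch that projects the 2-box rate onto the 12-box cluster. It assumes the core counts are per box, which the text does not state outright, and uses boxes x cores x clock as a crude capacity measure; both are my assumptions, not measurements.

```python
# Linear-scaling check for the bulk load rates quoted above.
# Assumption: "12 cores" and "16 cores" are per box; the text does not say so explicitly.

small = {"boxes": 2, "cores_per_box": 12, "ghz": 2.3, "rate_tps": 630_000}
scilens = {"boxes": 12, "cores_per_box": 16, "ghz": 2.0}


def core_ghz(cluster: dict) -> float:
    """Crude capacity measure: boxes x cores x clock."""
    return cluster["boxes"] * cluster["cores_per_box"] * cluster["ghz"]


# Projection by box count alone.
by_boxes = small["rate_tps"] * scilens["boxes"] / small["boxes"]

# Projection by aggregate core-GHz.
by_core_ghz = small["rate_tps"] * core_ghz(scilens) / core_ghz(small)

print(f"projected by box count: {by_boxes / 1e6:.2f} Mt/s")
print(f"projected by core-GHz:  {by_core_ghz / 1e6:.2f} Mt/s")
```

The box-count projection falls inside the observed 3 to 4 Mt/s range, and the core-GHz projection sits somewhat above it, which fits a load that is not using the full platform.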

For the 3-4 Mt/s rate, there is an average of 200 threads running. This is not full platform utilization, as the second thread of each core is idle for the most part. Adding the second thread usually adds some 30% throughput. A high of 5 Mt/s could be had by going to full CPU, but doubling the number of files being loaded would run out of memory because of the namespace prefixes. See, it is sheer luck that the BSBM thing, inept as it is, is still marginally usable, despite the prefixes and the horrible generator. A bit worse still, and it would have been a non-starter. It comes from the times when RDF just meant inept database, so scalability clearly was not in its design objectives.

With 96 files being loaded across the cluster, we got the run stats below for a couple of 4 minute windows. In each, the data size at time of the sample is between 50 Gt and 100 Gt. The long line is the cluster status summary; the tables below are load rates in the windows between timestamps, so, growth in triple count as triples per second (tps) since the previous sample.

Cluster 24 nodes, 240 s. 18866 m/s 692017 KB/s 21842% cpu 7% read 95% clw threads 356r 0w 114i buffers 99250961 97503789 d 2275 w 0 pfs

load rate (tps)	timestamp
3,853,915.028323892	2014-01-04 08:38:36 +0000
4,245,681.678456353	2014-01-04 08:38:33 +0000
3,680,757.080973009	2014-01-04 08:38:06 +0000
4,138,599.125958298	2014-01-04 08:38:03 +0000
4,887,272.575808064	2014-01-04 08:37:36 +0000
4,093,772.082515462	2014-01-04 08:37:33 +0000
4,399,343.552149284	2014-01-04 08:37:06 +0000
4,184,758.045998296	2014-01-04 08:37:03 +0000
3,884,665.444851716	2014-01-04 08:36:36 +0000
4,197,270.027036035	2014-01-04 08:36:33 +0000

Some hours later --

Cluster 24 nodes, 240 s. 14601 m/s 506784 KB/s 19721% cpu 61% read 1310% clw threads 374r 0w 126i buffers 189886490 107378792 d 1983 w 18 pfs

load rate (tps)	timestamp
3,273,757.708076397	2014-01-04 11:49:53 +0000
3,274,119.596013466	2014-01-04 11:49:53 +0000
3,318,539.715342822	2014-01-04 11:49:23 +0000
3,318,701.609946335	2014-01-04 11:49:23 +0000
3,127,730.142328589	2014-01-04 11:48:53 +0000
3,127,731.608946369	2014-01-04 11:48:53 +0000
3,273,572.647578414	2014-01-04 11:48:23 +0000
3,273,622.779240692	2014-01-04 11:48:23 +0000
2,872,466.21779274	2014-01-04 11:47:53 +0000
2,872,495.383487217	2014-01-04 11:47:53 +0000

Pretty good. I don't know of others coming even close.
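To put one number on each window, the sampled rates from the two tables above can be averaged. The sketch below does that and then, purely as an illustration, converts each average into hours for 500 Gt under the assumption that the rate stays constant. That projection is not a measured result; the actual rate visibly drops as the database grows.

```python
# Average the sampled load rates (triples per second) from the two windows above.
# Values are copied from the tables, rounded to two decimals.
from statistics import mean

window_1 = [
    3853915.03, 4245681.68, 3680757.08, 4138599.13, 4887272.58,
    4093772.08, 4399343.55, 4184758.05, 3884665.44, 4197270.03,
]
window_2 = [
    3273757.71, 3274119.60, 3318539.72, 3318701.61, 3127730.14,
    3127731.61, 3273572.65, 3273622.78, 2872466.22, 2872495.38,
]

TOTAL_TRIPLES = 500e9  # the 500 Gt dataset

mean_1 = mean(window_1)
mean_2 = mean(window_2)

for name, rate in (("early window", mean_1), ("later window", mean_2)):
    # Hypothetical projection: assumes the rate holds for the whole load.
    hours = TOTAL_TRIPLES / rate / 3600
    print(f"{name}: {rate / 1e6:.2f} Mt/s, {hours:.0f} h for 500 Gt at that rate")
```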

Next we will look at query plans and scalability in query processing.


01/06/2014 11:50 GMT-0500

Modified: 01/06/2014 17:47 GMT-0500

ESWC 2013 Panel - Semantic Technologies for Big Data Analytics: Opportunities and Challenges

I was invited to the ESWC 2013 "Semantic Technologies for Big Data Analytics: Opportunities and Challenges" panel on 29th May 2013 in Montpellier, France. The panel was moderated by Marko Grobelnik (JSI), with panelists Enrico Motta (KMi), Manfred Hauswirth (NUIG), David Karger (MIT), John Davies (British Telecom), José Manuel Gómez Pérez (ISOCO) and Orri Erling (myself).

Marko opened the panel by looking at the Google Trends search statistics for big data, semantics, business intelligence, data mining, and other such terms. Big data keeps climbing its hype-cycle hill, now above semantics and most of the other terms. But what do these in fact mean? In the leading books about big data, the word semantics does not occur.

I will first recap my 5 minute intro, and then summarize some questions and answers. This is from memory and is in no sense a full transcript.

Presentation

Over the years we have maintained that what the RDF community most needs is good database. Indeed, RDF is relational in essence and, while it requires some new datatypes and other adaptations, there is nothing in it that is fundamentally foreign to RDBMS technology.

This spring, we came through on the promise, delivering Virtuoso 7, packed full of all the state-of-the-art tricks in analytics-oriented databasing, column-wise compressed storage, vectored execution, great parallelism, and flexible scale-out.

At this same ESWC, Benedikt Kaempgen and Andreas Harth presented a paper (No Size Fits All -- Running the Star Schema Benchmark with SPARQL and RDF Aggregate Views) comparing Virtuoso and MySQL on the star schema benchmark at 1G scale. We redid their experiments with Virtuoso 7 at 30x and at 300x the scale.

At present, when running the star schema benchmark in SQL, we outperform column-store pioneer MonetDB by a factor of 2. When running the same star schema benchmark in SPARQL against triples as opposed to tables, we see a slowdown of 5x. When scaling from 30G to 300G and from one to two machines, we get a linear increase in throughput: the run takes 5x longer for 10x more data on twice the hardware.
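The linearity claim reduces to a one-line ratio. A minimal sketch, using only the factors quoted above (the variable names are mine):

```python
# Scaling check for the star schema runs: 10x data, 2x machines, 5x elapsed time.
data_factor = 300 / 30   # scale 30G -> 300G
machine_factor = 2 / 1   # one machine -> two machines
time_factor = 5          # elapsed time grew 5x

# Work per machine grows by data_factor / machine_factor.
work_per_machine_factor = data_factor / machine_factor

# Linear scaling means elapsed time grows exactly with work per machine.
efficiency = work_per_machine_factor / time_factor
print(f"scaling efficiency: {efficiency:.0%}")
```

An efficiency of 1.0 means each machine processes data at the same rate in the two-machine, 300G run as the single machine did at 30G.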

Coming back to MySQL, the run with 1G takes about 60 seconds. Virtuoso SPARQL does the same on 30x the data in 45 seconds. Well, you could say that we should go pick on somebody in our own league rather than MySQL, which is not really relevant for this kind of workload. Comparing with MonetDB and other analytics column stores is of course more relevant.

For cluster scaling, one could say that star schema benchmark is easy, and so it is, but even with harder ones, which do joins across partitions all the time, like the BSBM BI workload, we get scaling that is close to linear.

So, for analytics, you can use SPARQL in Virtuoso, and run circles around some common SQL databases.

The difference between SQL and SPARQL comes from having no schema. Instead of scanning aligned columns in a table, you do an index lookup for each column. This is not too slow if there is locality, as there is, but still a lot more than when talking about a multicolumn column-compressed table. With more execution tricks, we can maybe cut this to 3x.
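The access-pattern difference can be shown with a toy example. This is only an illustration of "aligned columns" versus "one index lookup per column"; the data, names, and dictionary-based indexes are invented here and say nothing about Virtuoso's actual storage layout.

```python
# Toy contrast between a column scan and per-column index lookups.
# Illustrates the access pattern only; this is not Virtuoso's storage layout.

# Relational, column-wise: values of one row share the same position.
discount = [1, 3, 2, 0]
revenue = [100, 250, 175, 90]


def sum_revenue_table(min_discount):
    # One pass over aligned columns; no lookups needed.
    return sum(r for d, r in zip(discount, revenue) if d >= min_discount)


# Schema-less: one index per predicate, keyed by subject.
triples = {
    "discount": {"order1": 1, "order2": 3, "order3": 2, "order4": 0},
    "revenue": {"order1": 100, "order2": 250, "order3": 175, "order4": 90},
}


def sum_revenue_triples(min_discount):
    total = 0
    for subject, d in triples["discount"].items():
        if d >= min_discount:
            # Each further "column" costs an index lookup by subject.
            total += triples["revenue"][subject]
    return total


print(sum_revenue_table(2), sum_revenue_triples(2))
```

Both functions return the same answer; the second simply does one extra lookup per qualifying subject for every additional predicate the query touches.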

The beach-head of workable RDF-based analytics on schema-less data has been attained. Medium-scale data, to the single-digit terabytes, is OK on small clusters.

What about the future?

First, Big Data means more than querying. Before meaningful analytics can be done, the data must generally be prepared and massaged. This means fast bulk load and fast database-resident transformation. We have that via flexible, expressive, parallelizable stored procedures and run time hosting. One can do everything one does in MapReduce right inside the database.

Some analytics cannot be expressed in a query language. For example, graph algorithms like clustering generate large intermediate states and run in many passes. For this, bulk synchronous processing frameworks like Giraph are becoming popular. We can again do this right inside the DBMS, on RDF or SQL tables. There is great platform utilization and more flexibility than in strict BSP, while being able to do any BSP algorithm.
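For readers who have not met the bulk synchronous pattern, here is a minimal pure-Python sketch. It uses connected components by label propagation as a simple stand-in for the clustering algorithms mentioned above. It is not Giraph's API and not how this is written inside Virtuoso; it only shows the superstep-and-barrier structure.

```python
# Minimal bulk-synchronous (BSP) sketch: connected components by label propagation.
# Pure-Python illustration of the superstep pattern; not Giraph's or Virtuoso's API.
from collections import defaultdict


def connected_components(edges):
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Each vertex starts labeled with its own id.
    label = {v: v for v in neighbors}

    # Initial messages: every vertex sends its label to its neighbors.
    inbox = defaultdict(list)
    for v in neighbors:
        for n in neighbors[v]:
            inbox[n].append(label[v])

    supersteps = 0
    while inbox:
        supersteps += 1
        outbox = defaultdict(list)
        for v, messages in inbox.items():
            smallest = min(messages)
            if smallest < label[v]:
                # Label improved: adopt it and notify neighbors.
                label[v] = smallest
                for n in neighbors[v]:
                    outbox[n].append(smallest)
        # Barrier: messages produced in this superstep are seen only in the next.
        inbox = outbox
    return label, supersteps


labels, steps = connected_components([(1, 2), (2, 3), (4, 5)])
print(labels, steps)
```

The per-superstep message lists are the "large intermediate state" referred to above; the algorithm stops when a superstep produces no messages.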

The history of technology is one of divergence followed by reintegration. New trends, like Column stores, RDF databases, key value stores, or MapReduce, start as one-off special-purpose products, and the technologies then find their way back into platforms addressing a broader functionality.

The whole semantic experiment might be seen as a break-away from the web, if also a little from database, for the specific purpose of exploring schemaless-ness, universal referenceability of data, self-describing data, and some inference.

With RDF, we see lasting value in globally consistent identifiers. The URI "superkey" is the ultimate silo-breaker. The future is in integrating more and more varied data and a schema-first approach is cost-prohibitive. If data is to be preserved over extended lengths of time, self-description is essential; the applications and people that produced the data might not be around. Same for publishing data for outside reuse.

In fact, many of these things are right now being pursued in mainstream IT. Everybody is reinventing the triple, whether by using non-first normal form key-value pairs in an RDB, tagging each row of a table with the name of the table, using XML documents, etc. The RDF model provides all these desirable features, but most applications that need these things do not run on RDF infrastructure.

Anyway, by revolutionizing RDF store performance, we make this technology a cost-effective alternative in places where it was not such before.

To get much further in performance, physical storage needs to adapt to the data. Thus, in the long term, we see RDF as a lingua franca of data interchange and publishing, supported by highly scalable and adaptive databases that exploit the structure implicit in the data to deliver performance equal to the best in SQL data warehousing. When we get the schema from the data, we have schema-last flexibility and schema-first performance. The genie is back in the bottle, and data models are unified.

Questions and Answers

Q: Is the web big data?

David Karger: No, the shallow web (i.e., static web pages for purposes of search) is not big data. One can put it in a box and search. But for purposes of more complex processing, like analytics on the structure of the whole web, this is still big data.

Q: I bet you still can't do analytics on a fast stream.

Orri Erling: I am not sure about that, because when you have a stream -- whether this is network management and denial of service detection, or managing traffic in a city -- you know ahead of time what peak volume you are looking at, so you can size the system accordingly. And streams have a schema. So you can play all the database tricks. Vectored execution will work there just as it does for query processing, for example.

Q: I did not mean storage, I meant analysis.

Orri Erling: Here we mean sliding windows and continuous queries. The triple vs. row issue also seems the same. There will be some overhead from schema-lastness, but for streams, I would say each has a regular structure.

John Davies: For example, we gather gigabytes a minute of traffic data from sensors in the road network and all this data is very regular, with a fixed schema.

Manfred Hauswirth: Or is this always so? The internet of things has potentially huge diversity in schema, with everything producing a stream. The user of the stream has no control whatever on the schema.

Marko Grobelnik: Yes, we have had streams for a long time -- on Wall Street, for example, where these make a lot of money. But high frequency trading is a very specific application. There is a stream, some analytics, not very complicated, just fast. This is one specific solution, with fixed schema and very specific scope, no explicit semantics.

Q: What is big data, in fact?

David Karger: Computer science has always been about big data; it is just the definition of big that changes. Big data is something one cannot conveniently process on a computer system. Not without unusual tricks, where something trivial, like shortest path, becomes difficult just because of volume. So it is that big data is very much about performance, and performance is usually obtained by sacrificing the general for the specific. The semantic world on the other hand is after something very general and about complex and expressive schema. When data gets big, the schema is vanishingly small in comparison with the data, and the schema work gets done by hand; the schema is not the problem there. Big data is not very internetty either, because the 40 TB produced by the telescope are centrally stored and you do not download them or otherwise transport them very much.

Q: Now, what does each of you understand by semantics?

Manfred Hauswirth: The essential aspect is that data is machine interpretable, with sufficient machine readable context.

David Karger: Semantics has to do with complexity or heterogeneity in the schema. Big data has to do with large volume. Maybe semantic big data would be all the databases in the world with a million different schemas. But today we do not see such applications. If the volume is high, the schema is usually not very large.

Manfred Hauswirth: This is not so far off; for example, a telco has over a hundred distinct network management systems, and each has a different schema.

Orri Erling: From the data angle, we have come to associate semantic with

schema-lastness
globally-resolvable identifiers
self-description

When people use RDF as a storage model, they mostly do so because of schema flexibility, not because of expressive schemas or inference. Some use a little inference, but inference or logics or considerations of knowledge representation do not in our experience drive the choice.

Conclusion

In conclusion, the event was rather peaceful, with a good deal of agreement between the panelists and audience and no heated controversy. I hoped to get some reaction when I said that semantics was schema flexibility, but apparently this has become a politically acceptable stance. In the golden days of AI this would not have been so. But then Marko Grobelnik did point out that the whole landscape has become data driven. Even in fields like natural language, one looks more at statistics than deep structure: For example, if a phrase is often found on Google, it is proper usage.


06/04/2013 16:05 GMT-0500

Modified: 08/21/2015 14:15 GMT-0500

Virtuoso 7 Release

The quest of OpenLink Software is to bring flexibility, efficiency, and expressive power to people working with data. For the past several years, this has been focused on making graph data models viable for the enterprise. Flexibility in schema evolution is a central aspect of this, as is the ability to share identifiers across different information systems, i.e., giving things URIs instead of synthetic keys that are not interpretable outside of a particular application.

With Virtuoso 7, we dramatically improve the efficiency of all this. With databases in the billions of relations (also known as triples, or 3-tuples), we can fit about 3x as many relations in the same space (disk and RAM) as with Virtuoso 6. Single-threaded query speed is up to 3x better, plus there is intra-query parallelization even in single-server configurations. Graph data workloads are all about random lookups. With these, having data in RAM is all-important. With 3x space efficiency, you can run with 3x more data in the same space before starting to go to disk. In some benchmarks, this can make a 20x gain.

Also the Virtuoso scale-out support is fundamentally reworked, with much more parallelism and better deployment flexibility.

So, for graph data, Virtuoso 7 is a major step in the coming of age of the technology. Data keeps growing and time is getting scarcer, so we need more flexibility and more performance at the same time.

So, let’s talk about how we accomplish this. Column stores have been the trend in relational data warehousing for over a decade. With column stores comes vectored execution, i.e., running any operation on a large number of values at one time. Instead of running one operation on one value, then the next operation on the result, and so forth, you run the first operation on thousands or hundreds-of-thousands of values, then the next one on the results of this, and so on.
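The two execution styles can be sketched side by side. The operators and the vector size below are made up for illustration, and in Python the two versions will not differ much in speed; the point is only the shape of the control flow, which in a real engine amortizes per-operator overhead over many values.

```python
# Tuple-at-a-time versus vectored execution of the same two-operator pipeline.
# Illustrative only; the operators and the vector size are made up.

def scale(x):
    return x * 2


def passes(x):
    return x % 3 == 0


def tuple_at_a_time(values):
    out = []
    for v in values:
        # Both operators run on one value before the next value is touched.
        s = scale(v)
        if passes(s):
            out.append(s)
    return out


def vectored(values, vector_size=1000):
    out = []
    for start in range(0, len(values), vector_size):
        batch = values[start:start + vector_size]
        # Operator 1 runs over the whole batch...
        scaled = [scale(v) for v in batch]
        # ...then operator 2 runs over all of operator 1's results.
        out.extend(s for s in scaled if passes(s))
    return out


data = list(range(10_000))
print(tuple_at_a_time(data) == vectored(data))
```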

Column-wise storage brings space efficiency, since values in one column of a table tend to be alike -- whether repeating, sorted, within a specific range, or picked from a particular set of possible values. With graph data, where there are no columns as such, the situation is exactly the same -- just substitute the word predicate for column. Space efficiency brings speed -- first by keeping more of the data in memory; secondly by having less data travel between CPU and memory. Vectoring makes sure that data that are closely located get accessed in close temporal proximity, hence improving cache utilization. When there is no locality, there are a lot of operations pending at the same time, as things always get done on a set of values instead of on a single value. This is the crux of the science of columns and vectoring.
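A small sketch of why alike values compress well: run-length encoding turns repeated values in a column into (value, count) pairs. This is a generic technique chosen for illustration, with made-up sample data; it is not a description of the specific compression schemes Virtuoso uses.

```python
# Run-length encoding of one column: values that repeat or arrive sorted
# collapse into (value, count) pairs. A generic sketch of why columns
# (or, for graph data, predicates) compress well; not Virtuoso's actual format.
from itertools import groupby


def rle_encode(column):
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def rle_decode(pairs):
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical predicate column of ten quads, sorted by predicate.
predicate_column = ["rdf:type"] * 5 + ["rdfs:label"] * 3 + ["foaf:name"] * 2
encoded = rle_encode(predicate_column)

print(encoded)
print(len(predicate_column), "values stored as", len(encoded), "runs")
```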

Of the prior work in column stores, Virtuoso may most resemble Vertica, well described in Daniel Abadi’s famous PhD thesis. Virtuoso itself is described in IEEE Data Engineering Bulletin, March 2012 (PDF). The first experiments in column store technology with Virtuoso were in 2009, published at the SemData workshop at VLDB 2010 in Singapore. We tried storing TPC-H as graph data and in relational tables, each with both rows and columns, and found that we could get 6 bytes per quad space utilization with the RDF-ization of TPC-H, as opposed to 27 bytes with the row-wise compressed RDF storage model. The row-wise compression itself is 3x more compact than a row-wise representation with no compression.

Memory is the key to speed, and space efficiency is the key to memory. Performance comes from two factors: locality and parallelism. Both are addressed by column store technology. This made me a convert.

At this time, we also started the EU FP7 project, LOD2, most specifically working with Peter Boncz of CWI, the king of the column store, famous for MonetDB and VectorWise. This cooperation goes on within LOD2 and has extended to LDBC, an FP7 project for designing benchmarks for graph and RDF databases. Peter has given us a world of valuable insight and experience in all aspects of avant garde database, from adaptive techniques to query optimization and beyond. One recently published item is the set of results for a Virtuoso cluster running analytics on 150 billion relations on CWI’s SciLens cluster.

The SQL relational table-oriented databases and property graph-oriented databases (Graph for short) are both rooted in relational database science. Graph management simply introduces extra challenges with regards to scalability. Hence, at OpenLink Software, having a good grounding in the best practices of relational columnar (or column-wise) database management technology is vital.

Virtuoso is more prominently known for high-performance RDF-based graph database technology, but the entirety of its SQL relational data management functionality (which is the foundation for graph store) is vectored, and even allows users to choose between row-wise and column-wise physical layouts, index by index.

It has been asked: is this a new NoSQL engine? Well, there isn’t really such a thing. There are of course database engines that do not have SQL support and it has become trendy to call them "NoSQL." So, in this space, Virtuoso is an engine that does support SQL, plus SPARQL, and is designed to do big joins and aggregation (i.e., analytics) and fast bulk load, as well as ACID transactions on small updates, all with column store space efficiency. It is not only for big scans, as people tend to think about column stores, since it can also be used in compact embedded form.

Virtuoso also delivers great parallelism and throughput in a scale-out setting, with no restrictions on transactions and no limits on joining. The base is in relational database science, but all the adaptations that RDF and graph workloads need are built-in, with core level support for run-time data-typing, URIs as native Reference types, user-defined custom data types, etc.

Now that the major milestone of releasing Virtuoso 7 (open source and commercial editions) has been reached, the next steps include enabling our current and future customers to attain increased agility from big (linked) open data exploits. Technically, it will also include continued participation in DBMS industry benchmarks, such as those from the TPC, and others under development via the Linked Data Benchmark Council (LDBC), plus other social-media-oriented challenges that arise in this exciting data access, integration, and management innovation continuum. Thus, continue to expect new optimization tricks to be introduced at frequent intervals through the open source development branch at GitHub, between major commercial releases.


05/13/2013 18:06 GMT-0500

Modified: 08/21/2015 14:17 GMT-0500

Developer Opportunities at OpenLink Software

If it is advanced database technology, you will get to do it with us.

We are looking for exceptional talent to implement some of the hardest stuff in the industry. This ranges from new approaches to query optimization; to parallel execution (both scale up and scale out); to elastic cloud deployments and self-managing, self-tuning, fault-tolerant databases. We are most familiar to the RDF world, but also have full SQL support, and the present work will serve both use cases equally.

We are best known in the realms of high-performance database connectivity middleware and massively-scalable Linked-Data-oriented graph-model DBMS technology.

We have the basics -- SQL and SPARQL, column store, vectored execution, cost based optimization, parallel execution (local and cluster), and so forth. In short, we have everything you would expect from a DBMS. We do transactions as well as analytics, but the greater challenges at present are on the analytics side.

You will be working with my team covering:

Adaptive query optimization -- interleaving execution and optimization, so as to always make the correct plan choices based on actual data characteristics
Self-managing cloud deployments for elastic big data -- clusters that can grow themselves and redistribute load, recover from failures, etc.
Developing and analyzing new benchmarks for RDF and graph databases
Embedding complex geospatial reasoning inside the database engine. We have the basic R-tree and the OGC geometry data types; now we need to go beyond this
Every type of SQL optimizer and execution engine trick that serves to optimize for TPC-H and TPC-DS.

What do I mean by really good? It boils down to being a smart and fast programmer. We have over the years talked to people, including many who have worked on DBMS programming, and found that they actually know next to nothing of database science. For example, they might not know what a hash join is. Or they might not know that interprocess latency is in the tens of microseconds even within one box, and that in that time one can do tens of index lookups. Or they might not know that blocking on a mutex kills.
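For reference, the hash join mentioned above fits in a dozen lines. This is a minimal in-memory sketch of the textbook build-and-probe technique, with invented table and column names; a real engine adds partitioning, spilling, and vectoring on top.

```python
# Minimal in-memory hash join (equi-join) sketch: build, then probe.
# Table and column names are invented for the example.

def hash_join(build_rows, probe_rows, build_key, probe_key):
    # Build phase: hash the (ideally smaller) input on its join key.
    table = {}
    for row in build_rows:
        table.setdefault(row[build_key], []).append(row)

    # Probe phase: one hash lookup per row of the other input.
    joined = []
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            joined.append({**match, **row})
    return joined


customers = [{"cust_id": 1, "name": "Ann"}, {"cust_id": 2, "name": "Bob"}]
orders = [
    {"order_id": 10, "cust_id": 1},
    {"order_id": 11, "cust_id": 1},
    {"order_id": 12, "cust_id": 3},
]

result = hash_join(customers, orders, "cust_id", "cust_id")
print(result)
```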

If you do core database work, we want you to know how many CPU cache misses you will have in flight at any point of the algorithm, and how many clocks will be spent waiting for them at what points. Same for distributed execution: The only way a cluster can perform is having max messages with max payload per message in flight at all times.

These are things that can be learned. So I do not necessarily expect that you have in-depth experience of these, especially since most developer jobs are concerned with something else. You may have to unlearn the bad habit of putting interfaces where they do not belong, for example. Or to learn that if there is an interface, then it must pass as much data as possible in one go.

Talent is the key. You need to be a self-starter with a passion for technology and have competitive drive. These can be found in many guises, so we place very few limits on the rest. If you show you can learn and code fast, we don't necessarily care about academic or career histories. You can be located anywhere in the world, and you can work from home. There may be some travel but not very much.

In the context of EU FP7 projects, we are working with some of the best minds in database, including Peter Boncz of CWI and VU Amsterdam (MonetDB, VectorWise) and Thomas Neumann of Technical University of Munich (RDF3X, HYPER). This is an extra guarantee that you will be working on the most relevant problems in database, informed by the results of the very best work to date.

For more background, please see the IEEE Computer Society Bulletin of the Technical Committee on Data Engineering, Special Issue on Column Store Systems.

All articles and references therein are relevant for the job. Be sure to read the CWI work on run time optimization (ROX), cracking, and recycling. Do not miss the many papers on architecture-conscious, cache-optimized algorithms; see the VectorWise and MonetDB articles in the bulletin for extensive references.

If you are interested in an opportunity with us, we will ask you to do a little exercise in multithreaded, performance-critical coding, to be detailed in a blog post in a few days. If you have done similar work in research or industry, we can substitute the exercise with a suitable sample of this, but only if this is core database code.

There is a dual message: The challenges will be the toughest a very tough race can offer. On the other hand, I do not want to scare you away prematurely. Nobody knows this stuff, except for the handful of people who actually do core database work. So we are not limiting this call to this small crowd and will teach you on the job if you just come with an aptitude to think in algorithms and code fast. Experience has pros and cons so we do not put formal bounds on this. "Just out of high school" may be good enough, if you are otherwise exceptional. Prior work in RDF or semantic web is not a factor. Sponsorship of your M.Sc. or Ph.D. thesis, if the topic is in our line of work and implementation can be done in our environment, is a further possibility. Seasoned pros are also welcome and will know the nature of the gig from the reading list.

We are aiming to fill the position(s) between now and October.

Resumes and inquiries can be sent to Hugh Williams, hwilliams@openlinksw.com. We will contact applicants for interviews.


08/07/2012 13:21 GMT-0500

Virtuoso 6.2 brings New Features!

Virtuoso 6.2 introduces a large number of enhancements in areas including:

Linked Data Deployment
Linked Data Middleware
Data Virtualization
Dynamic Data Exchange & Data Replication
Security

Linked Data Deployment

Feature	Description	Benefit
Automatic Deployment	Linked Data Pages are now automatically published for every Virtuoso Data Object; users need only load their data into the RDF Quad Store.	Handcrafted URL-Rewrite Rules are no longer necessary.
HTTP Metadata Enhancements	HTTP `Link:` header is used to transfer vital metadata (e.g., relationships between a Descriptor Resource and its Subject) from HTTP Servers to User Agents.	Enables HTTP-oriented tools to work with such relationships and other metadata.
HTML Metadata Embedding	HTML resource `<head />` and `<link />` elements and their `@rel` attributes are used to transfer vital metadata (e.g., relationships between a Descriptor Resource and its Subject) from HTTP Servers to User Agents.	Enables HTML-oriented tools to work with such relationships and other metadata.
Hammer Stack Auto-Discovery Patterns	HTML resource `<head />` section and `<link />` elements, the HTTP `Link:` header, and XRD-based `"host-meta"` resources collectively provide structured metadata about Virtuoso hosts, associated Linked Data Spaces, and specific Data Items (Entities).	Enables humans and machines to easily distinguish between Descriptor Resources and their Subjects, irrespective of URI scheme.
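
As a rough illustration of how an HTTP-oriented tool might consume the `Link:` header metadata described above, here is a minimal Python sketch that splits a header value into target URIs and their parameters. The `example.org` URIs and the specific `rel` values are assumptions chosen for illustration, not values taken from a Virtuoso response, and the parser only covers the common `<uri>; name="value"` form.

```python
import re

def parse_link_header(value):
    """Return a list of (target_uri, params) pairs from a Link header value."""
    links = []
    # Link values are comma-separated; split only on commas that precede "<"
    # so commas inside parameter values are left alone.
    for part in re.split(r',\s*(?=<)', value.strip()):
        m = re.match(r'<([^>]*)>(.*)', part, re.S)
        if not m:
            continue
        target, rest = m.group(1), m.group(2)
        params = {}
        # Parameters look like ;name="quoted value" or ;name=token
        for name, quoted, bare in re.findall(
                r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^;]*))', rest):
            params[name.lower()] = quoted if quoted else bare.strip()
        links.append((target, params))
    return links

# Hypothetical header value: a descriptor document pointing at alternates.
example = ('<http://example.org/about/thing>; rel="describedby", '
           '<http://example.org/data/thing.rdf>; rel="alternate"; '
           'type="application/rdf+xml"')

for target, params in parse_link_header(example):
    print(params.get("rel"), "->", target)
```

A user agent could use the `rel` value to decide which link identifies the Descriptor Resource and which identifies an alternative representation.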

Linked Data Middleware

Feature	Description	Benefit
New Sponger Cartridges	New cartridges (data access and transformation drivers) for Twitter, Facebook, Amazon, eBay, LinkedIn, and others.	Enable users and user agents to deal with the Sponged data spaces as though they were named graphs in a quad store, or tables in an RDBMS.
New Descriptor Pages	HTML-based descriptor pages are automatically generated.	Descriptor subjects, and the constellation of navigable attribute-and-value pairs that constitute their descriptive representation, are clearly identified.
Automatic Subject Identifier Generation	De-referenceable data object identifiers are automatically created.	Removes tedium and risk of error associated with nuance-laced manual construction of identifiers.
Support for OData, JSON, RDFa	Additional data representation and serialization formats associated with Linked Data.	Increases flexibility and interoperability.

Data Virtualization

Feature	Description	Benefit
Materialized RDF Views	RDF Views over ODBC/JDBC Data Sources can now (optionally) keep the Quad Store in sync with the RDBMS data source.	Enables high-performance Faceted Browsing while remaining sensitive to changes in the RDBMS data sources.
CSV-to-RDF Transformation	Wizard-based generation of RDF Linked Data from CSV files.	Speeds deployment of data which may only exist in CSV form as Linked Data.
Transparent Data Access Binding	SPASQL (SPARQL Query Language integrated into SQL) is usable over ODBC, JDBC, ADO.NET, OLEDB, or XMLA connections.	Enables Desktop Productivity Tools to transparently work with any blend of RDBMS and RDF data sources.
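
To make the SPASQL row above more concrete: a SPARQL query travels over an ordinary SQL connection when the statement text starts with the `SPARQL` keyword. The sketch below assumes a DB-API style cursor (for example one obtained from an ODBC bridge). The helper names `to_spasql` and `run_spasql` are hypothetical, and the connection details in the comments are placeholders.

```python
def to_spasql(sparql_query):
    """Prefix a SPARQL query with the SPARQL keyword for use on a SQL connection."""
    text = sparql_query.strip()
    if not text:
        raise ValueError("empty query")
    # Leave the text alone if the caller already added the keyword.
    if text.split(None, 1)[0].upper() == "SPARQL":
        return text
    return "SPARQL " + text

def run_spasql(cursor, sparql_query):
    """Send the query through any DB-API style cursor and return all rows."""
    cursor.execute(to_spasql(sparql_query))
    return cursor.fetchall()

# Hypothetical usage with an ODBC bridge (DSN and credentials are placeholders):
#   import pyodbc
#   conn = pyodbc.connect("DSN=MyVirtuosoDSN;UID=...;PWD=...")
#   rows = run_spasql(conn.cursor(), "SELECT ?s WHERE { ?s ?p ?o } LIMIT 5")

print(to_spasql("SELECT ?s WHERE { ?s ?p ?o } LIMIT 5"))
```

Because the result comes back as an ordinary row set, tools that only understand SQL connections can treat it like any other query result.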

Dynamic Data Exchange & Data Replication

Feature	Description	Benefit
Quad Store to Quad Store Replication	High-fidelity graph-data replication between one or more database instances.	Enables a wide variety of deployment topologies.
Delta Engine	Automated generation of deltas at the named-graph level, matching the transactional replication offered by the Virtuoso SQL engine.	Brings RDF replication on par with SQL replication.
PubSubHubbub Support	Deep integration within Quad Store as an optional mechanism for shipping deltas.	Enables push-based data replication across a variety of topologies.
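
The idea of a named-graph-level delta can be sketched in a few lines of Python: given two snapshots of one graph, the delta is the set of triples added and the set removed. This is only a conceptual illustration using in-memory sets. It is not how the Virtuoso Delta Engine is implemented, and the function name and sample triples are made up for the example.

```python
def graph_delta(old_triples, new_triples):
    """Compute the triples added to and removed from one named graph."""
    old, new = set(old_triples), set(new_triples)
    return {
        "added": sorted(new - old),    # present now, absent before
        "removed": sorted(old - new),  # present before, absent now
    }

# Two hypothetical snapshots of the same named graph.
before = [
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:knows", "ex:bob"),
]
after = [
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:knows", "ex:carol"),
]

delta = graph_delta(before, after)
print(delta["added"])
print(delta["removed"])
```

A subscriber that applies the removals and then the additions ends up with the same graph as the publisher, which is what makes shipping deltas cheaper than shipping whole graphs.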

Security

Feature	Description	Benefit
WebID support at the DBMS core	Use WebID protocol for low-level ACL-based protection of database objects (RDF or Relational) and Web Services.	Enables application of sophisticated security and data access policies to Web Services (e.g., SPARQL endpoint) and actual DBMS objects.
Webfinger	Supports using `mailto:` and `acct:` URIs in the context of WebID and other mechanisms, when domain holders have published necessary XRDS resources.	Enables more intuitive identification of people and organizations.
Fingerpoint	Similar to Webfinger but does not require XRDS resources; instead, it works directly with SPARQL endpoints exposed using auto-discovery patterns in the `<head />` section of HTML documents.	Enables more intuitive identification of people and organizations.
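
For readers unfamiliar with the Webfinger pattern mentioned above, the following Python sketch shows the general discovery steps: split a `mailto:` or `acct:` URI into user and host, derive the host's well-known `host-meta` location, and expand a `{uri}` link template of the kind such a document typically publishes. The function names, the sample account, and the sample template are assumptions for illustration. The sketch makes no network requests and does not claim to reproduce Virtuoso's own lookup logic.

```python
from urllib.parse import quote

def split_account_uri(uri):
    """Split an acct: or mailto: URI into (user, host)."""
    scheme, sep, rest = uri.partition(":")
    if not sep or scheme.lower() not in ("acct", "mailto"):
        raise ValueError("expected an acct: or mailto: URI")
    user, at, host = rest.rpartition("@")
    if not at or not user or not host:
        raise ValueError("expected user@host after the scheme")
    return user, host.lower()

def host_meta_url(uri):
    """Build the well-known host-meta location for the account's domain."""
    _, host = split_account_uri(uri)
    return "https://" + host + "/.well-known/host-meta"

def expand_template(template, uri):
    """Fill a {uri} link template with the percent-encoded account URI."""
    return template.replace("{uri}", quote(uri, safe=""))

account = "acct:alice@example.org"  # hypothetical account
print(host_meta_url(account))
print(expand_template("https://example.org/describe?uri={uri}", account))
```

The expanded URL is where a client would then look for the descriptor of the person or organization behind the account.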

# PermaLink Comments [5]

09/22/2010 17:08 GMT-0500

Modified: 08/21/2015 14:43 GMT-0500

The future of the Semantic Web? It’s already here | Cision Blog

December 11, 2008

The future of the Semantic Web? It’s already here

Author: Jay Krall

An expert weighs in on how Web 3.0 is about to make media monitoring easier

For public relations professionals, finding mentions about a particular brand or product is getting more challenging as the vast clutter of the Web continues to grow. While paid monitoring services like those offered by Cision and others can help, for those using free-text search engines like Google for media monitoring, combing through pages of irrelevant search results has become routine. For example, acronyms pose a problem: how many instances of the term “HP” referring to “horsepower” do you have to sift through to find articles about Hewlett-Packard products? Plenty.

Worse yet, the longer your queries get, the harder it is for search engines to find what you really want. It’s almost 2009. With all this technological innovation happening so fast, why does it seem like computers still can’t read very well? If they were more literate, the monitoring of media and social media for brand mentions would be a lot easier for everyone.

That’s just one practical argument for the importance of the Semantic Web. First described in 1999 by World Wide Web Consortium director Tim Berners-Lee, the Semantic Web, also referred to as Web 3.0, is often described as a vision for the next generation of the Web: pages that can search each other and pull from each other’s data intelligently, melding Web sites and news feeds into precisely honed, individual Web experiences. But actually, the technologies of the Semantic Web are already hard at work, thanks to a group of computer scientists from around the world who are making Berners-Lee’s vision a reality.

Kingsley Idehen, CEO of OpenLink Software, is one of those pioneers. He is one of the creators of DBpedia, a Semantic Web tool that culls data from Wikipedia in amazingly precise ways. The project is a collaboration of OpenLink Software, the University of Leipzig and Freie University Berlin. Simply put, it divides up the site’s information into tags, and uses those tags to develop searches in which the subject is clearly defined, using a computer language that could soon be applied all across the Web. Beginning in late 2006, a program assigned 274 million tags describing nearly 1 billion facts to catalog Wikipedia in this way using the Resource Description Framework (RDF), a commonly accepted format for Semantic Web applications.

( Full story ... )

# PermaLink Comments [0]

12/11/2008 15:46 GMT-0500

DataSpaces Bulletin: December issue now online!

The highly anticipated December 2008 issue of the DataSpaces Bulletin is now available!

This month's DataSpaces contains material of interest to the Virtuoso developer and UDA user community alike —

Introduction to Virtuoso Universal Server (Cloud Edition).
Links to Virtuoso and Linked Data mailing lists.
UDA license management tips and tricks.

# PermaLink Comments [0]

12/09/2008 13:21 GMT-0500

Modified: 12/09/2008 15:06 GMT-0500

Enterprise Databases get a grip on XML

Databases get a grip on XML
From InfoWorld.

The next iteration of the SQL standard was supposed to arrive in 2003. But SQL standardization has always been a glacially slow process, so nobody should be surprised that SQL:2003 (now known as SQL:200n) isn't ready yet. Even so, 2003 was a year in which XML-oriented data management, one of the areas addressed by the forthcoming standard, showed up on more and more developers' radar screens.