Virtuso Data Space Bot
Burlington, United States
Virtuoso loads 110,500 triples-per-second on LUBM 8000
LUBM load speed still seems to be a metric that is quoted in comparisons of RDF stores. Consequently, we too measured the load time of LUBM 8000, which is 1,068 million triples, on the newest Virtuoso.
The real time for the load was 161m 3s. The rate was 110,532 triples-per-second. The hardware was one machine with 2 x Xeon 5410 (quad core, 2.33 GHz) and 16 GB of 667 MHz RAM. The software was Virtuoso 6 Cluster, configured into 8 partitions (processes), one partition per CPU core. Each partition had its database striped over 6 disks; the 6 disks on the system were shared between the 8 database processes.
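As a quick consistency check on these figures, the following sketch recomputes the numbers from the elapsed time and the reported rate. It only uses values quoted above; the variable names are ours.

```python
# Figures quoted in the post.
elapsed_s = 161 * 60 + 3          # 161m 3s of real time, in seconds
rate_tps = 110_532                # reported triples per second

# Triple count implied by the reported rate and the elapsed time.
implied_triples = rate_tps * elapsed_s

# Rate obtained if one starts from the rounded "1,068 million" figure instead.
rounded_rate = 1_068_000_000 / elapsed_s

print(elapsed_s)                  # 9663
print(implied_triples)            # 1068070716
print(round(rounded_rate))        # 110525
```

The implied count of about 1,068.07 million triples agrees with the rounded 1,068 million figure, so the small difference between the two rates is rounding only.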
The load was done on 8 streams, one per server process. At the beginning of the load, the CPU usage was 740% with no disk wait; at the end, it was around 700% with 25% disk wait. Here 100% stands for one CPU core or one disk being constantly busy.
The RDF store was configured with the default two indices over quads, these being GSPO and OGPS. Text indexing of literals was not enabled. No materialization of entailed triples was made.
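To illustrate why two key orders are useful, here is a minimal sketch of the same quads kept under a GSPO and an OGPS ordering, with lookups by key prefix. This is a toy model with made-up data and helper names (`build_index`, `prefix_scan`), not Virtuoso's actual storage layout.

```python
from bisect import bisect_left

# Hypothetical toy quads as (graph, subject, predicate, object).
quads = [
    ("g1", "s1", "p1", "o1"),
    ("g1", "s1", "p2", "o2"),
    ("g1", "s2", "p1", "o1"),
    ("g2", "s3", "p1", "o2"),
]

def build_index(rows, order):
    """Return the rows re-keyed in the given column order, sorted by key."""
    pos = {"G": 0, "S": 1, "P": 2, "O": 3}
    return sorted(tuple(q[pos[c]] for c in order) for q in rows)

def prefix_scan(index, prefix):
    """Return all keys of a sorted index that start with the given prefix."""
    start = bisect_left(index, prefix)   # first key >= prefix
    out = []
    for key in index[start:]:
        if key[:len(prefix)] != prefix:
            break
        out.append(key)
    return out

gspo = build_index(quads, "GSPO")
ogps = build_index(quads, "OGPS")

# Graph and subject known: a range scan on GSPO.
by_subject = prefix_scan(gspo, ("g1", "s1"))
# Only the object known: a range scan on OGPS.
by_object = prefix_scan(ogps, ("o1",))
```

A lookup that starts from the object would need a full scan on GSPO alone, which is the usual reason for keeping a second ordering.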
In comparison, Bigdata reported 200K triples-per-second for the first 8000 LUBM universities on a 15 blade box. We expect to do about that much on one new dual Xeon board; we’ll publish this when this is done.
We think that LUBM loading is not a realistic real-world benchmark, but since other people publish such numbers, so do we.
06/29/2009 12:12 GMT-0500 | Modified: 06/29/2009 12:22 GMT-0500
Comparing Virtuoso Performance on Different Processors
Over the years we have run Virtuoso on different hardware. We will here give a few figures that help identify the best price point for machines running Virtuoso.
Our test is very simple: Load 20 warehouses of TPC-C data, and then run one client per warehouse for 10,000 new orders. The way this is set up, disk I/O does not play a role and lock contention between the clients is minimal.
The test essentially has 20 server and 20 client threads running the same workload in parallel. The load time gives the single thread number; the 20 clients run gives the multi-threaded number. The test uses about 2-3 GB of data, so all is in RAM but is large enough not to be all in processor cache.
All times reported are real times, starting from the start of the first client and ending with the completion of the last client.
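The measurement protocol can be sketched as below: one client thread per warehouse, with the clock running from the start of the first client to the completion of the last. The `fake_new_order` function is a hypothetical stand-in for the real transaction, and the order count is scaled down; this is not the actual test driver.

```python
import threading
import time

def run_clients(n_clients, orders_per_client, new_order):
    """Run one client thread per warehouse; return (real_seconds, orders_done)."""
    done = [0] * n_clients           # each thread writes only its own slot

    def client(warehouse):
        for _ in range(orders_per_client):
            new_order(warehouse)
            done[warehouse] += 1

    threads = [threading.Thread(target=client, args=(w,)) for w in range(n_clients)]
    start = time.perf_counter()      # clock starts with the first client
    for t in threads:
        t.start()
    for t in threads:
        t.join()                     # clock stops when the last client completes
    return time.perf_counter() - start, sum(done)

def fake_new_order(warehouse):
    """Hypothetical stand-in for a TPC-C style new order transaction."""
    sum(range(100))

elapsed, total = run_clients(n_clients=20, orders_per_client=50,
                             new_order=fake_new_order)
print(f"{total} orders in {elapsed:.3f}s real time")
```

Because each client works on its own warehouse, the clients share almost no state, which mirrors the minimal lock contention described above.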
Do not confuse these results with official TPC-C. The measurement protocols are entirely incomparable.
| Test | Platform | Load (seconds) | Run (seconds) | GHz / cores / threads |
|---|---|---|---|---|
| 1 | Amazon EC2 Extra Large (4 virtual cores) | 340 | 42 | 1.2 GHz? / 4 / 1 |
| 1 | Amazon EC2 Extra Large (4 virtual cores) | 305 | 43.3 | 1.2 GHz? / 4 / 1 |
| 2 | 1 x dual-core AMD 5900 | 263 | 58.2 | 2.9 GHz / 2 / 1 |
| 3 | 2 x dual-core Xeon 5130 ("Woodcrest") | 245 | 35.7 | 2.0 GHz / 4 / 1 |
| 4 | 2 x quad-core Xeon 5410 ("Harpertown") | 237 | 18.0 | 2.33 GHz / 8 / 1 |
| 5 | 2 x quad-core Xeon 5520 ("Nehalem") | 162 | 18.3 | 2.26 GHz / 8 / 2 |
We tried two different EC2 instances to see if there would be variation. The variation was quite small. The tested EC2 instances cost 20 US cents per hour. The AMD dual-core costs 550 US dollars with 8 GB of RAM. The 3 Xeon configurations are Supermicro boards with 667 MHz memory for the Xeon 5130 ("Woodcrest") and Xeon 5410 ("Harpertown"), and 800 MHz memory for the Nehalem. The Xeon systems cost between 4000 and 7000 US dollars, with 5000 for a configuration with 2 x Xeon 5520 ("Nehalem"), 72 GB RAM, and 8 x 500 GB SATA disks.
Caveat: Due to slow memory (we could not get faster memory in the time available), the results for the Nehalem do not take full advantage of its principal edge over the previous generation, i.e., the memory subsystem. We will test again another time with faster memory.
The operating systems were various 64 bit Linux distributions.
We did some further measurements comparing Harpertown and Nehalem processors. The Nehalem chip was a bit faster for a slightly lower clock but we did not see any of the twofold and greater differences advertised by Intel.
We tried some RDF operations on the two last systems:
| operation | Harpertown | Nehalem |
|---|---|---|
| Build text index for DBpedia | 1080s | 770s |
| Entity Rank iteration | 263s | 251s |
Then we tried to see if the core multithreading of Nehalem could be seen anywhere. To this effect, we ran the Fibonacci function in SQL to serve as an example of an all in-cache integer operation. 16 concurrent operations took exactly twice as long as 8 concurrent ones, as expected.
For something that used memory, we took a count of RDF quads on two different indices, getting the same count. The database was a cluster setup with one process per core, so a count involved one thread per core. The counts in series took 5.02s and in parallel they took 4.27s.
Then we took a more memory intensive piece that read the RDF quads table in the order of one index and for each row checked that there is the equal row on another, differently-partitioned index. This is a cross-partition join. One of the indices is read sequentially and the other at random. The throughput can be reported as random-lookups-per-second. The data was English DBpedia, about 140M triples. One such query takes a couple of minutes with a 650% CPU utilization. Running multiple such queries should show effects of core multithreading since we expect frequent cache misses.
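The shape of this check can be sketched as follows: scan one index in key order and, for each row, probe a differently partitioned index for the matching row. The partitioning function, the partition count, and the synthetic quads are assumptions for illustration; the real test ran as a cross-partition join inside the database.

```python
import time

N_PARTITIONS = 8  # assumption: one partition per core

def partition_of(key):
    """Hypothetical partitioning function: hash of the leading key part."""
    return hash(key[0]) % N_PARTITIONS

def build_partitioned(quads, order):
    """Build one set of keys per partition, keyed in the given column order."""
    parts = [set() for _ in range(N_PARTITIONS)]
    for q in quads:
        key = tuple(q[i] for i in order)
        parts[partition_of(key)].add(key)
    return parts

def check_indices(quads):
    """Read GSPO sequentially and look up each row at random in OGPS."""
    ogps = build_partitioned(quads, order=(3, 0, 2, 1))   # (o, g, p, s)
    scan = sorted(quads)                                  # GSPO key order
    start = time.perf_counter()
    missing = 0
    for g, s, p, o in scan:                               # sequential read
        key = (o, g, p, s)                                # same row, OGPS order
        if key not in ogps[partition_of(key)]:            # random lookup
            missing += 1
    return len(scan), missing, time.perf_counter() - start

# Synthetic quads as (graph, subject, predicate, object).
quads = [(f"g{i % 3}", f"s{i}", f"p{i % 5}", f"o{i % 7}") for i in range(10_000)]
rows, missing, elapsed = check_indices(quads)
print(rows, missing)
```

Dividing `rows` by `elapsed` gives the random-lookups-per-second figure in the sense used here, though a Python loop says nothing about the absolute numbers reported below.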
- On the host OS of the Nehalem system:

| n | cpu% | rows per second |
|---|---|---|
| 1 query | 503 | 906,413 |
| 2 queries | 1263 | 1,578,585 |
| 3 queries | 1204 | 1,566,849 |
- In a VM under Xen, on the Nehalem system:

| n | cpu% | rows per second |
|---|---|---|
| 1 query | 652 | 799,293 |
| 2 queries | 1266 | 1,486,710 |
| 3 queries | 1222 | 1,484,093 |
- On the host OS of the Harpertown system:

| n | cpu% | rows per second |
|---|---|---|
| 1 query | 648 | 1,041,448 |
| 2 queries | 708 | 1,124,866 |
The CPU percentages are as reported by the OS: user + system CPU divided by real time.
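As a small illustration of this convention, the sketch below computes the percentage from user, system, and real time. The timings are invented to produce round numbers and are not measurements from the tests.

```python
def cpu_percent(user_s, system_s, real_s):
    """CPU utilization as defined above: (user + system) / real, in percent."""
    return 100.0 * (user_s + system_s) / real_s

# Hypothetical timings: 8 cores fully busy for 10 s of real time.
full = cpu_percent(user_s=76.0, system_s=4.0, real_s=10.0)
# Hypothetical timings giving the 650% quoted for a single join query.
single = cpu_percent(user_s=760.0, system_s=20.0, real_s=120.0)

print(full, single)   # 800.0 650.0
```

With this definition, values above 800% on the 8-core Nehalem are possible only because each core runs two hardware threads.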
So, Nehalem is in general somewhat faster, around 20-30%, than Harpertown. The effect of core multithreading can be noticed but is not huge, another 20% or so for situations with more threads than cores. The join where Harpertown did better could be attributed to its larger cache — 12 MB vs 8 MB.
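For readers who want the individual ratios behind this summary, the sketch below derives percentage differences directly from the tables above. The helper names are ours, and the comparison ignores the differing CPU utilization between runs, so it is a rough reading of the tables rather than an analysis.

```python
def gain_from_times(baseline_s, new_s):
    """Percent throughput gain when elapsed time drops from baseline_s to new_s."""
    return 100.0 * (baseline_s / new_s - 1.0)

def gain_from_rates(baseline_rate, new_rate):
    """Percent gain of new_rate over baseline_rate."""
    return 100.0 * (new_rate / baseline_rate - 1.0)

# Figures copied from the tables above (Harpertown as baseline, then Nehalem).
results = {
    "text index build": gain_from_times(1080, 770),
    "entity rank iteration": gain_from_times(263, 251),
    "join, 1 query": gain_from_rates(1_041_448, 906_413),
    "join, 2 queries": gain_from_rates(1_124_866, 1_578_585),
    # Nehalem host OS as baseline, Xen VM as the new value, 2 queries.
    "xen vs host, 2 queries": gain_from_rates(1_578_585, 1_486_710),
}

for name, pct in results.items():
    print(f"{name}: {pct:+.1f}%")
```

The negative value for the single-query join is the case mentioned above where Harpertown did better.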
We see that Xen has a measurable but not prohibitive overhead; count a little under 10% for everything, including tasks with no I/O. The VM was set up to have all CPU for the test and the queries did not do disk I/O.
The executables were compiled with gcc with default settings. Specifying -march=nocona (Core 2 target) dropped the cross-partition join time mentioned above from 128s to 122s on Harpertown. We did not try this on Nehalem but presume the effect would be the same, since the out-of-order unit is not much different. We did not do anything about process-to-memory affinity on Nehalem, which is a non-uniform architecture. We would expect this to increase performance since we have many equal size processes with even load.
The mainstay of the Nehalem value proposition is a better memory subsystem. Since the unit we got had 800 MHz memory, we did not see any great improvement. So if you buy Nehalem, you should make sure it comes with 1333 MHz memory; otherwise the best case will be no more than 50% better than a 667 MHz Core 2-based Xeon.
Nehalem remains a better deal for us because of more memory per board. One Nehalem box with 72 GB costs less than two Harpertown boxes with 32 GB and offers almost the same performance. Having a lot of memory in a small space is key. With faster memory, it might even outperform two Harpertown boxes, but this remains to be seen.
If space were not a constraint, we could make a cluster of 12 small workstations for the price of our largest system and get still more memory and more processor power per unit of memory. The Nehalem box was almost 4x faster than the AMD box but then it has 9x the memory, so the CPU to memory ratio might be better with the smaller boxes.
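The arithmetic behind the workstation comparison can be spelled out as follows. Prices and memory sizes are the ones quoted earlier in this post; taking 7000 US dollars as the price of the largest system is an assumption based on the upper end of the quoted range.

```python
amd_price_usd, amd_ram_gb = 550, 8    # AMD dual-core box, as quoted above
nehalem_ram_gb = 72                   # Nehalem configuration, as quoted above
largest_price_usd = 7000              # assumption: top of the 4000-7000 range

n_boxes = 12
cluster_price_usd = n_boxes * amd_price_usd   # 6600
cluster_ram_gb = n_boxes * amd_ram_gb         # 96
ram_ratio = nehalem_ram_gb / amd_ram_gb       # 9.0

print(cluster_price_usd, cluster_ram_gb, ram_ratio)
```

Under that assumption, twelve small boxes cost slightly less than the largest system and hold 96 GB in total against 72 GB on the Nehalem box.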
05/28/2009 10:54 GMT-0500 | Modified: 05/28/2009 11:15 GMT-0500
Social Web Camp (#5 of 5)
(Last of five posts related to the WWW 2009 conference, held the week of April 20, 2009.)
The social networks camp was interesting, with a special meeting around Twitter. Half jokingly, we (that is, the OpenLink folks attending) concluded that societies would never be completely classless, although mobility between, as well as criteria for membership in, given classes would vary with time and circumstance. Now, there would be a new class division between people for whom micro-blogging is obligatory and those for whom it is an option.
By my experience, a great deal is possible in a short time, but this possibility depends on focus and concentration. These are increasingly rare. I am a great believer in core competence and focus. This is not only for geeks — one can have a lot of breadth-of-scope but this too depends on not getting sidetracked by constant information overload.
Insofar as personal success depends on constant reaction to online social media, this comes at a cost in time and focus and this cost will have to be managed somehow, for example by automation or outsourcing. But if the social media is only automated fronts tweeting and retweeting among themselves, a bit like electronic trading systems do with securities, with or without human operators, the value of the medium decreases.
There are contradictory requirements. On one hand, what is said in electronic media is essentially permanent, so one had best only say things that are well considered. On the other hand, one must say these things without adequate time for reflection or analysis. To cope with this, one must have a well-rehearsed position that is compacted so that it fits in a short format and is easy to remember and unambiguous to express. A culture of pre-cooked fast-food advertising cuts down on depth. Real-world things are complex and multifaceted. Besides, prevalent patterns of communication train the brain for a certain mode of functioning. If we train for rapid-fire 140-character messaging, we optimize one side but probably at the expense of another. In the meantime, the world continues developing increased complexity by all kinds of emergent effects. Connectivity is good but don't get lost in it.
There is a CIA memorandum about how analysts misinterpret data and see what they want to see. This is a relevant resource for understanding some psychology of perception and memory. With the information overload, largely driven by user generated content, interpreting fragmented and variously-biased real-time information is not only for the analyst but for everyone who needs to intelligently function in cyber-social space.
I participated in discussions on security and privacy and on mobile social networks and context.
For privacy, the main thing turned out to be whether people should be protected from themselves. Should information expire? Will it get buried by itself under huge volumes of new content? Well, for purposes of visibility, it will certainly get buried and will require constant management to stay visible. But for purposes of future finding of dirt, it will stay findable for those who are looking.
There is also the corollary of setting security for resources, like documents, versus setting security for statements, i.e., structured data like social networks. As I have blogged before, policies à la SQL do not work well when schema is fluid and end-users can't be expected to formulate or understand these. Remember Ted Nelson? A user interface should be such that a beginner understands it in 10 seconds in an emergency. The user interaction question is how to present things so that the user understands who will have access to what content. Also, users should themselves be able to check what potentially sensitive information can be found out about them. A service along the lines of Garlik's DataPatrol should be a part of the social web infrastructure of the future.
People at MIT have developed AIR (Accountability In RDF) for expressing policies about what can be done with data and for explaining why access is denied if it is denied. However, if we at all look at the history of secrets, it is rather seldom that one hears that access to information about X is restricted to compartment so-and-so; it is much more common to hear that there is no X. I would say that a policy system that just leaves out information that is not supposed to be available will please the users more. This is not only so for organizations; it is fully plausible that an individual might not wish to expose even the existence of some selected inner circle of friends, their parties together, or whatever.
In conclusion, there is no self-evident solution for careless use of social media. A site that requires people to confirm multiple times that they know what they are doing when publishing a photo will not get much use. We will see.
For mobility, there was some talk about the context of usage. Again, this is difficult. For different contexts, one would for example disclose one's location at the granularity of the city; for some other purposes, one would say which conference room one is in.
Embarrassing social situations may arise if mobile devices are too clever: If information about travel is pushed into the social network, one would feel like having to explain why one does not call on such-and-such a person and so on. Too much initiative in the mobile phone seems like a recipe for problems.
There is a thin line between convenience and having IT infrastructure rule one's life. The complexities and subtleties of social situations ought not to be reduced to the level of if-then rules. People and their interactions are more complex than they themselves often realize. A system is not its own metasystem, as Gödel put it. Similarly, human self-knowledge, let alone knowledge about another, is by this very principle only approximate. Not to forget what psychology tells us about state-dependent recall and of how circumstance can evoke patterns of behavior before one even notices. The history of expert systems did show that people do not do very well at putting their skills in the form of if-then rules. Thus automating sociality past a certain point seems a problematic proposition.
04/30/2009 12:14 GMT-0500 | Modified: 04/30/2009 12:51 GMT-0500
Web Science and Keynotes at WWW 2009 (#4 of 5)
(Fourth of five posts related to the WWW 2009 conference, held the week of April 20, 2009.)
There was quite a bit of talk about what web science could or ought to be. I will here comment a bit on the panels and keynotes, in no special order.
In the web science panel, Tim Berners-Lee said that the deliverable of the web science initiative could be a way of making sense of all the world's data once the web had transformed into a database capable of answering arbitrary queries.
Michael Brodie of Verizon said that one deliverable would be a well considered understanding of the issue of counter-terrorism and civil liberties: Everything, including terrorism, operates on the platform of the web. How do we understand an issue that is not one of privacy, intelligence, jurisprudence, or sociology, but of all these and more?
I would add to this that it is not only a matter of governments keeping and analyzing vast amounts of private data, but of basically anybody who wants to do this being able to do so, even if at a smaller scale. In a way, the data web brings formerly government-only capabilities to the public, and is thus a democratization of intelligence and analytics. The citizen blogger increased the accountability of the press; the citizen analyst may have a similar effect. This is trickier though. We remember Jefferson's words about vigilance and the price of freedom. But vigilance is harder today, not because information is not there but because there is so much of it, with diverse spins put on it.
Tim B-L said at another panel that it seemed as if the new capabilities, especially the web as a database, were coming just in time to help us cope with the problems confronting the planet. With this, plus having everybody online, we would have more information, more creativity, more of everything at our disposal.
I'd have to say that the web is dual use: The bulk of traffic may contribute to distraction more than to awareness, but then the same infrastructure and the social behaviors it supports may also create unprecedented value and in the best of cases also transparency. I have to think of "For whosoever hath, to him shall be given." [Matthew 13:12] This can mean many things; here I am talking about whoever hath a drive for knowledge.
The web is both equalizing and polarizing: The equality is in the access; the polarity in the use made thereof. For a huge amount of noise there will be some crystallization of value that could not have arisen otherwise. Developments have unexpected effects. I would not have anticipated that gaming should advance supercomputing, for example.
Wendy Hall gave a dinner speech about communities and conferences; how the original hypertext conferences, with lots of representation of the humanities, became the techie WWW conference series; and how now we have the pendulum swinging back to more diversity with the web science conferences. So it is with life. Aside from the facts that there are trends and pendulum effects, and that paths that cross usually cross again, it is very hard to say exactly how these things play out.
At the "20 years of web" panel, there was a round of questions on how different people had been surprised by the web. Surprises ranged from the web's actual scalability to its rapid adoption and the culture of "if I do my part, others will do theirs." On the minus side, the emergence of spam and phishing were mentioned as unexpected developments.
Questions of simplicity and complexity got a lot of attention, along with network effects. When things hit the right simplicity at the right place (e.g., HTML and HTTP, which hypertext-wise were nothing special), there is a tipping point.
No barrier of entry, not too much modeling, was repeated quite a bit, also in relation to semantic web and ontology design. There is a magic of emergent effects when the pieces are simple enough: Organic chemistry out of a couple of dozen elements; all the world's information online with a few tags of markup and a couple of protocol verbs. But then this is where the real complexity starts — one half of it in the transport, the other in the applications, yet a narrow interface between the two.
This then raises the question of content- and application-aware networks. The preponderance of opinion was for separation of powers: keep carriers and content apart.
Michael Brodie commented in the questions to the first panel that simplicity was greatly overrated, that the world was in fact very complex. It seems to me that any field of human endeavor develops enough complexity to fully occupy the cleverest minds who undertake said activity. The life-cycle between simplicity and complexity seems to be a universal feature. It is a bit like the Zen idea that "for the beginner, rivers are rivers and mountains are mountains, for the student these are imponderable mysteries of bewildering complexity and transcendent dimension but for the master these are again rivers and mountains." One way of seeing this is that the master, in spite of the actual complexity and interrelatedness of all things, sees where these complexities are significant and where not and knows to communicate concerning these as fits the situation.
There is no fixed formula for saying where complexities and simplicities fit; relevance of detail is forever contextual. For technological systems, we find that there emerge relatively simple interfaces on either side of which there is huge complexity: the x86 instruction set, TCP/IP, and SQL, to name a few. These are lucky breaks; it is very hard to say beforehand where these will emerge. Object-oriented people would like to see such interfaces everywhere, which just leads to problems of modeling.
There was a keynote from Telefonica about infrastructure. We heard that the power and cooling cost more than the equipment, that data centers ought to be scaled down from the football stadium and 20 megawatt scale, and that systems must be designed for partitioning, to name a few topics. This is all well accepted. The new question is whether storage should go into the network infrastructure. We have blogged that the network will be the database, and it is no surprise that a telco should have the same idea, just with slightly different emphasis and wording. For Telefonica, this is about efficiency of bulk delivery; for us, this is more about virtualized query-able dataspaces. Both will be distributed, but issues of separation of powers may keep the two roles of network with storage separate.
In conclusion, the network being the database was much more visible and accepted this year than last. The linked data web was in Tim B-L's keynote as it was in the opening speech by the Prince of Asturias.
04/30/2009 12:00 GMT-0500 | Modified: 04/30/2009 12:11 GMT-0500
Short Recap of Virtuoso Basics (#3 of 5)
(Third of five posts related to the WWW 2009 conference, held the week of April 20, 2009.)
There are some points that came up in conversation at WWW 2009 that I will reiterate here. We find there is still some lack of clarity in the product image, so I will here condense it.
Virtuoso is a DBMS. We pitch it primarily to the data web space because this is where we see the emerging frontier. Virtuoso does both SQL and SPARQL and can do both at large scale and high performance. The popular perception of RDF and Relational models as mutually exclusive and antagonistic poles is based on the poor scalability of early RDF implementations. What we do is to have all the RDF specifics, like IRIs and typed literals as native SQL types, and to have a cost based optimizer that knows about this all.
If you want application-specific data structures as opposed to a schema-agnostic quad-store model (triple + graph-name), then Virtuoso can give you this too. Rendering application-specific data structures as RDF applies equally to relational data in non-Virtuoso databases because Virtuoso SQL can federate tables from heterogeneous DBMS.
On top of this, there is a web server built in, so that no extra server is needed for web services, web pages, and the like.
Installation is simple: just one executable and one config file. There is a huge amount of code in the installers (application code, test suites, and such), but none of this is needed when you deploy. Scale goes from a 25 MB memory footprint on the desktop to hundreds of gigabytes of RAM and endless terabytes of disk on shared-nothing clusters.
Clusters (coming in Release 6) and SQL federation are commercial only; the rest can be had under GPL.
To condense further:
- Scalable Delivery of Linked Data
- SPARQL and SQL
- Arbitrary RDF Data + Relational
- Also From 3rd Party RDBMS
- Easy Deployment
- Standard Interfaces
04/30/2009 11:49 GMT-0500 | Modified: 04/30/2009 12:11 GMT-0500
Search at WWW 2009 (#2 of 5)
(Second of five posts related to the WWW 2009 conference, held the week of April 20, 2009.)
There was a workshop on semantic search plus a number of papers and of course keynotes from Google and Yahoo.
A general topic was the use of and access to query logs. Are these the monopoly of GYM (Google, Yahoo, Microsoft) or should they be made more generally available? This is a privacy question. Use of query logs and click through of search results for improved ranking was mentioned many times throughout the conference.
The semantic search workshop was largely about benchmarks for keyword search in information retrieval. For linked data, which is a database proposition, these benchmarks are not really applicable. For document search aided by semantics derived by NLP, these are of course applicable. But there is a divide in approach.
Giovanni Tummarello presented Sig.ma, a service using Sindice's RDF index for collecting all RDF statements about entities matching some set of keywords. One could then choose which sources and which entities were the right ones. One could further store such a query and embed it on a page. The point was that the filtering done manually could be persisted and republished, so as to create dynamic content aggregated from selected live sources. Further speculating, one could use such user feedback for adjusting ranking, even though Sig.ma did not. We may adopt the idea of manually excluding sources into our browser too. Fresnel lenses are another thing to look at.
There was a paper by Josep M. Pujol and Pablo Rodriguez, of Telefonica Research, about returning search to the people by means of Porqpine, a peer-to-peer search implementation based on sharing search results from search engines among peers and indexing them locally as they were retrieved. For users with similar interests, this can give a community-based ranking model but has issues of privacy. Another point was that, with local processing and personal-scale data volumes, various kinds of brute-force processing are feasible that would cost a lot at web scale. Much can be done at web scale, but it must be done cleverly, not with a shell script and not so ad hoc.
As a counterpoint to this, there was a talk about Hadoop and Hive, a map-reduce-based SQL-like framework. One could do an SQL GROUP BY on text files with record parsing at run time, all spread over a Hadoop cluster. The issue is, if you have a petabyte of data, you may wish to run more than one ad hoc query on it. This means that joining between partitions and complex processing becomes important. This cannot be done without indices and complex query optimization, and needs a DBMS. Stonebraker and company are fully justified in their critique of map reduce. It looks like each generation must get dazzled by the oversimplified and then retrace the same discoveries of complexity as the previous one.
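To make the map-reduce style of GROUP BY concrete, here is a minimal single-process sketch: records are parsed from raw text at run time, routed to reducers by key, and summed per key. The record layout and function names are invented for illustration; this is not Hadoop or Hive code.

```python
from collections import defaultdict

# Hypothetical tab-separated log records: user, country, bytes.
lines = [
    "alice\tUS\t120",
    "bob\tFI\t300",
    "carol\tUS\t80",
    "dave\tFI\t50",
    "erin\tNL\t10",
]

def map_phase(chunk):
    """Parse raw text at run time and emit (key, value) pairs."""
    for line in chunk:
        _user, country, nbytes = line.split("\t")
        yield country, int(nbytes)

def shuffle(pairs, n_reducers):
    """Route each key to a reducer by hash, as a cluster would across nodes."""
    buckets = [defaultdict(list) for _ in range(n_reducers)]
    for key, value in pairs:
        buckets[hash(key) % n_reducers][key].append(value)
    return buckets

def reduce_phase(bucket):
    """Aggregate per key: the equivalent of SUM(bytes) ... GROUP BY country."""
    return {key: sum(values) for key, values in bucket.items()}

chunks = [lines[:3], lines[3:]]                        # two input splits
pairs = [kv for chunk in chunks for kv in map_phase(chunk)]
totals = {}
for bucket in shuffle(pairs, n_reducers=2):
    totals.update(reduce_phase(bucket))

print(sorted(totals.items()))   # [('FI', 350), ('NL', 10), ('US', 200)]
```

Every query in this style re-reads and re-parses all input, which is the cost that indices and a query optimizer are meant to avoid when many ad hoc queries run over the same data.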
Some of our future plans were confirmed by what we saw, for example as concerns:
- Interactively selecting sources for search, showing the graphs, then interactively refining
- More social networks, more network analysis, and more work on social recommendation
- Real time indexing of new pings, filling the store by forwarding queries to search engines, and harvesting micro-formats from results
- Using entity extraction
These are all items in the pipeline, easy to do on top of the existing platform. For the machine learning and NLP parts, we will partner with others; details will be worked out while we work on the items we implement by ourselves.
04/30/2009 11:18 GMT-0500 | Modified: 04/30/2009 12:51 GMT-0500
Linked Data at WWW 2009 (#1 of 5)
(First of five posts related to the WWW 2009 conference, held the week of April 20, 2009.)
We gave a talk at the Linked Open Data workshop, LDOW 2009, at WWW 2009. I did not go very far into the technical points in the talk, as there was almost no time and the points are rather complex. Instead, I emphasized what new things had become possible with recent developments.
The problem we do not cease hearing about is scale. We have solved most of it. There is scale in the schema: Put together, ontologies go over a million classes/properties. Which ones are relevant depends, and the user should have the choice. The instance data is in the tens of billions of triples, much derived from Web 2.0 sources but also much published as RDF.
To make sense of this all, we need quick summaries and search. Without navigation via joins, the value will be limited. Fast joining, counting, grouping, and ranking are key.
People will use different terms for the same thing. The issue of identity is philosophical. In order to do reasoning one needs strong identity; a statement like x is a bit like y is not very useful in a database context. Whether any x and y can be considered the same depends on the context. So leave this for query time. The conditions under which two people are considered the same will depend on whether you are doing marketing analysis or law enforcement. A general purpose data store cannot anticipate all the possibilities, so smush on demand, as you go, as has been said many times.
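The idea of deciding identity at query time can be sketched as follows: the same records are grouped differently depending on which sameness criterion the query supplies. The records, the two criteria, and the `smush` helper are all hypothetical; this is not how Virtuoso implements it.

```python
# Hypothetical person records from different sources.
people = {
    "a": {"name": "J. Smith", "email": "js@example.org", "passport": "X1"},
    "b": {"name": "John Smith", "email": "js@example.org", "passport": None},
    "c": {"name": "John Smith", "email": "john@example.com", "passport": "X1"},
    "d": {"name": "John Smith", "email": "other@example.net", "passport": "Z9"},
}

def same_for_marketing(p, q):
    """Loose criterion: same email or same name is close enough."""
    return p["email"] == q["email"] or p["name"] == q["name"]

def same_for_law_enforcement(p, q):
    """Strict criterion: only a shared passport number counts."""
    return p["passport"] is not None and p["passport"] == q["passport"]

def smush(records, same):
    """Group record ids into identity clusters under the given criterion."""
    parent = {k: k for k in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    ids = sorted(records)
    for i, x in enumerate(ids):
        for y in ids[i + 1:]:
            if same(records[x], records[y]):
                parent[find(x)] = find(y)   # merge the two clusters

    clusters = {}
    for k in ids:
        clusters.setdefault(find(k), []).append(k)
    return sorted(clusters.values())

marketing = smush(people, same_for_marketing)
strict = smush(people, same_for_law_enforcement)
print(marketing)   # [['a', 'b', 'c', 'd']]
print(strict)      # [['a', 'c'], ['b'], ['d']]
```

The stored data is the same in both cases; only the criterion passed in at query time changes, which is the point of smushing on demand.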
Against this backdrop, we offer a solution with which anybody who so chooses can play with big data, whether a search or analytics player.
We are going in the direction of more and more ad hoc processing at larger and larger scale. With good query parallelization, we can do big joins without complex programming. No explicit Map Reduce jobs or the like. What was done with special code and special parallel programming models can now be done in SQL and SPARQL.
To showcase this, we do linked data search, browsing, and so on, but are essentially a platform provider.
Entry costs into relatively high end databases have dropped significantly. A cluster with 1 TB of RAM sells for $75K or so at today's retail prices and fits under a desk. For intermittent use, the rent for 1TB RAM is $1228 per day on EC2. With this on one side and Virtuoso on the other, a lot that was impractical in the past is now within reach. Like Giovanni Tummarello put it for airplanes, the physics are as they were for da Vinci but materials and engines had to develop a bit before there was commercial potential. So it is also with analytics for everyone.
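A rough break-even between buying and renting follows from the two prices quoted above. The sketch ignores power, hosting, staff, and depreciation, so it is only an order-of-magnitude guide.

```python
purchase_usd = 75_000      # quoted price of a cluster with 1 TB of RAM
rent_usd_per_day = 1_228   # quoted EC2 rent for 1 TB of RAM

# Days of continuous rental that cost as much as the purchase.
break_even_days = purchase_usd / rent_usd_per_day

print(round(break_even_days, 1))   # 61.1
```

At these prices, renting is the cheaper option only for workloads that need the memory for less than about two months in total.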
A remark from the audience was that all the stuff being shown, not limited to Virtuoso, was non-standard, having to do with text search, with ranking, with extensions, and was in fact not SPARQL and pure linked data principles. Further, by throwing this all together, one got something overcomplicated, too heavy.
I answered as follows, which apparently cannot be repeated too much:
First, everybody expects a text search box, and is conditioned to having one. No text search and no ranking is a non-starter. Ceterum censeo, for database, the next generation cannot be less expressive than the previous. All of SQL and then some is where SPARQL must be. The barest minimum is being able to say anything one can say in SQL, and then justify SPARQL by saying that it is better for heterogeneous data, schema last, and so on. On top of this, transitivity and rules will not hurt. For now, the current SPARQL working group will at least reach basic SQL parity; the edge will still remain implementation dependent.
Another remark was that joining is slow. Depends. Anything involving more complex disk access than linear reading of a blob is generally not good for interactive use. But with adequate memory, and with all hot spots in memory, we do some 3.2 million random-accesses-per-second on 12 cores, with easily 80% platform utilization for a single large query. The high utilization means that times drop as processing gets divided over more partitions.
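To put the quoted throughput in per-core terms, the sketch below divides it across the 12 cores. It treats all cores as evenly loaded, which is an assumption; the 80% utilization figure is not factored in.

```python
lookups_per_s = 3_200_000   # quoted aggregate random accesses per second
cores = 12                  # quoted core count

per_core_per_s = lookups_per_s / cores
# Wall-clock time one core has available per lookup, in microseconds.
core_time_per_lookup_us = 1e6 / per_core_per_s

print(round(per_core_per_s))             # 266667
print(round(core_time_per_lookup_us, 2)) # 3.75
```

A few microseconds per lookup is consistent with an in-memory working set; a single disk seek would take on the order of milliseconds.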
There was a talk about MashQL by Mustafa Jarrar, concerning an abstraction on top of SPARQL for easy composition of tree-structured queries. The idea was that such queries can be evaluated "on the fly" as they are being composed. As it happens, we already have an XML-based query abstraction layer incorporated into Virtuoso 6.0's built-in Faceted Data Browser Service, and the effects are probably quite similar. The most important point here is that by using XML, both of these approaches are interoperable against a Virtuoso back-end. Along similar lines, we did not get to talk to the G Facets people but our message to them is the same: Use the faceted browser service to get vastly higher performance when querying against Linked Data, be it DBpedia or the entire LOD Cloud. Virtuoso 6.0 (Open Source Edition) "TP1" is now publicly available as a Technology Preview (beta).
We heard that there is an effort for porting Freebase's Parallax to SPARQL. The same thing applies to this. With a number of different data viewers on top of SPARQL, we come closer to broad-audience linked-data applications. These viewers are still too generic for the end user, though. We fully believe that for both search and transactions, application-domain-specific workflows will stay relevant. But these can be made to a fair degree by specializing generic linked-data-bound controls and gluing them together with some scripting.
As said before, the application will interface the user to the vocabulary. The vocabulary development takes the modeling burden from the application and makes for interchangeable experience on the same data. The data in turn is "virtualized" into the database cloud or the local secure server, as the use case may require.
For ease of adoption, open competition, and safety from lock-in, the community needs a SPARQL whose usability is not totally dependent on vendor extensions. But we might de facto have that in just a bit, whenever there is a working draft from the SPARQL WG.
Another topic that we encounter often is the question of integration (or lack thereof) between communities. For example, database conferences reject semantic web papers and vice versa. Such politics would seem to emerge naturally but are nonetheless detrimental. We really should partner with people who write papers as their principal occupation. We ourselves do software products and use very little time for papers, so some of the bad reviews we have received do make a legitimate point. By rights, we should go for database venues but we cannot have this take too much time. So we are open to partnering for splitting the opportunity cost of multiple submissions.
For future work, there is nothing radically new. We continue testing and productization of cluster databases; the plan is simply to deliver what is in the pipeline. The essential nature of this is adding more and more cases of better and better parallelization in different query situations. The present usage patterns work well for finding bugs and performance bottlenecks. For presentation, our goal is to have third party viewers operate with our platform. We cannot completely leave data browsing and UI to third parties, since from time to time we must introduce various unique functionality. Most interaction should however go via third party applications.
|
04/27/2009 17:28 GMT-0500
|
Modified:
04/28/2009 11:27 GMT-0500
|
Web Scale and Fault Tolerance
One concern about Virtuoso Cluster is fault tolerance. This post talks about the basics of fault tolerance and what we can do with this, from improving resilience and optimizing performance to accommodating bulk loads without impacting interactive response. We will see that this is yet another step towards a 24/7 web-scale Linked Data Web. We will see how large scale, continuous operation, and redundancy are related.
It has been said many times — when things are large enough, failures become frequent. In view of this, basic storage of partitions in multiple copies is built into the Virtuoso cluster from the start. Until now, this feature has not been tested or used very extensively, aside from the trivial case of keeping all schema information in synchronous replicas on all servers.
Approaches to Fault Tolerance
Fault tolerance has many aspects but it starts with keeping data in at least two copies. There are shared-disk cluster databases like Oracle RAC that do not depend on partitioning. With these, as long as the disk image is intact, servers can come and go. The fault tolerance of the disk in turn comes from mirroring done by the disk controller. RAID levels other than mirroring are not really good for databases because of write speed.
With shared-nothing setups like Virtuoso, fault tolerance is based on multiple servers keeping the same logical data. The copies are synchronized transaction-by-transaction but are not bit-for-bit identical nor write-by-write synchronous as is the case with mirrored disks.
There are asynchronous replication schemes generally based on log shipping, where the replica replays the transaction log of the master copy. The master copy gets the updates, the replica replays them. Both can take queries. These do not guarantee an entirely ACID fail-over but for many applications they come close enough.
In a tightly coupled cluster, it is possible to do synchronous, transactional updates on multiple copies without great added cost. Sending the message to two places instead of one does not make much difference since it is the latency that counts. But once we go to wide area networks, this becomes as good as unworkable for any sort of update volume. Thus, wide area replication must in practice be asynchronous.
This is a subject for another discussion. For now, the short answer is that wide area log shipping must be adapted to the application's requirements for synchronicity and consistency. Also, exactly what content is shipped and to where depends on the application. Some application-specific logic will likely be involved; more than this one cannot say without a specific context.
Basics of Partition Fail-Over
For now, we will be concerned with redundancy protecting against broken hardware, software slowdown, or crashes inside a single site.
The basic idea is simple: Writes go to all copies; reads that must be repeatable or serializable (i.e., locking) go to the first copy; reads that refer to committed state without guarantee of repeatability can be balanced among all copies. When a copy goes offline, nobody needs to know, as long as there is at least one copy online for each partition. The exception in practice is when there are open cursors or such stateful things as aggregations pending on a copy that goes down. Then the query or transaction will abort and the application can retry. This looks like a deadlock to the application.
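The routing rule above can be sketched in a few lines. This is only an illustration under stated assumptions, not Virtuoso's implementation: the class and method names are hypothetical, and the "balancing" of committed reads is modeled as a plain round-robin. Writes are shown as failing while any copy is offline, which is the initial behavior before writes are re-enabled on the surviving copies.

```python
import itertools
from enum import Enum


class ReadMode(Enum):
    LOCKING = "locking"      # repeatable or serializable reads
    COMMITTED = "committed"  # committed state, no repeatability guarantee


class Partition:
    """One logical partition kept in several copies (hypothetical model)."""

    def __init__(self, copies):
        self.copies = list(copies)      # ordered; copies[0] is the "first copy"
        self.online = set(self.copies)
        self._next = itertools.count()  # round-robin counter for balanced reads

    def write_targets(self):
        # Writes go to all copies; with a copy offline the write fails.
        if self.online != set(self.copies):
            raise RuntimeError("a copy is offline; write cannot reach all copies")
        return list(self.copies)

    def read_target(self, mode):
        live = [c for c in self.copies if c in self.online]
        if not live:
            raise RuntimeError("no copy of this partition is online")
        if mode is ReadMode.LOCKING:
            return live[0]              # first copy that is still online
        return live[next(self._next) % len(live)]  # balance over live copies


p = Partition(["server1", "server2"])
print(p.write_targets())
print(p.read_target(ReadMode.LOCKING))
print([p.read_target(ReadMode.COMMITTED) for _ in range(4)])
```

Sending locking reads to a single copy keeps all locks for a partition in one place, while reads of committed state can use every copy's CPU and cache.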
Coming back online is more complicated. This requires establishing that the recovering copy is actually in sync. In practice this requires a short window during which no transactions have uncommitted updates. Sometimes, forcing this can require aborting some transactions, which again looks like a deadlock to the application.
When an error is seen, such as a process no longer accepting connections and dropping existing cluster connections, we in practice go via two stages. First, the operations that directly depended on this process are aborted, as well as any computation being done on behalf of the disconnected server. At this stage, attempting to read data from the partition of the failed server will go to another copy but writes will still try to update all copies and will fail if the failed copy continues to be offline. After it is established that the failed copy will stay off for some time, writes may be re-enabled — but now having the failed copy rejoin the cluster will be more complicated, requiring an atomic window to ensure sync, as mentioned earlier.
For the DBA, there can be intermittent software crashes where a failed server automatically restarts itself, and there can be prolonged failures where this does not happen. Both are alerts but the first kind can wait. Since a system must essentially run itself, it will wait for some time for the failed server to restart itself. During this window, all reads of the failed partition go to the spare copy and writes give an error. If the failed server does not come back up in time, the system will automatically re-enable writes on the spare but now the failed server may no longer rejoin the cluster without a complex sync cycle. This all can happen in well under a minute, faster than a human operator can react. The diagnostics can be done later.
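The two-stage reaction to a failure can be written down as a small state function. This is a sketch only: the phase names and the 30-second grace period are assumptions made for illustration (the text says only that the whole sequence takes well under a minute).

```python
from enum import Enum


class Phase(Enum):
    NORMAL = "all copies online"
    WAITING = "reads go to the surviving copy, writes are rejected"
    WRITES_REENABLED = "surviving copy is writable, failed copy needs a sync cycle"


def phase(seconds_since_failure, restarted_after=None, grace_seconds=30.0):
    """Phase of a partition some time after one of its copies failed.

    restarted_after: seconds the failed server took to restart itself,
                     or None if it has not restarted.
    grace_seconds:   hypothetical wait before writes are re-enabled.
    """
    came_back_in_time = (
        restarted_after is not None
        and restarted_after <= min(seconds_since_failure, grace_seconds)
    )
    if came_back_in_time:
        return Phase.NORMAL
    if seconds_since_failure < grace_seconds:
        return Phase.WAITING
    return Phase.WRITES_REENABLED


def writes_allowed(current):
    # Writes are refused only while waiting for the failed server to restart.
    return current is not Phase.WAITING


for t, restart in [(10, None), (10, 5), (45, None), (45, 40)]:
    current = phase(t, restart)
    print(t, restart, current.name, writes_allowed(current))
```

A server that restarts only after the grace period finds writes already re-enabled on the surviving copy, so it cannot simply rejoin and needs the sync cycle described above.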
If the situation was a hardware failure, recovery consists of taking a spare server and copying the database from the surviving online copy. This done, the spare server can come on line. Copying the database can be done while online and accepting updates but this may take some time, maybe an hour for every 200G of data copied over a network. In principle this could be automated by scripting, but we would normally expect a human DBA to be involved.
As a general rule, reacting to the failure goes automatically without disruption of service but bringing the failed copy online will usually require some operator action.
Levels of Tolerance and Performance
The only way to make failures totally invisible is to have all in duplicate and provisioned so that the system never runs at more than half the total capacity. This is often not economical or necessary. This is why we can do better, using the spare capacity for more than standby.
Imagine keeping a repository of linked data. Most of the content will come in through periodic bulk replacement of data sets. Some data will come in through pings from applications publishing FOAF and similar. Some data will come through on-demand RDFization of resources.
The performance of such a repository essentially depends on having enough memory. Having this memory in duplicate is just added cost. What we can do instead is have all copies store the whole partition but when routing queries, apply range partitioning on top of the basic hash partitioning. If one partition stores IDs 64K - 128K, the next partition 128K - 192K, and so forth, and all partitions are stored in two full copies, we can route reads to the first 32K IDs to the first copy and reads to the second 32K IDs to the second copy. In this way, the copies will keep different working sets. The RAM is used to full advantage.
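As a minimal sketch of this read routing, using the numbers of the example: each partition covers 64K consecutive IDs and is stored in two full copies, and the lower and upper 32K of each range are read from different copies. The function names are hypothetical, and a plain range is used as a stand-in for the real partitioning function.

```python
PARTITION_SPAN = 64 * 1024  # IDs per partition, as in the example above


def partition_of(entity_id):
    """Partition holding this ID: 0 for 0-64K, 1 for 64K-128K, and so on."""
    return entity_id // PARTITION_SPAN


def preferred_read_copy(entity_id, n_copies=2):
    """Copy that should serve reads of this ID during normal operation."""
    offset = entity_id % PARTITION_SPAN
    return offset * n_copies // PARTITION_SPAN


def read_copy(entity_id, online_copies, n_copies=2):
    """Use the preferred copy if it is online, else any surviving copy."""
    preferred = preferred_read_copy(entity_id, n_copies)
    if preferred in online_copies:
        return preferred
    if not online_copies:
        raise RuntimeError("no copy of this partition is online")
    return min(online_copies)


for entity_id in (70_000, 100_000):
    print(entity_id, partition_of(entity_id), preferred_read_copy(entity_id))
```

Both copies still hold the whole partition, so a read can always fall back to the other copy; the split only decides which pages each copy keeps warm in memory.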
Of course, if there is a failure, then the working set will degrade, but if this is not often and not for long, this can be quite tolerable. The alternate expense is buying twice as much RAM, likely meaning twice as many servers. This workload is memory intensive, thus servers should have the maximum memory they can have without going to parts that are so expensive one gets a new server for the price of doubling memory.
Background Bulk Processing
When loading data, the system is online in principle, but query response can be quite bad. A large RDF load will involve most memory and queries will miss the cache. The load will further keep most disks busy, so response is not good. This is the case as soon as a server's partition of the database is four times the size of RAM or greater. Whether the work is bulk-load or bulk-delete makes little difference.
But if partitions are replicated, we can temporarily split the database so that the first copies serve queries and the second copies do the load. If the copies serving on line activities do some updates also, these updates will be committed on both copies. But the load will be committed on the second copy only. This is fully appropriate as long as the data are different. When the bulk load is done, the second copy of each partition will have the full up to date state, including changes that came in during the bulk load. The online activity can be now redirected to the second copies and the first copies can be overwritten in the background by the second copies, so as to again have all data in duplicate.
Failures during such operations are not dangerous. If the copies doing the bulk load fail, the bulk load will have to be restarted. If the front end copies fail, the front end load goes to the copies doing the bulk load. Response times will be bad until the bulk load is stopped, but no data is lost.
This technique applies to all data intensive background tasks — calculation of entity search ranks, data cleansing, consistency checking, and so on. If two copies are needed to keep up with the online load, then data can be kept just as well in three copies instead of two. This method applies to any data-warehouse-style workload which must coexist with online access and occasional low volume updating.
Configurations of Redundancy
Right now, we can declare that two or more server processes in a cluster form a group. All data managed by one member of the group is stored by all others. The members of the group are interchangeable. Thus, if there is four-servers-worth of data, then there will be a minimum of eight servers. Each of these servers will have one server process per core. The first hardware failure will not affect operations. For the second failure, there is a 1/7 chance that it stops the whole system, namely if it falls on the server whose pair is down. If groups consist of three servers, for a total of 12, the first two failures are guaranteed not to interrupt operations; for the third, there is at most a 1/10 chance that it will, and only if the first two failures fell within the same group.
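These odds can be checked by brute force. The sketch below assumes that every server is equally likely to fail and that failures are independent; it enumerates all sets of failed servers and counts those that wipe out a complete group. The function name is hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def outage_probability(n_groups, group_size, n_failures):
    """Probability that n_failures random server failures take down
    every member of at least one group."""
    servers = [(g, m) for g in range(n_groups) for m in range(group_size)]
    total = 0
    fatal = 0
    for failed in combinations(servers, n_failures):
        total += 1
        down_per_group = {}
        for group, _member in failed:
            down_per_group[group] = down_per_group.get(group, 0) + 1
        if any(count == group_size for count in down_per_group.values()):
            fatal += 1
    return Fraction(fatal, total)


print(outage_probability(4, 2, 2))  # 8 servers in pairs, two failures
print(outage_probability(4, 3, 3))  # 12 servers in triples, three failures
```

For pairs this gives 1/7, as stated. For groups of three it gives 1/55 over all possible sequences of three failures; the 1/10 figure is the conditional case in which the first two failures have already hit the same group, leaving one critical server among the ten that remain.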
We note that for big databases, as said before, the RAM cache capacity is the sum of all the servers' RAM when in normal operation.
There are other, more dynamic ways of splitting data among servers, so that partitions migrate between servers and spawn extra copies of themselves if not enough copies are online. The Google File System (GFS) does something of this sort at the file system level; Amazon's Dynamo does something similar at the database level. The analogies are not exact, though.
If data is partitioned in this manner, for example into 1K slices, each in duplicate, with the rule that the two duplicates will not be on the same physical server, the first failure will not break operations but the second probably will. Without extra logic, the partitions formerly hosted by the failed server will most likely have their second copies spread randomly over the remaining servers, so that nearly any second failure takes out the last copy of some slice. This scheme equalizes load better but is less resilient.
Maintenance and Continuity
Databases may benefit from defragmentation, rebalancing of indices, and so on. While these are possible online, by definition they affect the working set and make response times quite bad as soon as the database is significantly larger than RAM. With duplicate copies, the problem is largely solved. Also, software version changes need not involve downtime.
Present Status
The basics of replicated partitions are operational. The items to finalize are about system administration procedures and automatic synchronization of recovering copies. This must be automatic because if it is not, the operator will find a way to forget something or do some steps in the wrong order. This also requires a management view that shows what the different processes are doing and whether something is hung or failing repeatedly. All this is for the recovery part; taking failed partitions offline is easy.
|
04/01/2009 10:18 GMT-0500
|
Modified:
04/01/2009 11:18 GMT-0500
|
Beyond Applications - Introducing the Planetary Datasphere (Part 2)
We have looked at the general implications of the DataSphere, a universal, ubiquitous database infrastructure, on end-user experience and application development and content. Now we will look at what this means at the back end, from hosting to security to server software and hardware.
Application Hosting
For the infrastructure provider, hosting the DataSphere is no different from hosting large Web 2.0 sites. This may be paid for by users, as in the cloud computing model where users rent capacity for their own purposes, or by advertisers, as in most of Web 2.0.
Clouds play a role in this as places with high local connectivity. The DataSphere is the atmosphere; the Cloud is an atmospheric phenomenon.
What of Proprietary Data and its Security?
Having proprietary data does not imply using a proprietary language. I would say that for any domain of discourse, no matter how private or specialized, at least some structural concepts can be borrowed from public, more generic sources. This lowers training thresholds and facilitates integration. Being able to integrate does not imply opening one's own data. To take an analogy, if you have a bunker with closed circuit air recycling, you still breathe air, even if that air is cut off from the atmosphere at large. For places with complex existing RDBMS security, the best is to map the RDBMS to RDF on the fly, always running all requests through the RDBMS. This implicitly preserves any policy or label based security schemes.
What of Individual Privacy on the Open Web?
The more complex situations will be found in environments with mixed security needs, as in social networking with partly-open and partly-closed profiles. The FOAF+SSL solution with https:// URIs is one approach. For query processing, we have a question of enforcing instance-level policies. In the DataSphere, granting privileges on tables and views no longer makes sense. In SQL, a policy means that behind the scenes the DBMS will add extra criteria to queries and updates depending on who is issuing them. The query processor adds conditions like getting the user's department ID and comparing it to the department ID on the payroll record. Labeled security is a scheme where data rows themselves contain security tags and the DBMS enforces these, row by row.
I would say that these techniques are suited for highly-structured situations where the roles, compartments, and needs are clear, and where the organization has the database know-how to write, test, and deploy such rules by the table, row, and column. This does not sit well with schema-last. I would not bet much on an average developer's capacity for making airtight policies on RDF data where not even 100% schema-adherence is guaranteed.
Doing security at the RDF graph level seems more appropriate. In many use cases, the graph is analogous to a photo album or a file system directory. A Data Space can be divided into graphs to provide more granularity for expressing topic, provenance, or security. If policy conditions apply mostly to the graph, then things are not as likely to slip by, for example, policy rules missing some infrequent misuse of the schema. In these cases, the burden on the query processor is also not excessive: Just as with documents, the container (table, graph) is the object of access grants, not the individual sentences (DBMS records, RDF triples) in the document.
It is left to the application to present a choice of graph level policies to the user. Exactly what these will be depends on the domain of discourse. A policy might restrict access to a meeting in a calendar to people whose OpenIDs figure in the attendee list, or limit access to a photo album to people mentioned in the owner's social network. Defining such policies is typically a task for the application developer.
The difference between the Document Web and the Linked Data Web is that while the Document Web enforces security when a thing is returned to the user, Linked Data Web enforcement must occur whenever a query references something, even if this is an intermediate result not directly shown to the user.
The DataSphere will offer a generic policy scheme, filtering what graphs are accessed in a given query situation. Other applications may then verify the safety of one's disclosed information using the same DataSphere infrastructure. Of course, the user must rely on the infrastructure provider to correctly enforce these rules. Then again, some users will operate and audit their own infrastructure anyway.
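A graph-level policy filter of this kind can be sketched as follows. Everything here is hypothetical and for illustration only: data is modeled as plain (graph, subject, predicate, object) tuples, a policy is a function from user to yes/no, and the identifiers are made up. The point is that the filter is applied to every pattern match, so intermediate results are covered as well as the final answer.

```python
def visible_graphs(user, policies):
    """Graphs this user may read, according to per-graph policy functions."""
    return {graph for graph, may_read in policies.items() if may_read(user)}


def match(quads, user, policies, s=None, p=None, o=None):
    """Match one triple pattern, looking only inside graphs the user may read."""
    visible = visible_graphs(user, policies)
    return [
        (qs, qp, qo)
        for (g, qs, qp, qo) in quads
        if g in visible
        and (s is None or qs == s)
        and (p is None or qp == p)
        and (o is None or qo == o)
    ]


attendees = {"alice", "bob"}
policies = {
    "urn:graph:public": lambda user: True,
    "urn:graph:meeting": lambda user: user in attendees,
}
quads = [
    ("urn:graph:public", "ex:alice", "foaf:name", "Alice"),
    ("urn:graph:meeting", "ex:meeting1", "ex:room", "B12"),
]

print(match(quads, "alice", policies, p="ex:room"))
print(match(quads, "carol", policies, p="ex:room"))
```

Because the set of visible graphs is computed once per query, the per-row cost is a single set membership test, which is far less than evaluating row-level policy conditions.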
Federation vs. Centralization
On the open web, there is the question of federation vs. centralization. If an application is seen to be an interface to a vocabulary, it becomes more agnostic with respect to this. In practice, if we are talking about hosted services, what is hosted together joins much faster. Data Spaces with lots of interlinking, such as closely connected social networks, will tend to cluster together on the same cloud to facilitate joint operation. Data is ubiquitous and not location-conscious, but what one can efficiently do with it depends on location. Joint access patterns favor joint location. Due to technicalities of the matter, single database clusters will run complex queries within the cluster 100 to 1000 times faster than between clusters. The size of such data clouds may be in the hundreds-of-billions of triples. It seems to make sense to have data belonging to same-type or jointly-used applications close together. In practice, there will arise partitioning by type of usage, user profile, etc., but this is no longer airtight and applications more-or-less float on top of all of this.
A search engine can host a copy of the Document Web and allow text lookups on it. But a text lookup is a single well-defined query that happens to parallelize and partition very well. A search engine can also have all the structured public data copied, but the problem there is that queries are a lot less predictable and may take orders of magnitude more resources than a single text lookup. As a partial answer, even now, we can set up a database so that the first million single-row joins cost the user nothing, but doing more requires a special subscription.
The cost of hosting a trillion triples will vary radically as a function of the throughput promised. This may result in pricing per service level, a bit like ISP pricing varies with promised connectivity. Queries can be run for free if no throughput guarantee applies, and might cost more if the host promises at least five-million joins-per-second including infrequently-accessed data.
Performance and cost dynamics will probably lead to the emergence of domain-specific clusters of colocated Data Spaces. The landscape will be hybrid, where usage drives data colocation. A single Google is not a practical solution to the world's spectrum of query needs.
What is the Cost of Schema-Last?
The DataSphere proposition is predicated on a worldwide database fabric that can store anything, just like a network can transport anything. It cannot enforce a fixed schema, just like TCP/IP cannot say that it will transport only email. This is continuous schema evolution. Well, TCP/IP can transport anything but it does transport a lot of HTML and email. Similarly, the DataSphere can optimize for some common vocabularies.
We have seen that an application-specific relational schema is often 10 times more efficient than an equivalent completely generic RDF representation of the same thing. The gap may narrow, but task specific representations will keep an edge. We ought to know, as we do both.
While anything can be represented, the masses are not that creative. For any data-hosting provider, making a specialized representation for the top 100 entities may cut data size in half or better. This is a behind-the-scenes optimization that will in time be a matter of course.
Historically, our industry has been driven by two phenomena:
- New PCs every 2 years. To make this necessary, Windows has been getting bigger and bigger, and not upgrading is not an option if one must exchange documents with new data formats and keep up with security.
- Agility, or ad hoc over planned. The reason the RDBMS won over CODASYL network databases was that one did not have to define what queries could be made when creating the database. With the Linked Data Web, we have one more step in this direction when we say that one does not have to decide what can be represented when creating the database.
To summarize, there is some cost to schema-last, but then our industry needs more complexity to keep justifying constant investment. The cost is in this sense not all bad.
Building the DataSphere may be the next great driver of server demand. As a case in point, Cisco, whose fortune was made when the network became ubiquitous, just entered the server game. It's in the air.
DataSphere Precursors
Right now, we have the Linked Open Data movement with lots of new data being added. We have the drive for data- and reputation-portability. We have Freebase as a demonstrator of end-users actually producing structured data. We have convergence of terminology around DBpedia, FOAF, SIOC, and more. We have demonstrators of useful data integration on the RDF stack in diverse fields, especially life sciences.
We have a totally ubiquitous network for the distribution of this, plus database technology to make this work.
We have a practical need for semantics, as search is getting saturated, email is getting killed by spam, and information overload is a constant. Social networks can be leveraged for solving a lot of this, if they can only be opened.
Of course, there is a call for transparency in society at large. Well, the battle of transparency vs. spin is a permanent feature of human existence but even there, we cannot ignore the possibilities of open data.
Databases and Servers
Technically, what does this take? Mostly, this takes a lot of memory. The software is there and we are productizing it as we speak. As with other data intensive things, the key is scalable querying over clusters of commodity servers. Nothing we have not heard before. Of course, the DBMS must know about RDF specifics to get the right query plans and so on but this we have explained elsewhere.
This all comes down to the cost of memory. No amount of CPU or network speed will make any difference if data is not in memory. Right now, a board with 8G and a dual core AMD X86-64 and 4 disks may cost about $700. 2 x 4 core Xeon and 16G and 8 disks may be $4000, counting just the components. In our experience, about 32G per billion triples is a minimum. This must be backed by a few independent disks so as to fill the cache in parallel. A cluster with 1 TB of RAM would be under $100K if built from low end boards.
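The arithmetic behind these figures is easy to reproduce. The sketch below uses only the numbers quoted above (8G low-end boards at about $700, 32G of RAM per billion triples) and counts board components only; networking, racks, and power are left out, which is an assumption of this estimate.

```python
import math

GB_PER_BILLION_TRIPLES = 32  # stated minimum from experience
BOARD_RAM_GB = 8             # low-end board, dual core, 4 disks
BOARD_PRICE_USD = 700        # component cost only


def cluster_estimate(total_ram_gb):
    """Board count, component cost, and triple capacity for a RAM target."""
    boards = math.ceil(total_ram_gb / BOARD_RAM_GB)
    return {
        "boards": boards,
        "component_cost_usd": boards * BOARD_PRICE_USD,
        "billion_triples": total_ram_gb / GB_PER_BILLION_TRIPLES,
    }


print(cluster_estimate(1024))  # 1 TB of RAM
```

For 1 TB this comes to 128 boards and $89,600 in components, consistent with the "under $100K" figure, with room for about 32 billion triples at the 32G-per-billion rule.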
The workload is all about large joins across partitions. The queries parallelize well, thus using the largest and most expensive machines for building blocks is not cost efficient. Having absolutely everything in RAM is also not cost efficient, but it is necessary to have many disks to absorb the random access load. Disk access is predominantly random, unlike some analytics workloads that can read serially. If SSDs get a bit cheaper, one could have SSD for the database and disk for backup.
With large data centers, redundancy becomes an issue. The most cost effective redundancy is simply storing partitions in duplicate or triplicate on different commodity servers. The DBMS software should handle the replication and fail-over.
For operating such systems, scaling-on-demand is necessary. Data must move between servers, and adding or replacing servers should be an on-the-fly operation. Also, since access is essentially never uniform, the most commonly accessed partitions may benefit from being kept in more copies than less frequently accessed ones. The DBMS must be essentially self-administering, since these things are quite complex and easily become intractable without an in-depth understanding of the field.
The best price point for hardware varies with time. Right now, the optimum is to have many basic motherboards with maximum memory in a rack unit, then another unit with local disks for each motherboard. This is much cheaper than SANs and InfiniBand fabrics.
Conclusions and Next Steps
The ingredients and use cases are there. If server clusters with 1TB RAM begin under $100K, the cost of deployment is small compared to personnel costs.
Bootstrapping the DataSphere from current Linked Open Data, such as DBpedia, OpenCYC, Freebase, and every sort of social network, is feasible. Aside from private data integration and analytics efforts and E-science, the use cases are liberating social networks and C2C and some aspects of search from silos, overcoming spam, and mass use of semantics extracted from text. Emergent effects will then carry the ball to places we have not yet been.
The Linked Data Web has its origins in Semantic Web research, and many of the present participants come from these circles. Things may have been slowed down by a disconnect, only too typical of human activity, between Semantic Web research on one hand and database engineering on the other. Right now, the challenge is one of engineering. As documented on this blog, we have worked quite a bit on cluster databases, mostly but not exclusively with RDF use cases. The actual challenges of this are however not at all what is discussed in Semantic Web conferences. These have to do with complexities of parallelism, timing, message bottlenecks, transactions, and the like, i.e., hardcore engineering. These are difficult beyond what the casual onlooker might guess but not impossible. The details that remain to be worked out are nothing semantic, they are hardcore database, concerning automatic provisioning and such matters.
It is as if the Semantic Web people look with envy at the Web 2.0 side where there are big deployments in production, yet they do not seem quite ready to take the step themselves. Well, I will write some other time about research and engineering. For now, the message is: go for it. Stay tuned for more announcements, as we near production with our next generation of software.
|
03/25/2009 10:50 GMT-0500
|
Modified:
03/25/2009 12:31 GMT-0500
|
Beyond Applications - Introducing the Planetary Datasphere (Part 1)
This is the first in a short series of blog posts about what becomes possible when essentially unlimited linked data can be deployed on the open web and private intranets.
The term DataSphere comes from Dan Simmons' Hyperion science fiction series, where it is a sort of pervasive computing capability that plays host to all sorts of processes, including what people do on the net today, and then some. I use this term here in order to emphasize the blurring of silo and application boundaries. The network is not only the computer but also the database. I will look at what effects the birth of a sort of linked data stratum can have on end-user experience, application development, application deployment and hosting, business models and advertising, and security; how cloud computing fits in; and how back-end software such as databases must evolve to support all of these.
This is a mid-term vision. The components are coming into production as we speak, but the end result is not here quite yet.
I use the word DataSphere to refer to a worldwide database fabric, a global Distributed DBMS collective, within which there are many Data Spaces, or Named Data Spaces. A Data Space is essentially a person's or organization's contribution to the DataSphere. I use Linked Data Web to refer to component technologies and practices such as RDF, SPARQL, Linked Data practices, etc. The DataSphere does not have to be built on this technology stack per se, but this stack is still the best bet for it.
General
There exist applications for performing specialized functions such as social networking, shopping, document search, and C2C commerce at planetary scale. All these applications run on their own databases, each with a task specific schema. They communicate by web pages and by predefined messages for diverse application-specific transactions and reports.
These silos are scalable because in general their data has some natural partitioning, and because the set of transactions is predetermined and the data structure is set up for this.
The Linked Data Web proposes to create a data infrastructure that can hold anything, just like a network can transport anything. This is not a network with a memory of messages, but a whole that can answer arbitrary questions about what has been said. The prerequisite is that the questions are phrased in a vocabulary that is compatible with the vocabulary in which the statements themselves were made.
In this setting, the vocabulary takes the place of the application. Of course, there continues to be a procedural element to applications; this has the function of translating statements between the domain vocabulary and a user interface. Examples are data import from existing applications, running predefined reports, composing new reports, and translating between natural language and the domain vocabulary.
The big difference is that the database moves outside of the silo, at least in logical terms. The database will be like the network — horizontal and ubiquitous. The equivalent of TCP/IP will be the RDF/SPARQL combination. The equivalent of routing protocols between ISPs will be gateways between the specific DBMS engines supporting the services.
The place of the DBMS in the stack changes
The RDBMS in itself is eternal, or at least as eternal as a culture with heavy reliance on written records is. Any such culture will invent the RDBMS and use it where it best fits. We are not replacing this; we are building an abstracted worldwide data layer. This is to the RDBMS supporting line-of-business applications what the www was to enterprise content management systems.
For transactions, the Web 2.0-style application-specific messages are fine. Also, any transactional system that must be audited must physically reside somewhere, have physical security, etc. It can't just be somewhere in the DataSphere, managed by some system with which one has no contract, just like Google's web page cache can't be relied on as a permanent repository of web content.
Providing space on the Linked Data Web is like providing hosting on the Document Web. This may have varying service levels, pricing models, etc. The value of a queriable DataSphere is that a new application does not have to begin by building its own schema, database infrastructure, service hosting, etc. The application becomes more like a language meme, a cultural form of interaction mediated by a relatively lightweight user-facing component, laterally open for unforeseen interaction with other applications from other domains of discourse.
End User Benefits
For the end user, the web will still look like a place where one can shop, discuss, date, whatever. These activities will be mediated by user interfaces as they are now. Right now, the end user's web presence consists of a blog or web site, plus contributions to diverse wikis, social web sites, and so forth. These are scattered. The user's Data Space is the collection of all these things, now presented in a queriable form. The user's Data Space is the user's statement of presence, referencing the diverse contributions of the user on diverse sites.
The personal Data Space being a queriable, structured whole facilitates finding and being found, which is what brings individuals to the web in the first place. The best applications and sites are those which make this the easiest. The Linked Data Web allows saying what one wishes in a structured, queriable manner, across all application domains, independently of domain-specific silos. The end user's interaction with the personal Data Space is through applications, like now. But these applications are just wrappers on top of self-describing data, represented in domain-specific vocabularies; one vocabulary is used for social networking, another for C2C commerce, and so on. Users are the masters of their personal Data Spaces, free to take them wherever they wish.
Further benefits will include more ready referencing between these spaces, more uniform identity management, cross-application operations, and the emergence of "meta-applications," i.e., unified interfaces for managing many related applications/tasks.
Of course, there is the increase in semantic richness, such as better contextuality derived from entity extraction from text. But this is also possible in a silo. The Linked Data Web angle is the sharing of identifiers for real-world entities, which makes extracts of different sources by different parties potentially joinable. The user will hardly ever interact with the raw data. But with the raw data still at hand, there can be better targeting of advertisements, better offering of related services, easier discovery of related content, and less noise overall.
Kingsley Idehen has coined the term SDQ, for Serendipitous Discovery Quotient, to denote this. When applications expose explicit semantics, constructing a user experience that combines relevant data from many sources, including applications as well as highly targeted advertising, becomes natural. It is no longer a matter of "mashing up" web service interfaces with procedural code, but of "meshing" data through declarative queries across application spaces.
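To make the point about shared identifiers concrete, here is a minimal sketch in Python of joining data from two independent sources on a common entity URI. Everything in it is an assumption made for illustration: the URIs, the `ex:` predicates, and the function name are hypothetical, and a real deployment would express this as a declarative query against an RDF store rather than as procedural code.

```python
# Two sources that never coordinated, except that both use the same
# identifier for the city. All URIs and predicates are made up.
REVIEWS = [
    ("http://example.org/id/Helsinki", "ex:reviewedBy", "http://reviews.example/user/42"),
    ("http://example.org/id/Oslo", "ex:reviewedBy", "http://reviews.example/user/7"),
]
EVENTS = [
    ("http://events.example/e/1", "ex:location", "http://example.org/id/Helsinki"),
    ("http://events.example/e/2", "ex:location", "http://example.org/id/Tampere"),
]


def join_on_shared_entity(reviews, events):
    """Pair each event with the reviewers of the place where it is held."""
    # Index the first source by the shared identifier (the place URI).
    reviewers_by_place = {}
    for place, predicate, reviewer in reviews:
        if predicate == "ex:reviewedBy":
            reviewers_by_place.setdefault(place, []).append(reviewer)

    # Probe the index with the second source; only shared URIs match.
    results = []
    for event, predicate, place in events:
        if predicate == "ex:location":
            for reviewer in reviewers_by_place.get(place, []):
                results.append((event, place, reviewer))
    return results


for row in join_on_shared_entity(REVIEWS, EVENTS):
    print(row)
```

Only the Helsinki event joins, because it is the only place both sources name with the same URI. Had the two sources used different identifiers for Helsinki, a translation step would be needed before any join could happen.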
Applications in the DataSphere
The workflows supported by the DataSphere are essentially those taking place on the web now. The DataSphere dimension is expressed by bookmarklets, browser plugins, and the like, with ready access to related data and actions that are relevant for this data. Actions triggered by data can be anything from posting a comment to making an e-commerce purchase. Web 2.0 models fit right in.
Web application development now consists of designing an application-specific database schema and writing web pages to interact with this schema. In the DataSphere, the database is abstracted away, as is a large part of the schema. The application floats on a sea of data instead of being tied to its own specific store and schema. Some local transaction data should still be handled in the old way, though.
For the application developer, the question becomes one of vocabulary choice. How will the application synthesize URIs from the user interaction? Which URIs will be used, given that pretty much anything will in practice have many names (e.g., DBpedia vs. Freebase identifiers)? The end user will generally have no idea of this choice, nor of the various degrees of normalization, etc., in the vocabularies. Still, usage of such applications will produce data using some identifiers and vocabularies. Benefits of ready joining without translation will drive adoption. A vocabulary with instance data will get more instance data.
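One way to picture the URI synthesis question is the following Python sketch, which prefers an identifier that is already in use and mints a new one only as a fallback. The lookup table, the `http://app.example/id/` namespace, and both function names are hypothetical, chosen only for this example; how a real application would discover existing identifiers is exactly the open question.

```python
from urllib.parse import quote

# Hypothetical table of labels for which a widely used URI is already known.
KNOWN = {"berlin": "http://dbpedia.org/resource/Berlin"}


def mint_uri(namespace, label):
    """Build a new URI from free text typed by the user."""
    # Collapse runs of whitespace, join words with underscores,
    # and percent-encode anything that is not safe in a URI path.
    slug = "_".join(label.split())
    return namespace + quote(slug, safe="")


def resolve_or_mint(label, known=KNOWN, namespace="http://app.example/id/"):
    """Reuse an existing identifier when one is known, else mint a new one."""
    key = " ".join(label.split()).lower()
    if key in known:
        return known[key]
    return mint_uri(namespace, label)


print(resolve_or_mint("Berlin"))
print(resolve_or_mint("New York City"))
```

Reusing the known URI is what makes the resulting data joinable without translation; every freshly minted URI is one more synonym that somebody will eventually have to reconcile.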
The Linked Data Web infrastructure itself must support vocabulary and identifier choice by answering questions about who uses a particular identifier and where. Even now, we offer entity ranks, resolution of synonyms, queries on which graphs mention a certain identifier, and so on. This is a means of finding the most commonly used term for each situation. Convergence of terminology cuts down on translation and makes for easier and more efficient querying.
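The kind of question meant here can be sketched over a handful of quads in Python. This is an illustration under stated assumptions, not how Virtuoso implements entity ranks or synonym resolution: the graph names, the `ex:` predicates, the sample data, and the function names are all invented, and "most commonly used" is simply taken to mean "mentioned in the most graphs".

```python
# Quads are (graph, subject, predicate, object). Sample data is made up.
QUADS = [
    ("http://a.example/g", "http://dbpedia.org/resource/Berlin", "ex:population", "3400000"),
    ("http://b.example/g", "http://shop.example/o/1", "ex:shipsTo", "http://dbpedia.org/resource/Berlin"),
    ("http://c.example/g", "http://other.example/id/berlin", "ex:label", "Berlin"),
]


def graphs_mentioning(quads, identifier):
    """Return the set of graphs in which the identifier appears in any position."""
    return {g for g, s, p, o in quads if identifier in (s, p, o)}


def most_used(quads, synonyms):
    """Among synonymous identifiers, pick the one mentioned in the most graphs."""
    # Sorting first makes tie-breaking deterministic (alphabetical order).
    return max(sorted(synonyms), key=lambda term: len(graphs_mentioning(quads, term)))


synonyms = {"http://dbpedia.org/resource/Berlin", "http://other.example/id/berlin"}
print(graphs_mentioning(QUADS, "http://dbpedia.org/resource/Berlin"))
print(most_used(QUADS, synonyms))
```

An application developer choosing between the two names for Berlin would, on this evidence, pick the one that already appears in two graphs rather than one.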
Advertising
The application developer is, for purposes of advertising, in the position of the inventory owner, just like a traditional publisher, whether web or other. But with smarter data, the ads are driven not by static keywords but by the semantically explicit data behind each individual user impression. Data itself carries no ads, but the user impression will still go through a display layer that can show ads. If the application relies on reuse of licensed content, such as media, then the content provider may get a cut of the ad revenue even if it is not the direct owner of the inventory. The specifics of implementing and enforcing this are to be worked out.
Content Providers, License, and Attribution
For the content provider, the URI is the brand carrier. If the data is well linked and queriable, this will drive usage and traffic to the services of the content provider. This is true of any provider, whether a media publisher, e-commerce business, government agency, or anything else.
Intellectual property considerations will make the URI a first-class citizen. Just as the URI is a part of the document web experience, it is a part of the Linked Data Web experience. Just as Creative Commons licenses allow the licensor to define what type of attribution is required, a data publisher can mandate that a user experience mediated by whatever application should expose the source as a dereferenceable URI.
One element of data dereferencing must be linking to applications that facilitate human interaction with the data. A generic data browser is a developer tool; the end user experience must still be mediated by interfaces tailored to the domain. This layer can take care of making the brand visible and can show advertising or be monetized on a usage basis.
Next we will look at the service provider and infrastructure side of this.
|
03/24/2009 09:38 GMT-0500
|
Modified:
03/24/2009 10:50 GMT-0500
|