Details

Virtuoso Data Space Bot
Burlington, United States

Subscribe

Post Categories

Recent Articles

Display Settings

articles per page.
order.
Compare & Contrast: SQL Server's Linked Server vs Virtuoso's Virtual Database Layer

Microsoft SQL Server's Linked Server Promise

The ability to use distributed queries -- i.e., to issue SQL queries against any OLE-DB-accessible back end -- via Linked Servers.

The promise fails to materialize, primarily because while there are several ways of issuing such distributed queries, none of them work with all data access providers, and even for those that do, results received via different methods may differ.

Compounding the issue, there are specific configuration options which must be set correctly, often differing from defaults, to permit such things as "ad-hoc distributed queries".

Common tools that are typically used with such Linked Servers include SSIS and DTS. Such generic tools typically rely on four-part naming for their queries, expecting SQL Server to properly rewrite remotely executed queries for the DBMS engine which ultimately executes them.

The most common cause of failure is that when SQL Server rewrites a query, it typically does so using SQL-92 syntax, regardless of the back-end's abilities, and using the Transact-SQL dialect for implementation-specific query syntaxes, regardless of the back-end's dialect. This leads to problems especially when the Linked Server is an older variant which doesn't support SQL-92 (e.g., Progress 8.x or earlier, Informix 7 or earlier), or which SQL dialect differs substantially from Transact-SQL (e.g., Informix, Progress, MySQL, etc.).

Basic Four-Part Naming

SELECT *
  FROM linked_server.[catalog].[schema].object

Four-part naming presumes that you have pre-defined a Linked Server, and executes the query on SQL Server. SQL Server decides what if any sub- or partial-queries to execute on the linked server, tends not to use appropriate syntax for these, and usually does not take advantage of linked server or provider features.

OpenQuery

SELECT *
  FROM OPENQUERY ( linked_server , 'query' )

OpenQuery also presumes that you have pre-defined a Linked Server, but executes the query as a "pass-through", handing it directly to the remote provider. Features of the remote server and the data access provider may be taken advantage of, but only if the query author knows about them.

From the product docs:

SQL Server's Linked Server extension executes the specified pass-through query on the specified linked server. This server is an OLE DB data source. OPENQUERY can be referenced in the FROM clause of a query as if it were a table name. OPENQUERY can also be referenced as the target table of an INSERT, UPDATE, or DELETE statement. This is subject to the capabilities of the OLE DB provider. Although the query may return multiple result sets, OPENQUERY returns only the first one.

...

OPENQUERY does not accept variables for its arguments. OPENQUERY cannot be used to execute extended stored procedures on a linked server. However, an extended stored procedure can be executed on a linked server by using a four-part name.

OpenRowset

SELECT *
  FROM OPENROWSET
    ( 'provider_name' ,
      'datasource' ; 'user_id' ; 'password',
      { [ catalog. ] [ schema. ] object | 'query' }
    )

OpenRowset does not require a pre-defined Linked Server, but does require the user to know what data access providers are available on the SQL Server host, and how to manually construct a valid connection string for the chosen provider. It does permit both "pass-through" and "local execution" queries, which can lead to confusion when the results differ (as they regularly will).

More from product docs:

Includes all connection information that is required to access remote data from an OLE DB data source. This method is an alternative to accessing tables in a linked server and is a one-time, ad hoc method of connecting and accessing remote data by using OLE DB. For more frequent references to OLE DB data sources, use linked servers instead. For more information, see Linking Servers. The OPENROWSET function can be referenced in the FROM clause of a query as if it were a table name. The OPENROWSET function can also be referenced as the target table of an INSERT, UPDATE, or DELETE statement, subject to the capabilities of the OLE DB provider. Although the query might return multiple result sets, OPENROWSET returns only the first one.

OPENROWSET also supports bulk operations through a built-in BULK provider that enables data from a file to be read and returned as a rowset.

...

OPENROWSET can be used to access remote data from OLE DB data sources only when the DisallowAdhocAccess registry option is explicitly set to 0 for the specified provider, and the Ad Hoc Distributed Queries advanced configuration option is enabled. When these options are not set, the default behavior does not allow for ad hoc access. When accessing remote OLE DB data sources, the login identity of trusted connections is not automatically delegated from the server on which the client is connected to the server that is being queried. Authentication delegation must be configured. For more information, see Configuring Linked Servers for Delegation.

Catalog and schema names are required if the OLE DB provider supports multiple catalogs and schemas in the specified data source. Values for catalog and schema can be omitted when the OLE DB provider does not support them. If the provider supports only schema names, a two-part name of the form schema.object must be specified. If the provider supports only catalog names, a three-part name of the form catalog.schema.object must be specified. Three-part names must be specified for pass-through queries that use the SQL Server Native Client OLE DB provider. For more information, see Transact-SQL Syntax Conventions (Transact-SQL). OPENROWSET does not accept variables for its arguments.

OpenDataSource

SELECT *
  FROM OPENDATASOURCE
    ( 'provider_name',
      'provider_specific_datasource_specification'
    ).[catalog].[schema].object

As with basic four-part naming, OpenDataSource executes the query on SQL Server. SQL Server decides what if any sub-queries to execute on the linked server, tends not to use appropriate syntax for these, and usually does not take advantage of linked server or provider features.

Additional doc excerpts

Provides ad hoc connection information as part of a four-part object name without using a linked server name.

...

OPENDATASOURCE can be used to access remote data from OLE DB data sources only when the DisallowAdhocAccess registry option is explicitly set to 0 for the specified provider, and the Ad Hoc Distributed Queries advanced configuration option is enabled. When these options are not set, the default behavior does not allow for ad hoc access.

The OPENDATASOURCE function can be used in the same Transact-SQL syntax locations as a linked-server name. Therefore, OPENDATASOURCE can be used as the first part of a four-part name that refers to a table or view name in a SELECT, INSERT, UPDATE, or DELETE statement, or to a remote stored procedure in an EXECUTE statement. When executing remote stored procedures, OPENDATASOURCE should refer to another instance of SQL Server. OPENDATASOURCE does not accept variables for its arguments.

Like the OPENROWSET function, OPENDATASOURCE should only reference OLE DB data sources that are accessed infrequently. Define a linked server for any data sources accessed more than several times. Neither OPENDATASOURCE nor OPENROWSET provide all the functionality of linked-server definitions, such as security management and the ability to query catalog information. All connection information, including passwords, must be provided every time that OPENDATASOURCE is called.

Virtuoso's Virtual Database Promise & Deliverables

The ability to link objects (tables, views, stored procedures) from any ODBC-accessible data source. This includes any JDBC-accessible data source, through the OpenLink ODBC Driver for JDBC Data Sources.

There are no limitations on the data types which can be queried or read, nor must the target DBMS have primary keys set on linked tables or views.

All linked objects may be used in single-site or distributed queries, and the user need not know anything about the actual data structure, including whether the objects being queried are remote or local to Virtuoso -- all objects are made to appear as part of a Virtuoso-local schema.

# PermaLink Comments [0]
02/12/2010 16:44 GMT-0500 Modified: 02/17/2010 11:21 GMT-0500
Compare & Contrast: Oracle Heterogeneous Services (HSODBC, DG4ODBC) vs Virtuoso's Virtual Database Layer

Oracle Gateway Promise

Ability to use distributed queries over a generic connectivity gateway (HSODBC, DG4ODBC) -- i.e., to issue SQL queries against any ODBC- or OLE-DB-accessible linked back end.

Reality

Promise fails to materialize for several reasons. Immediate limitations include:

  • All tables locked by a FOR UPDATE clause and all tables with LONG columns selected by the query must be located in the same external database.
  • Distributed queries cannot select user-defined types or object REF datatypes on remote tables.

In addition to the above, which apply to database-specific heterogeneous environments, the database-agnostic generic connectivity components have the following limitations:

  • A table including a BLOB column must have a separate column that serves as a primary key.
  • BLOB and CLOB data cannot be read by passthrough queries.
  • Updates or deletes that include unsupported functions within a WHERE clause are not allowed.
  • Generic Connectivity does not support stored procedures.
  • Generic Connectivity agents cannot participate in distributed transactions; they support single-site transactions only.
  • Generic Connectivity does not support multithreaded agents.
  • Updating LONG columns with bind variables is not supported.
  • Generic Connectivity does not support ROWIDs.

Compounding the issue, the HSODBC and DG4ODBC generic connectivity agents perform many of their functions by brute-force methods. Rather than interrogating the data access provider (whether ODBC or OLE DB) or DBMS to which they are connected, to learn their capabilities, many things are done by using the lowest possible function.

For instance, when a SELECT COUNT (*) FROM table@link is issued through Oracle SQL, the target DBMS doesn't simply perform a SELECT COUNT (*) FROM table. Rather, it performs a SELECT * FROM table which is used to inventory all columns in the table, and then performs and fully retrieves SELECT field FROM table into an internal temporary table, where it does the COUNT (*) itself, locally. Testing has confirmed this process to be the case despite Oracle documentation stating that target data sources must support COUNT (*) (among other functions).

Virtuoso's Virtual Database Comparison

The Virtuoso Universal Server will link/attach objects (tables, views, stored procedures) from any ODBC-accessible data source. This includes any JDBC-accessible data source, through the OpenLink ODBC Driver for JDBC Data Sources.

There are no limitations on the data types which can be queried or read, nor must the target DBMS have primary keys set on linked tables or views.

All linked objects may be used in single-site or distributed queries, and the user need not know anything about the actual data structure, including whether the objects being queried are remote or local to Virtuoso -- all objects are made to appear as part of a Virtuoso-local schema.

# PermaLink Comments [0]
02/12/2010 16:43 GMT-0500 Modified: 02/17/2010 11:21 GMT-0500
Virtuoso Cluster
Virtuoso Cluster

We often get questions on clustering support, especially around RDF, where databases quickly get rather large. So we will answer them here.

But first on some support technology. We have an entire new disk allocation and IO system. It is basically operational but needs some further tuning. It offers much better locality and much better sequential access speeds.

Specially for dealing with large RDF databases, we will introduce data compression. We have over the years looked at different key compression possibilities but have never been very excited by them since thy complicate random access to index pages and make for longer execution paths, require scraping data for one logical thing from many places, and so on. Anyway, now we will compress pages before writing them to disk, so the cache is in machine byte order and alignment and disk is compressed. Since multiple processors are commonplace on servers, they can well be used for compression, that being such a nicely local operation, all in cache and requiring no serialization with other things.

Of course, what was fixed length now becomes variable length, but if the compression ratio is fairly constant, we reserve space for the expected compressed size, and deal with the rare overflows separately. So no complicated shifting data around when something grows.

Once we are done with this, this could well be a separate intermediate release.

Now about clusters. We have for a long time had various plans for clusters but have not seen the immediate need for execution. With the rapid growth in the Linking Open Data movement and questions on web scale knowledge systems, it is time to get going.

How will it work? Virtuoso remains a generic DBMS, thus the clustering support is an across the board feature, not something for RDF only. So we can join Oracle, IBM DB2, and others at the multi-terabyte TPC races.

We introduce hash partitioning at the index level and allow for redundancy, where multiple nodes can serve the same partition, allowing for load balancing read and replacement of failing nodes and growth of cluster without interruption of service.

The SQL compiler, SPARQL, and database engine all stay the same. There is a little change in the SQL run time, not so different from what we do with remote databases at present in the context of our virtual database federation. There is a little extra complexity for distributed deadlock detection and sometimes multiple threads per transaction. We remember that one RPC round trip Is 3-4 index lookups, so we pipeline things so as to move requests in batches, a few dozen at a time.

The cluster support will be in the same executable and will be enabled by configuration file settings. Administration is limited to one node, but Web and SQL clients can connect to any node and see the same data. There is no balancing between storage and control nodes because clients can simply be allocated round robin for statistically even usage. In relational applications, as exemplified by TPC-C, if one partitions by fields with an application meaning (such as warehouse ID), and if clients have an affinity to a particular chunk of data, they will of course preferentially connect to nodes hosting this data. With RDF, such affinity is unlikely, so nodes are basically interchangeable.

In practice, we develop in June and July. Then we can rent a supercomputer maybe from Amazon EC2 and experiment away.

We should just come up with a name for this. Maybe something astronomical, like star cluster. Big, bright but in this case not far away.

# PermaLink Comments [1]
05/23/2007 10:09 GMT-0500 Modified: 08/21/2015 16:19 GMT-0500
Season's Greetings from Virtuoso Development
Season's Greetings from Virtuoso Development

It's been a long and very busy time since the last blog post.

Now and then, circumstances call for a return to the contemplation of first principles. I have lately beheld the Platonic ideal of database-ness and translated it into engineering elegance. No quest is static and no objective is permanently achieved.

Accordingly, I have redone all Virtuoso core engine structures for control of parallel execution. As we now routinely get multiple cores per chip, this is more important than before. Aside from dramatic improvements in multiprocessor performance, there is also quite a bit of optimization for basic relational operations.

Of course, this is not for the pure pleasure of geek-craft; it serves a very practical purpose. RDF opens a new database frontier, where these things make a significant difference. In application scenarios involving either federated/virtual database or running typical web applications, the core concurrency of the DBMS is not really the determining factor. However, with RDF, we get a small number of very large tables and most processing goes to these tables. This is also often so with business intelligence but it is still more so with RDF. Thus the parallelism within a single index becomes essential.

We have also made a point by point comparison of Virtuoso and Oracle 10g for basic relational operations. Oracle is very good, certainly in the basic relational operations like table scans and different kinds of joins. As a matter of principle, we will at the minimum match Oracle in all these things, in single and multiprocessor environments. The Virtuoso cut forthcoming in January will have all this inside. We are also considering making and publishing a basic RDBMS performance checklist, aimed at comparing specific aspects of relational engine performance. While the TPC tests give a good aggregate figure, it is sometimes interesting to look at a finer level of detail. We may not be allowed to give out numbers in all cases due to license terms but we can certainly make the test available and publish numbers for those who do not object to this.

Of course, RDF is the direct beneficiary of all these efforts, since RDF loading and querying basically rests on the performance of very relational things, such as diverse types of indices and joins.

More information will be forthcoming in January.

Merry Christmas and productive new year to all.

# PermaLink Comments [0]
01/09/2007 02:05 GMT-0500 Modified: 04/17/2008 21:04 GMT-0500
Recent Virtuoso Developments
Recent Virtuoso Developments

We have been extensively working on virtual database refinements. There are many SQL cost model adjustments to better model distributed queries and we now support direct access to Oracle and Informix statistics system tables. Thus, when you attach a table from one or the other, you automatically getup to date statistics. This helps Virtuoso optimize distributed queries. Also the documentation is updated as concerns these, with a new section on distributed query optimization.

On the applications side, we have been keeping up with the SIOC RDF ontology developments. All ODS applications now make their data available as SIOC graphs for download and SPARQL query access.

What is most exciting however is our advance in mapping relational data into RDF. We now have a mapping language that makes arbitrary legacy data in Virtuoso or elsewhere in the relational world RDF query-able. We will put out a white paper on this in a few days.

Also we have some innovations in mind for optimizing the physical storage of RDF triples. We keep experimenting, now with our sights set to the high end of triple storage, towards billion triple data sets. We are experimenting with a new more space efficient index structure for better working set behavior. Next week will yield the first results.

# PermaLink Comments [0]
01/09/2007 01:35 GMT-0500 Modified: 04/16/2008 16:53 GMT-0500
         
Powered by OpenLink Virtuoso Universal Server
Running on Linux platform
OpenLink Software 1998-2006