A foundation for integrating heterogeneous data sources
Main Author: | Subramanian, Narayana Iyer |
---|---|
Format: | Others |
Published: | 1997 |
Online Access: | http://spectrum.library.concordia.ca/349/1/NQ25935.pdf Subramanian, Narayana Iyer <http://spectrum.library.concordia.ca/view/creators/Subramanian=3ANarayana_Iyer=3A=3A.html> (1997) A foundation for integrating heterogeneous data sources. PhD thesis, Concordia University. |
Summary: | We study the foundations of the integration issues that arise in a federation of heterogeneous data sources, possibly storing related information. Some of the notable features of our approach, motivated by the shortcomings of existing technology, include (a) the ability to share data across multiple heterogeneous data sources, (b) the ability to manipulate the meta-data (schema) component of a data source in the same vein as data can be manipulated, and (c) the ability to query semi-structured data sources (such as HTML documents on the World Wide Web) in addition to well-structured data sources (such as relational databases). Our approach is declarative and is based on a simple logic called SchemaLog. SchemaLog's syntax is higher-order, but it enjoys a first-order semantics. We present a formal account of the semantics of SchemaLog by developing a model theory, a proof theory, and a fixpoint theory. SchemaLog can be implemented on top of existing database systems in a 'non-intrusive' way. Realizing an efficient implementation of a SchemaLog-based system warrants the study of the calculus and algebraic languages underlying SchemaLog. We develop a new algebra by extending the conventional relational algebra with new operations capable of manipulating both data and schema information in a federation of databases. We also develop a calculus language inspired by SchemaLog. Based on the calculus language, we study varying notions of safety that naturally arise in a federation scenario. One of our primary concerns in this dissertation has been the practical relevance and industrial impact of our contributions. In this vein, and inspired by the SchemaLog experience, we develop a principled extension of SQL, called SchemaSQL. SchemaSQL is downward compatible with SQL syntax and semantics and is capable of (a) restructuring the data in a database into a form substantially different from the original database, in which data and meta-data may be interchanged, (b) creating views whose schema is dynamically dependent on the input database, (c) expressing novel aggregation operations (over rows and, more generally, over blocks of information), in the spirit of some of the functionalities needed in OLAP applications, and (d) providing strong support for interoperability and for data and meta-data management in multidatabase systems. Legacy as well as non-traditional information systems constitute an important fragment of the data sources available in real life. We demonstrate that SchemaLog can be naturally extended to support non-relational systems as well. In particular, we address the fundamental problem of retrieving specific information of interest to the user from the enormous number of resources that are available on the Web. With this in mind, and inspired by SchemaLog, we develop a simple logic called WebLog and illustrate the simplicity and power of WebLog for Web querying and restructuring using a variety of applications involving real-life information on the Web. (Abstract shortened by UMI.) |
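The summary's point that data and meta-data may be interchanged, and that a view's schema can depend on the input database, can be pictured with a small sketch. This is plain Python, not SchemaSQL or SchemaLog syntax, and it is not taken from the thesis; the function names, the `salaries` table, and its column names are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def values_to_columns(rows, key, name_col, value_col):
    """Promote the values found in `name_col` to column names (data -> meta-data).

    The output schema is not fixed in advance: it depends on which values
    occur in `name_col` in the input rows.
    """
    grouped = defaultdict(dict)
    for row in rows:
        grouped[row[key]][row[name_col]] = row[value_col]
    return [{key: k, **cols} for k, cols in grouped.items()]


def columns_to_values(rows, key, name_col, value_col):
    """Inverse step: demote column names back to data values (meta-data -> data)."""
    return [
        {key: row[key], name_col: col, value_col: val}
        for row in rows
        for col, val in row.items()
        if col != key
    ]


# Hypothetical input: one row per (department, staff category) pair.
salaries = [
    {"dept": "CS", "category": "prof", "avg_sal": 90},
    {"dept": "CS", "category": "tech", "avg_sal": 60},
    {"dept": "Math", "category": "prof", "avg_sal": 85},
]

# The category values become column names of the restructured table.
wide = values_to_columns(salaries, "dept", "category", "avg_sal")

# Folding the column names back into values recovers the original rows.
restored = columns_to_values(wide, "dept", "category", "avg_sal")
```

Here the columns of `wide` are determined by the categories present in the input, which is the kind of input-dependent schema the summary describes; a conventional SQL view, by contrast, has its column list fixed when the view is defined.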