Thesis Abstract

A wide variety of information sources are available both on the internal networks of organizations and on the Web. These sources are autonomous, have different and limited query capabilities, and usually contain heterogeneous, semistructured data (e.g., XML, bibliographic or genomic data). My thesis focuses on how to query these sources in an integrated way that gives the user the impression of querying a single source. I follow a principled approach, based on declaratively describing the contents and query capabilities of each information source. The fundamental question in this framework is how to answer a query Q that does not mention the sources, using a program, called a mediator, that can interact with the sources and perform some local processing. I have designed algorithms to answer this question for query and capability-description languages of varying expressive power, in both the relational and a semistructured data model. I have also developed the theoretical framework necessary to answer the question for a semistructured language.

To let the mediator interact easily with different kinds of sources, the low-level details of the interaction are handled by a software module called a wrapper. The thesis describes the design of a toolkit I have developed for implementing wrappers.

The framework and algorithms presented in my thesis are part of TSIMMIS, a research prototype that allows integrated querying of semistructured files, Web-based sources and legacy systems.



Last modified: Mon Dec 7 18:22:40 PST 1998