1. Query Rewriting using
Semistructured Views
We address the problem of query rewriting for TSL,
a language for querying semistructured data. We develop and present an
algorithm that, given a semistructured query q and a set of semistructured
views V, finds rewriting queries, i.e., queries that access the views and
produce the same result as q. Our algorithm is based on appropriately generalizing
containment mappings, the chase, and unification -- techniques that were
developed for structured, relational data. We also develop an algorithm
for equivalence checking of TSL queries. We show that the algorithm is
sound and complete for TSL, i.e., it always finds every TSL rewriting query
of q, and we discuss its complexity. We extend the rewriting algorithm
to use available structural constraints (such as DTDs) to find more opportunities
for query rewriting. We currently incorporate the algorithm in the TSIMMIS
system.
2. Expressive Capabilities
Description Languages and Query Rewriting Algorithms
Information integration systems have to cope with
a wide variety of different information sources, which support query interfaces
with very varied capabilities. To deal with this problem, the integration
systems need descriptions of the query capabilities of each source, i.e.,
the set of queries supported by each source. Moreover, the integration
systems need algorithms for deciding how a query can be answered given
the capabilities of the sources. Finally, they need to translate a query
into the format that the source understands. We present two languages suitable
for descriptions of query capabilities of sources and compare their expressive
power. We also use one of the languages to automatically derive the capabilities
description of the integration system itself, in terms of the capabilities
of the sources it integrates. We describe algorithms for deciding whether
a query "matches" the description and show their application to the problem
of translating user queries into source-specific queries and commands.
We propose new, improved algorithms for the problem of answering queries
using these descriptions. Finally, we identify an interesting class of
source capability descriptions, for which our algorithms are much more
efficient.
This paper extends the work in paper
no. 6.
3. Incremental Maintenance
of Semistructured Views over Semistructured Data
Semistructured data is not strictly typed like relational
or object-oriented data and may be irregular or incomplete. It often arises
in practice, e.g., when heterogeneous data sources are integrated or data
is taken from the World Wide Web. Views over semistructured data can be
used to filter the data and to restructure (or provide structure to) it.
To achieve fast query response time, these views are often materialized.
This paper studies incremental maintenance techniques for materialized
views over semistructured data. We use the graph-based data model OEM and
the query language Lorel, developed at Stanford, as the framework for our
work. We propose a new algorithm that produces a set of queries that compute
the changes to the view based upon a change to the source. We develop an
analytic cost model and compare the cost of executing our incremental maintenance
algorithm to that of recomputing the view. We show that for nearly all
types of database updates, it is more efficient to apply our incremental
maintenance algorithm to the view than to recompute the view from the database,
even when there are thousands of updates.
4. Using Knowledge
of Redundancy for Query Optimization in Mediators
Many autonomous and heterogeneous information sources
are becoming increasingly available to the user through the Internet --
especially through the World Wide Web. The integration of Internet sources
poses several challenges which have not been sufficiently addressed. In
particular, knowledge of redundancy can be used to reduce the number of
source accesses that have to be performed to retrieve the answer to the
user query. The system must avoid retrieving the same set of data from
multiple sites. Moreover, redundancy can be exploited to further increase
query performance and system availability by having the system collect
information from the ``cheapest'' available sources. We formulate this
problem as a scheduling problem with AND/OR precedence constraints and
discuss its complexity. Moreover, probabilistic information about source
overlap can help derive efficient query plans for delivering partial answers
to queries. We formulate the optimization framework and propose approximations
to the problem that make efficient use of the source overlap information
to provide suboptimal solutions.
5. Capability-based Mediation
in TSIMMIS
Conventional mediators focus their attention on
the contents of the sources and their relationship to the integrated views
provided to the users. They do not take into account the capabilities of
sources to answer queries. This may lead them to generate plans involving
source queries that cannot be answered by the sources. In the TSIMMIS system,
we have developed a capability-sensitive plan generation module that constructs
feasible plans for user queries in the presence of limited source capabilities.
6. Describing and Using
Query Capabilities of Heterogeneous Sources
Information integration systems have to cope with
the different and limited query interfaces of the underlying information
sources. First, the integration systems need descriptions of the query
capabilities of each source, i.e., the set of queries supported by each
source. Second, the integration systems need algorithms for deciding how
a query can be answered given the capabilities of the sources. Third, they
need to translate a query into the format that the source understands.
We present two languages suitable for descriptions of query capabilities
of sources and compare their expressive power. We also describe algorithms
for deciding whether a query "matches" the description and show their application
to the problem of translating user queries into source-specific queries
and commands. Finally, we propose new improved algorithms for the problem
of answering queries using these descriptions.
7. Template-Based Wrappers
in the TSIMMIS System
Wrappers provide access to heterogeneous information
sources by converting application queries into one or more source-specific
queries or commands. and transforming the source results into the application
data model. This paper describes the wrapper implementation toolkit we
have developed for wrappers. The wrapper implementor specifies a set of
templates written in a high level declarative language that describe the
queries accepted by the wrapper and the objects returned. If an application
query matches a template, an action associated with the template is executed
to provide the native query for the underlying source. A query can match
a template by being equivalent to it, or by being equivalent to a template
modulo a wrapper-generated filter query. Finally, the wrapper transforms
the answer into the representation that is used by the application. The
wrapper can be easily extended by adding templates when more functionality
is required.
8. Views for Semistructured
Data
Defining a view over a semistructured database introduces
many new problems. In this paper we propose a view specification language
and consider the problem of answering queries posed over views. The two
main approaches, query rewriting and view materialization, are outlined
with focus on the difficult problems caused by the semistructured nature
of the data.
9. The TSIMMIS approach
to mediation: Data models and Languages
TSIMMIS -- The Stanford-IBM Manager of Multiple
Information Sources -- is a system for integrating information. It offers
a data model and a common query language that are designed to support the
combining of information from many different sources. It also offers tools
for generating automatically the components that are needed to build systems
for integrating information. In this paper we shall discuss the principal
architectural features and their rationale.
10. An Analysis Of Factors Directing
The Admission Process Of AI Technologies
We start with the observation that, while artificial
intelligence is a prominent branch of computer science that has even found
its way into the popular culture, it appears that its fruits are an inconsequential
part of the computer industry. The paper tries to assess the extent to
which this observation is accurate, by determining and analyzing some important
factors that affect the fortunes of AI technology in the marketplace. It
identifies the stages in the admission process of Information Technology
and the particular obstacles that AI is facing during that process. Finally,
based upon the patterns that the research reveals, we make some predictions
about the commercialization of AI in the future. Some of the observations
made in this paper in 1995 have since been echoed by popular business magazines
(see Forbes,
Nov 30, 1998, "AI gets real").
11. Buffer size and congestion
control in ATM networks
An important problem in ATM networks is to determine
the optimal size of buffers in ATM switches such that traffic and quality
of service requirements are met. I developed a simple scheduling scheme
that, for a simple traffic model, under very general conditions and for
a variety of common network topologies, required much smaller buffers in
the worst case than other known approaches.
If you have comments or questions, please contact me at vassalos@db.stanford.edu