Many companies are investing huge amounts of money
to create an enterprise data warehouse.
A data warehouse is a centralized database that stores
data from often autonomous information sources
distributed in an enterprise.
For instance, WalMart's data warehouse may collect information
from its regional inventory and sales databases.
Once the data warehouse is built,
it is used to answer analytical or decision support queries.
For instance, the WalMart data warehouse may be used
to answer queries such as: "In which stores and in which
months was Flex shampoo in high demand but in short supply?"
While there are good commercial products that help create
a data warehouse, it has been reported that most data
warehousing projects fail. Furthermore, once the data
warehouse is created, only a handful of products help
maintain it.
The goal of my research has been to improve data warehousing
technology by tackling problems that need solutions now,
as well as problems that will arise in the near future.
The first phase in creating or maintaining a data warehouse
is to detect changes from the various distributed sources.
For sources like commercial relational databases, this task
is made easy with the use of triggers. However, many of the
sources may be enterprise legacy systems or plain flat files.
For such sources, we developed efficient
snapshot differential algorithms to detect the changes.
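The core of a snapshot differential can be sketched as follows. This is an illustrative in-memory version that assumes each snapshot is a dictionary mapping a record key to its value; the actual algorithms also handle snapshots too large to fit in memory.

```python
def snapshot_diff(old, new):
    """Compare two snapshots of a source and report what changed.

    old, new: dicts mapping a record key to the record's value.
    Returns (inserts, deletes, updates), where updates maps a key
    to its (old_value, new_value) pair.
    """
    inserts = {k: v for k, v in new.items() if k not in old}
    deletes = {k: v for k, v in old.items() if k not in new}
    updates = {k: (old[k], new[k])
               for k in old.keys() & new.keys()   # keys present in both
               if old[k] != new[k]}
    return inserts, deletes, updates
```

The three result sets correspond directly to the insert, delete, and update streams that are fed to the warehouse maintenance machinery.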
Once the source changes are detected, the tables or materialized
views at the data warehouse need to be maintained.
In order to make this process more efficient, choosing
the proper auxiliary structures (i.e., indices as well as other
materialized views) is crucial. For this task,
we developed various algorithms for
choosing views and indices
to materialize.
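To give a rough flavor of such algorithms, a simple greedy heuristic picks the candidate structures with the best benefit per unit of space until a space budget is exhausted. The candidate names and figures below are hypothetical, and the published algorithms use more careful cost models that account for interactions among views.

```python
def choose_views(candidates, budget):
    """Greedily select auxiliary structures under a space budget.

    candidates: list of (name, benefit, size) tuples, where benefit is
    an estimated saving from materializing the structure.
    Returns the names chosen, best benefit-per-unit-space first.
    """
    chosen, used = [], 0
    for name, benefit, size in sorted(candidates,
                                      key=lambda c: c[1] / c[2],
                                      reverse=True):
        if used + size <= budget:   # take it only if it still fits
            chosen.append(name)
            used += size
    return chosen
```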
Even when the proper structures are chosen, pushing the changes
up the view hierarchy of the warehouse is non-trivial.
The standard view maintenance expressions can be evaluated in a
number of ways, each leading to a different update window.
We developed algorithms
for pushing the changes up a view hierarchy
that result in update windows that are 6 times shorter than
those of conventional methods.
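The benefit of propagating only the changes can be seen in a toy example: for a view V = R JOIN S, an insertion batch into R is joined with S and appended to V, instead of recomputing the whole join. This is a sketch with illustrative relation names; the real maintenance expressions also handle deletions and updates, and order the subexpressions to shrink the update window.

```python
def join(r, s, key):
    # naive nested-loop join on a common key attribute
    return [{**a, **b} for a in r for b in s if a[key] == b[key]]

def maintain_view(view, delta_r, s, key):
    """Incrementally maintain V = R JOIN S under an insertion batch
    delta_r into R: join only the delta with S and append the result,
    rather than recomputing the entire view."""
    view.extend(join(delta_r, s, key))
    return view
```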
Even with the above techniques, the process of loading data
into the warehouse (whether for maintenance or for initial
creation) is very time consuming. Moreover, this process is
not devoid of failures. Hence, it is important to make the
warehouse load resumable.
The warehouse load can be segmented into two phases.
The first phase extracts data from the remote sources,
and cleanses the data. The second phase creates or
maintains the warehouse view hierarchy. For the first phase, we worked
with Sagent
to develop
resumption algorithms. Hence, instead of redoing
the failed warehouse load, only the remaining "parts" of the load
are performed. Our algorithms can lead to a 90% reduction
in the recovery time. We are currently working on resumption algorithms
for the second phase of the load.
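The basic resumption idea can be sketched as follows: progress is recorded in a persistent log of loaded record identifiers, so that a restarted load skips the parts that already completed. The record format and the log here are hypothetical simplifications of what a real loader would keep.

```python
def load(records, target, log):
    """Resumable load: note each loaded key in a persistent log so a
    restart after a failure performs only the remaining records,
    instead of redoing the entire load."""
    for rec in records:
        if rec["id"] in log:     # already loaded before the failure
            continue
        target.append(rec)       # the expensive load step
        log.add(rec["id"])       # commit progress for this record
    return target
```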
Another emerging problem of data warehouses is the
explosive growth of the warehouse materialized views. For instance,
telecommunications companies can collect hundreds of gigabytes
of call data alone. By expiring data that is no longer
needed, both storage and maintenance costs can be greatly
reduced. The following paper discusses how
expiration is done in the context of warehouse materialized views.
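In rough terms, expiration removes detail rows that are older than a cutoff and no longer needed, while retaining rows that the materialized views still reference. The schema below is hypothetical; deriving what is still "needed" from the view definitions is the hard part of the problem.

```python
def expire(fact_rows, cutoff, needed_keys):
    """Drop rows with a timestamp older than the cutoff, except those
    whose key is still referenced by some materialized view."""
    return [r for r in fact_rows
            if r["ts"] >= cutoff or r["key"] in needed_keys]
```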
Many of the algorithms described above are incorporated in
Stanford's Data Warehousing project called
WHIPS.
While the prototype has a lot of functionality, we still need
to improve its scalability. We are also working with
Sagent,
whose Data Mart product is one of the leading
commercial data warehouse creation products,
to develop and incorporate resumption technology.
There are many problems in data warehousing that still need
good solutions. Here are some of the problems that I am
currently working on.
While data warehousing has been my main research focus,
I am also intrigued by the integration of heterogeneous sources.
In particular, one of the problems I foresee is mediating
and answering queries over
multiple Internet sources (e.g., various Internet fill-out forms)
with various restrictions and capabilities.
In the following paper, we describe how to obtain
capability-sensitive query plans over Internet sources.
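The notion of a capability-sensitive plan can be illustrated with a toy planner that orders source calls so that each source's required input bindings (e.g., the mandatory fields of a fill-out form) are available before that source is queried. The source names and capability sets are hypothetical, and the paper constructs full query plans rather than a single call order.

```python
def feasible_order(sources, initially_bound):
    """Greedily order source calls respecting their capabilities.

    sources: dict mapping a source name to (required, provided),
    two sets of attribute names.  Returns a call order in which each
    source's required bindings are available, or None if none exists.
    """
    order, bound = [], set(initially_bound)
    remaining = dict(sources)
    while remaining:
        for name, (required, provided) in remaining.items():
            if required <= bound:       # all inputs are bound: callable
                order.append(name)
                bound |= provided       # its outputs become bindings
                del remaining[name]
                break
        else:
            return None                 # no source is callable: stuck
    return order
```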
Before I got infatuated with databases, I worked on
VLSI partitioning.