Ph.D. Thesis: Data Warehousing

Many companies are investing huge amounts of money to create an enterprise data warehouse. A data warehouse is a centralized database that stores data from often autonomous information sources distributed in an enterprise. For instance, WalMart's data warehouse may collect information from its regional inventory and sales databases. Once the data warehouse is built, it is used to answer analytical or decision support queries. For instance, the WalMart data warehouse may be used to answer queries such as: "Which stores and for what months was shampoo Flex in high demand but short in supply?"

While there are good commercial products that help create a data warehouse, it has been reported that most projects for creating a data warehouse fail. Furthermore, once the data warehouse is created, there are only a handful of products that help maintain a data warehouse.

The goal of my research has been to improve data warehousing technology by tackling problems that need solutions now or in the future.

Making View Maintenance Efficient

The first phase in creating or maintaining a data warehouse is to detect changes from the various distributed sources. For sources like commercial relational databases, this task is made easy with the use of triggers. However, many of the sources may be enterprise legacy systems or plain flat files. For such sources, we developed efficient snapshot differential algorithms to detect the changes.

Once the source changes are detected, the tables or materialized views at the data warehouse need to be maintained. In order to make this process more efficient, choosing the proper auxiliary structures (i.e., indices as well as other materialized views) is crucial. For this task, we developed various algorithms for choosing views and indices to materialize.

Even when the proper structures are chosen, pushing the changes up the view hierarchy of the warehouse is non-trivial. The standard view maintenance expressions can be used in a number of ways leading to different update windows. We developed algorithms for pushing the changes up a view hierarchy that result in update windows that are 6 times shorter than the update windows of conventional methods.

Recovery of the Warehouse Load

Even with the above techniques, the process of loading data into the warehouse (whether it is for maintainance or creation) is very time consuming. Moreover, this process is not devoid of failures. Hence, it is pertinent to make the warehouse load resumable . The warehouse load can be segmented into two phases. The first phase extracts data from the remote sources, and cleanses the data. Then there is the phase that creates or maintains the warehouse view hierarchy. For the first phase, we worked with Sagent to develop resumption algorithms. Hence, instead of redoing the failed warehouse load, only the remaining "parts" of the load are performed. Our algorithms can lead to 90% reduction in the recovery time. We are currently working on resumption algorithms for the second phase of the load.

Reducing the Storage Requirement of the Warehouse

Another emerging problem of data warehouses is the explosive growth of the warehouse materialized views. For instance, telecommunications companies can collect hundreds of gigabytes of call data alone. By expiring data that is no longer needed, storage cost as well as maintenance cost can be greatly reduced. The following paper discusses how expiration is done in the context of warehouse materialized views.

The WHIPS System Prototype

Many of the algorithms described above are incorporated in Stanford's Data Warehousing project called WHIPS. While the prototype has a lot of functionality, we still need to improve on its scalability. We are also working with Sagent, whose Data Mart product is one of the leading commercial data warehouse creation products, to develop and incorporate resumption technology.

Currently Interested In ...

There are many problems in data warehousing that need very good solutions. Here are some of the problems that I am currently working on.

Querying Internet Sources

While data warehousing has been my main research focus, I am also intrigued in the integration of heterogenous sources. In particular, one of the problems I foresee is mediating and answering queries over multiple Internet sources (e.g., various Internet fill-out forms) with various restrictions and capabilities. In the following paper, we describe how to obtain capability-sensitive query plans over Internet sources.

VLSI CAD

Before I got infatuated with databases, I worked on VLSI partitioning.