IDM Workshop Template

DataMotion ― Dealing with Fast-Moving Data

Project Award Number IIS-0324431

Principal Investigator

Hector Garcia-Molina
Department of Computer Science

Stanford University

Stanford, CA 94305

(650) 723-0685

(650) 725-2588

hector@cs.stanford.edu

http://www-db.stanford.edu/people/hector.html

Co-PI

Jennifer Widom

Department of Computer Science

Stanford University

Stanford, CA 94305

(650) 723-7690

(650) 725-2588

widom@cs.stanford.edu

http://www-db.stanford.edu/~widom/

Co-PI

Rajeev Motwani

Department of Computer Science

Stanford University

Stanford, CA 94305

(650) 723-6045

(650) 725-2588

motwani@cs.stanford.edu

http://theory.stanford.edu/people/rajeev/

Keywords

data streams

publish subscribe

security and privacy

data mining

Project Summary

The DataMotion Project is building a new infrastructure for managing and analyzing large volumes of dynamic and diverse data. Today, most information systems ― even those handling multiple, distributed, heterogeneous data sources ― are based on stored and relatively static data sets. However, much of the most critical information today is highly dynamic, coming instead in the form of multiple, rapid, time-varying data streams. The volume is so high that it is difficult to store all the information in conventional databases. A much better approach is to route the stream to the interested users, while in the process filtering according to user's interests, combining the stream with other relevant streams, and performing real-time analysis of the data whenever possible. DataMotion enables such distributed, real-time processing.

Our project consists of 4 inter-related research thrusts: (a) How to perform traditional database and data mining operations on streams; (b) How to generate streams, and how to present streams to users; (c) How to route streams to distributed users with differing information needs; and (d) How to ensure the security and privacy of streams.

Publications and Products

The DataMotion Project started in October 2003. In this initial year, we started exploring a variety of topics in our four thrust areas, Input/Output of streams, DataMotion Node (DMN), Optimization, and Security & Privacy. In what follows we provide some highlights of our research. For a complete list of publications

and additional details, please visit the project web site at http://www-db.stanford.edu/datamotion/.

7 Reliable Stream Transmissions

In the DataMotion architecture, databases can be made available via a streams. Interested clients submit queries or subscription requests, and the database sever multicasts the data of interest over a stream. We have studied the design of such a reliable multicast facility over an unreliable multicast network. Our multicast facility has several interesting properties: it has different numbers of clients interested in each data packet, allowing us to tune our strategy for each data transmission; has recurring data items, so that missed data items can be rescheduled for later transmission; and allows the server to adjust the scheduler according to loss information. We exploit the properties of our system to extend traditional reliability techniques for our case, and use performance evaluation to highlight the resulting differences. We find that our reliability techniques can reduce the average client wait time by over thirty percent. For additional details, please see:

Lam, Wang; Garcia-Molina, Hector. Reliably Networking a Multicast Repository, 22nd Symposium on Reliable Distributed Systems (SRDS 2003, Florence, Italy, October 2003.

7 The Price of Validity in Dynamic Networks

A DataMotion system may have data distributed across thousands of participant hosts. In a very dynamic environment, short-lived hosts may be the norm rather than an exception. In recent years, researchers have investigated best-effort algorithms to efficiently process aggregate queries (e.g., sum, count, average, minimum and maximum) on such dynamic systems. Unfortunately, query semantics for best-effort algorithms are ill-defined, making it hard to reason about guarantees associated with the result returned. Instead, we have specified a correctness condition, single-site validity, with respect to which the above algorithms are best-effort. We have developed a class of algorithms that guarantee validity in very dynamic systems. Experiments on real-life and synthetic network topologies validate performance of our algorithms, revealing the hitherto unknown price of validity. For additional details, please see:

Mayank Bawa, Aristides Gionis, Hector Garcia-Molina, Rajeev Motwani. The Price of Validity in Dynamic Networks, ACM SIGMOD Conference, Maison de la Chimie, Paris, France, June 13-18, 2004.

7 Exploring DataMotion Privacy

We have started our security and privacy thrust for the DataMotion project by exploring current security and privacy schemes. In doing so, we evaluated two popular approaches: P3P and hippocratic databases. P3P is a set of standards that allow corporations to declare their privacy policies. Hippocratic Databases have been proposed to implement such policies within a corporation's datastore. From an end-user individual's point of view, both of these rest on an uncomfortable philosophy of trusting corporations to protect his/her privacy. Recent history chronicles several episodes when such trust has been willingly or accidentally violated by corporations facing bankruptcy courts, civil subpoenas or lucrative mergers. We contend that data management solutions for information privacy must restore

controls in the individual's hands. We have started to explore alternative privacy mechanisms that enable such control, but these mechanisms will require a radical re-think on modeling, release/acquisition, and management of personal data. We have sketched some of these ideas in the following "vision paper":

Gagan Aggarwal, Mayank Bawa, Prasanna Ganesan, Hector Garcia-Molina, Krishnaram Kenthapadi, Nina Mishra, Rajeev Motwani, Utkarsh Srivastava, Dilys Thomas, Jennifer Widom, Ying Xu. Vision Paper: Enabling Privacy for the Paranoids, International Conference on Very Large Databases (VLDB), Toronto, Canada, August 29, 2004.

7 Scale Free Aggregation in Sensor Networks

One particular type of DataMotion system is a sensor network, where sensors generate streams of instrument readings. One of the goals of the system is to aggregate the readings. Frequently, data collected from "nearby" sensors has a high degree of correlation. This induces opportunities for data aggregation, that are crucial given the severe energy constraints of the sensors. Thus it is very desirable to take advantage of data correlations in order to avoid transmitting redundancy. We have formalized a notion of such correlation, that can vary according to a parameter k. We have developed a very simple randomized algorithm for routing information on a grid of sensors in a way which promotes data aggregation. We have shown that this simple scheme is a constant factor approximation (in expectation) to the optimum aggregation tree simultaneously for all correlation parameters k. The key idea of our randomized analysis is to relate the expected collision time of random walks on the grid to scale free aggregation. For additional details, please see:

M. Enachescu, A. Goel, R. Govindan, and R. Motwani. Scale Free Aggregation in Sensor Networks. First International Workshop on Algorithmic Aspects of Wireless Sensor Networks (ALGOSENSOR), Turku, Finland, July 2004.

7 Characterizing Memory Requirements for DataMotion Node (DMN).

A DataMotion Node (DMN) processes queries against data streams. One of the fundamental questions for such query engines is the types of queries that can be supported with limited memory. To answer this question, we considered conjunctive queries with arithmetic comparisons and optional aggregation over multiple continuous data streams. We then specified an algorithm for determining whether or not any given query can be evaluated using a bounded amount of memory

for all possible instances of the data streams. When a query can be evaluated using bounded memory, we can produce an execution strategy based on constant-sized synopses of the data streams. When it cannot, we can produce a data stream scenario in which evaluating the query requires memory linear in the size of the streams. For additional details, please see:

Arasu, Arvind; Babcock, Brian; Babu, Shivnath; McAlister, Jon; Widom, Jennifer. Characterizing Memory Requirements for Queries over Continuous Data Streams, ACM Transactions on Database Systems, March, (2004).

7 Exploiting k-Constraints to Reduce Memory Overhead in a DataMotion Node (DMN)

Continuous queries executed at a DataMotion Node (DMN) often require significant run-time state over arbitrary data streams. However, streams may exhibit certain data or arrival patterns, or constraints, that can be detected and exploited to reduce state considerably without compromising correctness. Rather than requiring constraints to be satisfied precisely, which can be unrealistic in a DataMotion environment, we introduce k-constraints, where k is an adherence parameter specifying how closely a stream adheres to the constraint. (Smaller k's are closer to strict adherence and offer better memory reduction.) We have developed a query processing architecture, called k-Mon, that detects useful k-constraints automatically and exploits the constraints to reduce run-time state for a wide range of continuous queries. Our experimental results show dramatic state reduction, while only modest computational overhead is incurred for our constraint monitoring and query execution algorithms. For additional details, please see:

Babu, Shivnath; Srivastava, Utkarsh; Widom, Jennifer. Exploiting k-Constraints to Reduce Memory Overhead in Continuous Queries over Data Streams, to appear in ACM Transactions on Database Systems, Sept 2004.

7 Adaptive Ordering of Pipelined Stream Filters

We consider the problem of pipelined filters, where a continuous stream of elements is processed in a DMN by a set of commutative filters. Pipelined filters are common in stream applications and capture a large class of multiway stream joins. We studied the problem of ordering the filters adaptively to minimize processing

cost in an environment where stream and filter characteristics vary unpredictably over time. Our core algorithm, A-Greedy (for Adaptive Greedy), has strong theoretical guarantees: If stream and filter characteristics were to stabilize, A-Greedy would converge to an ordering within a small constant factor of optimal. (In experiments A-Greedy usually converges to the optimal ordering.) One very important feature of A-Greedy is that it monitors and responds to selectivities that are correlated across filters (i.e., that are nonindependent), which provides the strong quality guarantee but incurs run-time overhead. We have identified and studied a three-way tradeoff among provable convergence to good orderings, run-time overhead, and speed of adaptivity. We have also developed a suite of variants of A-Greedy that lie at different points on this tradeoff spectrum. We have implemented all our algorithms and have conducted a thorough performance evaluation. For additional details, please see:

Babu, Shivnath; Motwani, Rajeev; Munagala, Kamesh; Nishizawa, Itaru; Widom, Jennifer. Adaptive Ordering of Pipelined Stream Filters, Proc. of ACM Intl. Conference on Management of Data (SIGMOD), 2004.

7 Memory-Limited Execution of Windowed Stream Joins

A DataMotion node (DMN) can join two streams, often with a "sliding window" join operation where only recent (in the window) tuples are candidates for joining. We have studied the problem of computing approximate answers to continuous sliding-window joins over data streams when the available memory may be insufficient to keep the entire join state. One approximation scenario is to provide a maximum subset of the result, with the objective of losing as few result

tuples as possible. An alternative scenario is to provide a random sample of the join result, e.g., if the output of the join is being aggregated. We have shown formally that neither approximation can be addressed effectively for a sliding-window join of arbitrary input streams. Previous work has addressed only the maximum-subset problem, and has implicitly used a frequency-based model of stream arrival. We have addressed the sampling problem for this model. More importantly, we have identified a broad class of applications for which an age-based model of stream arrival is more appropriate, and we have studied both approximation scenarios under this new model. Finally, for the case of multiple joins being executed with an overall memory constraint, we have developed an algorithm for memory allocation across the joins that optimizes a combined measure of approximation in all scenarios considered. All of our algorithms are implemented and we have conducted experiments to demonstrate their effectiveness. For additional details, please see:

Srivastava, Utkarsh; Widom, Jennifer. Memory-Limited Execution of Windowed Stream Joins, Proc. of VLDB 2004, Sep. 2004.

Project Impact

While the challenges we are addressing are difficult and open-ended, the potential impact of this research is high:

7 Our nation's security and well being depends on prompt reaction to threats and emergencies. A DataStream system will make it possible for decision makers to receive the information they need real-time.

7 Streams arise in many other domains (tracking physical packages, monitoring computer usage, sensor networks, etc.) and businesses and scientists will also benefit from an infrastructure that can give them timely information.

7 In building DataMotion, we are educating and training new scientists, with critical information-processing skills for the nation. Their expertise (as well as some of the technology we will develop) will be of use well beyond DataStreams, in any domain where information must be handled and analyzed effectively.

Project Websites

DataMotion Project Page