Stanford University
Stanford, CA 94305
(650) 723-0685
(650) 725-2588
hector@cs.stanford.edu
http://www-db.stanford.edu/people/hector.html
Department of Computer Science
Stanford University
Stanford, CA 94305
(650) 723-7690
(650) 725-2588
widom@cs.stanford.edu
http://www-db.stanford.edu/~widom/
Department of Computer Science
Stanford University
Stanford, CA 94305
(650) 723-6045
(650) 725-2588
motwani@cs.stanford.edu
http://theory.stanford.edu/people/rajeev/
publish subscribe
security and privacy
data mining
Our project consists of 4 inter-related research thrusts:
(a) How to perform traditional database and data mining operations on streams;
(b) How to generate streams, and how to present streams to users; (c) How to
route streams to distributed users with differing information needs; and (d)
How to ensure the security and privacy of streams.
For additional details, please visit the project web site at http://www-db.stanford.edu/datamotion/.
• Reliable Stream Transmissions
In the DataMotion architecture, databases can be made available via streams. Interested clients submit queries or subscription requests, and the database server multicasts the data of interest over a stream. We have studied the design of such a reliable multicast facility over an unreliable multicast network. Our multicast facility has several interesting properties: different numbers of clients are interested in each data packet, allowing us to tune our strategy for each data transmission; data items recur, so missed items can be rescheduled for later transmission; and the server can adjust its schedule according to loss information. We exploit these properties to extend traditional reliability techniques to our setting, and use performance evaluation to highlight the resulting differences. We find that our reliability techniques can reduce the average client wait time by over thirty percent. For additional details, please see:
Lam, Wang; Garcia-Molina, Hector. Reliably Networking a Multicast Repository, 22nd Symposium on Reliable Distributed Systems (SRDS 2003), Florence, Italy, October 2003.
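The paper itself gives the scheduling details; as a rough illustration only, the following sketch shows one simple (hypothetical) loss-aware policy consistent with the properties above: because data items recur and the server knows which clients are still missing which items, each round it can rebroadcast the item wanted by the most waiting clients. The function name and data shapes here are our own, not from the paper.

```python
def schedule_transmissions(pending, rounds):
    """Greedy multicast scheduling sketch: each round, broadcast the
    recurring data item that the largest number of clients is still
    missing. 'pending' maps item -> set of client ids missing it.
    Assumes (optimistically) every waiting client receives a broadcast."""
    order = []
    for _ in range(rounds):
        live = {item: clients for item, clients in pending.items() if clients}
        if not live:
            break  # every client has everything
        item = max(live, key=lambda i: len(live[i]))
        order.append(item)
        pending[item] = set()  # all waiting clients now have the item
    return order
```

A real scheduler would also weigh per-item loss feedback rather than assume lossless delivery; this sketch only conveys the "tune per data item" idea.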
• The Price of Validity in Dynamic Networks
A DataMotion system may have data distributed across thousands of participant hosts. In a very dynamic environment, short-lived hosts may be the norm rather than the exception. In recent years, researchers have investigated best-effort algorithms to efficiently process aggregate queries (e.g., sum, count, average, minimum, and maximum) on such dynamic systems. Unfortunately, query semantics for best-effort algorithms are ill-defined, making it hard to reason about the guarantees associated with a returned result. Instead, we have specified a correctness condition, single-site validity, with respect to which the above algorithms are best-effort. We have developed a class of algorithms that guarantee validity in very dynamic systems. Experiments on real-life and synthetic network topologies validate the performance of our algorithms, revealing the hitherto unknown price of validity. For additional details, please see:
Mayank Bawa, Aristides Gionis, Hector Garcia-Molina, Rajeev Motwani. The Price of Validity in Dynamic Networks, ACM SIGMOD Conference, Maison de la Chimie, Paris, France, June 13-18, 2004.
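To fix intuition for what these aggregation algorithms compute in the failure-free base case, here is a minimal sketch of in-network SUM aggregation over a spanning tree of hosts: each host reports its own value plus the partial sums of its children. (The hard part the paper addresses, which this sketch deliberately omits, is what guarantee survives when hosts depart mid-query; the function and data shapes are our own illustration.)

```python
def tree_sum(children, values, root):
    """In-network SUM sketch: recursively combine each host's value with
    its children's partial sums up a spanning tree rooted at 'root'.
    'children' maps host -> list of child hosts; 'values' maps host -> value."""
    def aggregate(host):
        return values[host] + sum(aggregate(c) for c in children.get(host, []))
    return aggregate(root)
```

Single-site validity, informally, asks that the answer returned be one that some single correct host could have computed from the values it could actually reach, even as the tree changes underneath the query.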
• Exploring DataMotion Privacy
We have begun the security and privacy thrust of the DataMotion project by exploring current security and privacy schemes. In doing so, we evaluated two popular approaches: P3P and Hippocratic Databases. P3P is a set of standards that allows corporations to declare their privacy policies. Hippocratic Databases have been proposed to implement such policies within a corporation's datastore. From an end-user's point of view, both approaches rest on an uncomfortable philosophy of trusting corporations to protect his or her privacy. Recent history chronicles several episodes in which such trust has been willingly or accidentally violated by corporations facing bankruptcy courts, civil subpoenas, or lucrative mergers. We contend that data management solutions for information privacy must restore control to the individual's hands. We have started to explore alternative privacy mechanisms that enable such control, but these mechanisms will require a radical rethinking of the modeling, release/acquisition, and management of personal data. We have sketched some of these ideas in the following "vision paper":
Gagan Aggarwal, Mayank Bawa, Prasanna Ganesan, Hector Garcia-Molina, Krishnaram Kenthapadi, Nina Mishra, Rajeev Motwani, Utkarsh Srivastava, Dilys Thomas, Jennifer Widom, Ying Xu. Vision Paper: Enabling Privacy for the Paranoids, International Conference on Very Large Databases (VLDB), Toronto, Canada, August 29, 2004.
• Scale Free Aggregation in Sensor Networks
One particular type of DataMotion system is a sensor network, in which sensors generate streams of instrument readings. One of the goals of the system is to aggregate the readings. Frequently, data collected from "nearby" sensors is highly correlated. This correlation creates opportunities for data aggregation, which are crucial given the severe energy constraints of the sensors. Thus it is very desirable to take advantage of data correlations in order to avoid transmitting redundant data. We have formalized a notion of such correlation that varies according to a parameter k. We have developed a very simple randomized algorithm for routing information on a grid of sensors in a way that promotes data aggregation. We have shown that this simple scheme is a constant-factor approximation (in expectation) to the optimum aggregation tree simultaneously for all correlation parameters k. The key idea of our randomized analysis is to relate the expected collision time of random walks on the grid to scale-free aggregation. For additional details, please see:
M. Enachescu, A. Goel, R. Govindan, and R. Motwani. Scale Free Aggregation in Sensor Networks. First International Workshop on Algorithmic Aspects of Wireless Sensor Networks (ALGOSENSOR), Turku, Finland, July 2004.
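As a sketch of the flavor of randomized grid routing involved (our own simplified reading, not the paper's exact algorithm), a sensor at grid position (x, y) can route toward a sink at the origin by repeatedly taking one step in the x- or y-direction, chosen with probability proportional to the remaining distance in that dimension. Paths from nearby sensors then tend to collide early, and a collision point is where their correlated readings can be aggregated.

```python
import random

def route_to_sink(x, y, seed=0):
    """Randomized grid-routing sketch: walk from (x, y) to the sink at
    (0, 0), stepping in the x-direction with probability x/(x+y) and in
    the y-direction otherwise. Returns the full path, whose length is
    always x + y + 1 grid points."""
    rng = random.Random(seed)  # seeded for reproducibility
    path = [(x, y)]
    while (x, y) != (0, 0):
        if rng.random() < x / (x + y):
            x -= 1
        else:
            y -= 1
        path.append((x, y))
    return path
```

Note the boundary behavior falls out for free: once x reaches 0 the step probability x/(x+y) is 0, so the walk moves straight down the remaining axis.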
• Characterizing Memory Requirements for a DataMotion Node (DMN)
A DataMotion Node (DMN) processes queries against data streams. One of the fundamental questions for such query engines is the types of queries that can be supported with limited memory. To answer this question, we considered conjunctive queries with arithmetic comparisons and optional aggregation over multiple continuous data streams. We then specified an algorithm for determining whether or not any given query can be evaluated using a bounded amount of memory
for all possible instances of the data streams. When a query can be evaluated using bounded memory, we can produce an execution strategy based on constant-sized synopses of the data streams. When it cannot, we can produce a data stream scenario in which evaluating the query requires memory linear in the size of the streams. For additional details, please see:
Arasu, Arvind; Babcock, Brian; Babu, Shivnath; McAlister, Jon; Widom, Jennifer. Characterizing Memory Requirements for Queries over Continuous Data Streams, ACM Transactions on Database Systems, March, (2004).
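A concrete instance of the dichotomy above: an aggregation such as MAX over a single stream is evaluable with a constant-sized synopsis, whereas an equi-join of two arbitrary streams can be forced to retain state linear in the stream length. This toy example (our own, not from the paper) shows what a constant-sized synopsis looks like:

```python
def bounded_max(stream):
    """Bounded-memory evaluation sketch: MAX over a continuous stream
    needs only a one-value synopsis, independent of stream length. This
    is the kind of constant-sized run-time state the bounded-memory
    characterization certifies for a query."""
    synopsis = None  # the entire run-time state: one value
    for v in stream:
        if synopsis is None or v > synopsis:
            synopsis = v
    return synopsis
```

By contrast, to join streams S and T on equality over all time, every S-tuple may match a future T-tuple, so no fixed-size state suffices; that is the "memory linear in the size of the streams" scenario the algorithm produces.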
• Exploiting k-Constraints to Reduce Memory Overhead in a DataMotion Node (DMN)
Continuous queries executed at a DataMotion Node (DMN) often require significant run-time state over arbitrary data streams. However, streams may exhibit certain data or arrival patterns, or constraints, that can be detected and exploited to reduce state considerably without compromising correctness. Rather than requiring constraints to be satisfied precisely, which can be unrealistic in a DataMotion environment, we introduce k-constraints, where k is an adherence parameter specifying how closely a stream adheres to the constraint. (Smaller k's are closer to strict adherence and offer better memory reduction.) We have developed a query processing architecture, called k-Mon, that detects useful k-constraints automatically and exploits the constraints to reduce run-time state for a wide range of continuous queries. Our experimental results show dramatic state reduction, while only modest computational overhead is incurred for our constraint monitoring and query execution algorithms. For additional details, please see:
Babu, Shivnath; Srivastava, Utkarsh; Widom, Jennifer. Exploiting k-Constraints to Reduce Memory Overhead in Continuous Queries over Data Streams, to appear in ACM Transactions on Database Systems, Sept 2004.
• Adaptive Ordering of Pipelined Stream Filters
We consider the problem of pipelined filters, where a continuous stream of elements is processed in a DMN by a set of commutative filters. Pipelined filters are common in stream applications and capture a large class of multiway stream joins. We studied the problem of ordering the filters adaptively to minimize processing
cost in an environment where stream and filter characteristics vary unpredictably over time. Our core algorithm, A-Greedy (for Adaptive Greedy), has strong theoretical guarantees: If stream and filter characteristics were to stabilize, A-Greedy would converge to an ordering within a small constant factor of optimal. (In experiments A-Greedy usually converges to the optimal ordering.) One very important feature of A-Greedy is that it monitors and responds to selectivities that are correlated across filters (i.e., that are nonindependent), which provides the strong quality guarantee but incurs run-time overhead. We have identified and studied a three-way tradeoff among provable convergence to good orderings, run-time overhead, and speed of adaptivity. We have also developed a suite of variants of A-Greedy that lie at different points on this tradeoff spectrum. We have implemented all our algorithms and have conducted a thorough performance evaluation. For additional details, please see:
Babu, Shivnath; Motwani, Rajeev; Munagala, Kamesh; Nishizawa, Itaru; Widom, Jennifer. Adaptive Ordering of Pipelined Stream Filters, Proc. of ACM Intl. Conference on Management of Data (SIGMOD), 2004.
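The greedy invariant underlying A-Greedy can be sketched in static form (this is our simplification on a fixed sample with unit filter costs, not the adaptive algorithm itself): repeatedly pick the remaining filter that drops the largest fraction of tuples surviving the filters chosen so far. Because each choice is conditioned on the survivors of earlier filters, correlated (non-independent) selectivities are captured, which is exactly the property the paragraph above highlights.

```python
def greedy_filter_order(filters, sample):
    """Static greedy ordering sketch: order commutative filters by
    conditional selectivity measured on 'sample'. At each step, choose
    the remaining filter that the fewest current survivors pass (i.e.,
    the most selective filter given the prefix chosen so far)."""
    remaining = list(filters)
    order, survivors = [], list(sample)
    while remaining:
        best = min(remaining, key=lambda f: sum(1 for t in survivors if f(t)))
        order.append(best)
        remaining.remove(best)
        survivors = [t for t in survivors if best(t)]
    return order
```

A-Greedy additionally maintains this invariant online as stream and filter characteristics drift, and its variants trade the strength of the invariant against monitoring overhead and adaptivity speed.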
• Memory-Limited Execution of Windowed Stream Joins
A DataMotion node (DMN) can join two streams, often with a "sliding window" join operation where only recent (in the window) tuples are candidates for joining. We have studied the problem of computing approximate answers to continuous sliding-window joins over data streams when the available memory may be insufficient to keep the entire join state. One approximation scenario is to provide a maximum subset of the result, with the objective of losing as few result
tuples as possible. An alternative scenario is to provide a random sample of the join result, e.g., if the output of the join is being aggregated. We have shown formally that neither approximation can be addressed effectively for a sliding-window join of arbitrary input streams. Previous work has addressed only the maximum-subset problem, and has implicitly used a frequency-based model of stream arrival. We have addressed the sampling problem for this model. More importantly, we have identified a broad class of applications for which an age-based model of stream arrival is more appropriate, and we have studied both approximation scenarios under this new model. Finally, for the case of multiple joins being executed with an overall memory constraint, we have developed an algorithm for memory allocation across the joins that optimizes a combined measure of approximation in all scenarios considered. All of our algorithms are implemented and we have conducted experiments to demonstrate their effectiveness. For additional details, please see:
Srivastava, Utkarsh; Widom, Jennifer. Memory-Limited Execution of Windowed Stream Joins, Proc. of VLDB 2004, Sep. 2004.
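For readers unfamiliar with sliding-window joins, this minimal sketch (our own, with full window state retained) shows the operation whose state the paper approximates: each arriving tuple joins against the last w tuples of the other stream. The memory-limited question is which of those w tuples to keep when they do not all fit; here we simply keep them all, and tuples arriving at the same instant are not joined with each other.

```python
from collections import deque

def window_join(left, right, w):
    """Sliding-window equi-join sketch over two synchronized streams:
    each new tuple is matched (by equality) against the most recent w
    tuples of the opposite stream, then enters its own window."""
    left_win = deque(maxlen=w)   # join state for the left stream
    right_win = deque(maxlen=w)  # join state for the right stream
    out = []
    for l, r in zip(left, right):  # assume one arrival per stream per tick
        out.extend((l, x) for x in right_win if x == l)
        out.extend((x, r) for x in left_win if x == r)
        left_win.append(l)
        right_win.append(r)
    return out
```

Under a memory limit smaller than w per window, the max-subset scenario would evict the tuples least likely to produce future matches, while the sampling scenario would evict so that surviving results remain a random sample of `out`.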
• Our nation's security and well-being depend on prompt reaction to threats and emergencies. A data stream system will make it possible for decision makers to receive the information they need in real time.
• Streams arise in many other domains (tracking physical packages, monitoring computer usage, sensor networks, etc.), and businesses and scientists will also benefit from an infrastructure that can give them timely information.
• In building DataMotion, we are educating and training new scientists with critical information-processing skills for the nation. Their expertise (as well as some of the technology we will develop) will be of use well beyond data streams, in any domain where information must be handled and analyzed effectively.