Impact:
-------
YEAR 2 UPDATE
-------------
Broader Impact:
---------------
We conducted two technology transfer workshops.
As mentioned earlier, such events help disseminate
our results while also teaching us about relevant
problems in industry.
(1) We organized a workshop with IBM Almaden (January 26, 2005),
focusing on the topic of privacy and security (one of the four
DataMotion thrusts).
(2) We organized a workshop at Google (April 28, 2005),
where we reported ongoing work in our group.
Research:
---------
During our second year, we have continued our DataMotion research,
in the four thrust areas. In what follows we describe
our results.
** Deciding Placement of Operators in a DataMotion Network
In sensor networks, data acquisition frequently takes place at
low-capability devices. The acquired data is then transmitted through
a hierarchy of nodes having progressively increasing network bandwidth
and computational power. We have considered the problem of executing queries
over these data streams, posed at the root of the hierarchy. To
minimize data transmission, it is desirable to perform "in-network"
query processing: do some part of the work at intermediate nodes as
the data travels to the root. Most previous work on in-network query
processing has focused on aggregation and inexpensive filters. In our
work, we addressed in-network processing for queries involving possibly
expensive conjunctive filters and joins. We considered the problem of
placing operators along the nodes of the hierarchy so that the overall
cost of computation and data transmission is minimized. We showed that
the problem is tractable by giving an optimal algorithm, and demonstrated
that a simpler greedy operator-placement algorithm can fail to find
the optimal solution. Finally, we defined a number of interesting
variations of the basic operator placement problem and demonstrated
their hardness.
Srivastava, Utkarsh; Munagala, Kamesh; Widom, Jennifer. Operator
Placement for In-Network Stream Query Processing. In Proc. of PODS 2005,
June 2005.
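As a rough illustration of the cost model (not the algorithm from the paper, which finds the optimum in polynomial time), the sketch below assigns an ordered sequence of filters to nodes of a chain from leaf to root. Each filter has a hypothetical per-tuple cost and selectivity, each node a computation-cost scale (low-capability leaves are expensive), and each link a per-tuple transmission cost; placement is exhaustive for clarity only.

```python
from itertools import combinations_with_replacement

def plan_cost(placement, filters, scale, trans, rate=1.0):
    """Cost of running `filters` (cost, selectivity) at the nodes given by
    `placement`, a non-decreasing tuple of node indices (leaf = 0)."""
    cost, r, f = 0.0, rate, 0
    for node in range(len(scale)):
        # apply every filter assigned to this node
        while f < len(filters) and placement[f] == node:
            c, s = filters[f]
            cost += c * scale[node] * r   # computation cost at this node
            r *= s                        # surviving fraction of tuples
            f += 1
        if node < len(scale) - 1:
            cost += trans[node] * r       # ship survivors up one link
    return cost

def optimal_placement(filters, scale, trans):
    """Exhaustive search over non-decreasing placements (illustration only)."""
    best = None
    for p in combinations_with_replacement(range(len(scale)), len(filters)):
        c = plan_cost(p, filters, scale, trans)
        if best is None or c < best[1]:
            best = (p, c)
    return best
```

With an expensive leaf (scale 4.0) and a cheap root (scale 1.0), the search pushes even a selective filter toward the root, which is the kind of trade-off the placement problem captures.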
** Adaptive Caching for Continuous Queries
We have addressed the problem of executing continuous multiway join
queries in unpredictable and volatile environments. Our query class
captures windowed join queries in data stream systems as well as
conventional maintenance of materialized join views. Our adaptive
approach handles streams of updates whose rates and data
characteristics may change over time, as well as changes in system
conditions such as memory availability. We focused specifically on
the problem of adaptive placement and removal of caches to optimize
join performance. Our approach automatically considers conventional
tree-shaped join plans with materialized subresults at every
intermediate node, subresult-free MJoins, and the entire spectrum
between them. We have developed a family of algorithms for selecting
candidate caches, monitoring their cost and benefits in current
conditions, allocating memory to caches, and adapting as conditions
change. All of our algorithms were implemented in the STREAM prototype
Data Stream Management System and a thorough experimental evaluation
was performed.
Babu, Shivnath; Munagala, Kamesh; Widom, Jennifer; Motwani, Rajeev.
Adaptive Caching for Continuous Queries. In Proc. of ICDE 2005,
April 2005.
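The bookkeeping behind adaptive cache placement can be sketched as follows. This is an illustrative toy, not the StreaMon/STREAM implementation: a cache for one join subresult records the recomputation work its hits have saved, so an adaptive executor can compare benefit per unit of memory across candidate caches and drop the weak ones as conditions change. The class and cost model are assumptions for the sketch.

```python
class SubresultCache:
    """Hypothetical cache for one intermediate join subresult."""
    def __init__(self, probe_cost=1.0):
        self.store = {}
        self.hits = self.misses = 0
        self.probe_cost = probe_cost  # assumed cost of recomputing one probe

    def probe(self, key, recompute):
        """Answer a probe from the cache, recomputing (and caching) on a miss."""
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        value = recompute(key)
        self.store[key] = value
        return value

    def benefit_per_entry(self):
        """Recomputation work saved so far, per unit of memory used."""
        if not self.store:
            return 0.0
        return self.hits * self.probe_cost / len(self.store)
```

An adaptive policy would periodically rank candidate caches by `benefit_per_entry()` and keep only those that fit the current memory budget.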
** The Pipelined Set Cover Problem
The pipelined filters problem in databases and data stream management
systems is the problem of finding the optimal ordering of a set of
dependent selection or join operators in a query processor. This
problem can be abstracted as a generalization of the classical set
cover problem. In this generalization, called pipelined set cover,
the sets are applied sequentially to the universe and the covered
elements are discarded. The cost of including a set in pipelined set
cover is directly proportional to the number of uncovered elements at
the point of applying the set. The cost therefore depends on the
ordering of the sets, in addition to the sets themselves. We have
shown that several natural heuristics for this NP-hard problem (such
as greedy set cover and local search) can be analyzed using a common
linear-programming framework. This approach bounds the approximation
ratio as well as the running time of the corresponding algorithms. In
particular, we showed that the greedy and local search algorithms are
4-approximations for both uniform and nonuniform processing costs,
which is in contrast to the logarithmic performance guarantees
achievable for classical set cover. We have extended our analysis to
minimize the Lp-norm of the costs paid by the sets, where p >= 2 is an
integer, to examine the improvement in performance when the total cost
has increasing contribution from the initial sets in the
pipeline. Using a novel Lagrangian-relaxation analysis, we showed that
the greedy algorithm is a 9^(1/p)-approximation for the uniform
processing cost model, while the local search algorithm is a
4^(1/p)-approximation when the costs are nonuniform. We also
addressed the online version of pipelined set cover and developed a
competitive algorithm with a logarithmic performance guarantee.
Munagala, Kamesh; Babu, Shivnath; Motwani, Rajeev; Widom, Jennifer.
The Pipelined Set Cover Problem. In Proc. of ICDT 2005, Jan. 2005.
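The greedy heuristic analyzed above can be sketched in a few lines: repeatedly apply the set that covers the most still-uncovered elements per unit processing cost, paying for every element still uncovered when the set runs. The example sets and uniform costs are hypothetical; this is the heuristic, not the LP analysis.

```python
def greedy_pipelined_order(sets, universe, cost=lambda s: 1.0):
    """Greedy ordering for pipelined set cover (sketch)."""
    remaining = set(universe)
    unused = list(sets)
    order, total_paid = [], 0.0
    while remaining and unused:
        # pick the set covering the most uncovered elements per unit cost
        best = max(unused, key=lambda s: len(remaining & set(s)) / cost(s))
        # the set's cost scales with the elements still "in the pipeline"
        total_paid += cost(best) * len(remaining)
        remaining -= set(best)
        unused.remove(best)
        order.append(best)
    return order, total_paid
```

Because the paid cost depends on how many elements survive to each stage, reordering the same sets changes the total, which is exactly what distinguishes pipelined set cover from classical set cover.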
** Survey on Adaptive Query Processing
A great deal of work on adaptive query processing has been done over
the last few years: Adaptive query processing has been used to detect
and correct optimizer errors due to incorrect statistics or simplified
cost metrics; it has been applied to long-running continuous queries
over data streams whose characteristics change over time; and
routing-based adaptive query processing does away with the optimizer
altogether. Despite this large body of interrelated work, no unifying
comparison of adaptive query processing techniques or systems has been
attempted; we have tackled this problem. Our studies have identified
three families of systems (plan-based, CQ-based, and routing-based),
and we have compared them in detail with respect to the most important
aspects of adaptive query processing: plan quality, statistics
monitoring and re-optimization, plan migration, and scalability. We
have proposed two new approaches to adaptive query processing that
address some of the shortcomings revealed by our in-depth analysis:
(1) "Proactive re-optimization," in which the optimizer chooses
initial query plans with the expectation of re-optimization; and (2)
"Plan logging," in which optimizer decisions under different
conditions are logged over time, enabling plan re-use as well as
analysis of relevant statistics and benefits of adaptivity.
Babu, Shivnath; Bizarro, Pedro. Adaptive Query Processing in the
Looking Glass. In Proc. of the Second Biennial Conference on
Innovative Data Systems Research (CIDR), Jan. 2005
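The plan-logging idea can be illustrated with a small sketch (hypothetical, not an implemented system component): optimizer decisions are keyed by bucketed statistics, so that when similar conditions recur the logged plan is reused instead of re-optimizing. The bucketing scheme and interface are assumptions for the sketch.

```python
class PlanLog:
    """Hypothetical plan log: maps bucketed statistics to a chosen plan."""
    def __init__(self, bucket=0.1):
        self.bucket = bucket
        self.log = {}   # bucketed selectivity vector -> plan

    def _key(self, stats):
        # coarse bucketing so nearby statistics share an entry
        return tuple(round(s / self.bucket) for s in stats)

    def lookup(self, stats):
        """Return a previously logged plan for similar statistics, if any."""
        return self.log.get(self._key(stats))

    def record(self, stats, plan):
        self.log[self._key(stats)] = plan
```

The accumulated log also supports the offline analysis mentioned above: one can replay it to see how often conditions changed enough to warrant a different plan.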
** Approximate Counts and Quantiles over Sliding Windows
We considered the problem of maintaining two important statistics,
approximate frequency counts and quantiles, over sliding windows on
data streams using limited memory. Frequency counts are used to compute
frequent elements in a dataset and quantiles to understand the
distribution of values. For many practical stream rates and window
sizes, it is infeasible to store the entire contents of the current
window, so computing approximate statistics using limited memory is
an important problem.
We developed various deterministic and randomized algorithms for this
problem. All of our algorithms require space that is logarithmic or
less in the window size. Further, the algorithms admit an error
parameter that can be used to control the approximation. The statistics
maintained by the algorithms can be made more accurate by using a
smaller error parameter, but this increases the space required by the
algorithm.
Arasu, Arvind; Manku, Gurmeet Singh. Approximate Counts and Quantiles
over Sliding Windows. In Proc. of PODS 2004, June 2004.
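The accuracy/space trade-off controlled by the error parameter can be illustrated with a much-simplified basic-window scheme (an assumption-laden sketch: it keeps full per-sub-window counts and is therefore not space-logarithmic like our actual algorithms). The window is cut into sub-windows of roughly eps * W items; whole sub-windows expire at once, so counts are off by at most one sub-window's contents.

```python
from collections import Counter, deque

class WindowedCounts:
    """Simplified approximate frequency counts over a count-based window."""
    def __init__(self, window, eps):
        self.sub = max(1, int(window * eps))   # items per basic window
        self.max_subs = window // self.sub
        self.buckets = deque([Counter()])      # oldest ... current
        self.in_current = 0

    def add(self, item):
        self.buckets[-1][item] += 1
        self.in_current += 1
        if self.in_current == self.sub:        # close the basic window
            self.buckets.append(Counter())
            self.in_current = 0
            if len(self.buckets) > self.max_subs + 1:
                self.buckets.popleft()         # expire the oldest one

    def estimate(self, item):
        return sum(b[item] for b in self.buckets)
```

Shrinking eps shrinks the sub-windows and hence the error, at the price of keeping more of them, which mirrors the space/accuracy knob described above.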
** Resource Sharing in Continuous Sliding Window Aggregates
We considered the problem of resource sharing when processing large
numbers of continuous queries. We specifically addressed sliding-window
aggregates over data streams, an important class of continuous
operators for which sharing has not been addressed. We developed a
suite of sharing techniques that cover a wide range of possible
scenarios: different classes of aggregation functions (algebraic,
distributive, holistic), different window types (time-based,
tuple-based, suffix, historical), and different input models (single
stream, multiple substreams). We developed precise theoretical
performance guarantees for our techniques, and showed their practical
effectiveness through a thorough experimental study.
Arasu, Arvind; Widom, Jennifer. Resource Sharing in Continuous Sliding
Window Aggregates. In Proc. of VLDB 2004, Sept. 2004.
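One form the sharing can take is illustrated below with shared per-pane partial sums (a deliberate simplification of our techniques; pane sizes and window lengths are assumed to align). The stream is cut into panes, each pane is aggregated once, and every query's window result is assembled from the shared pane aggregates rather than rescanning the data per query.

```python
from collections import deque

def shared_window_sums(stream, pane, window_panes_list):
    """Compute sliding-window sums for several window lengths (in panes),
    sharing one per-pane partial sum across all of them."""
    panes = deque()
    pane_buf = []
    results = {w: [] for w in window_panes_list}
    for x in stream:
        pane_buf.append(x)
        if len(pane_buf) == pane:
            panes.append(sum(pane_buf))     # one shared partial aggregate
            pane_buf = []
            for w in window_panes_list:     # each query reuses the pane sums
                if len(panes) >= w:
                    results[w].append(sum(list(panes)[-w:]))
    return results
```

This works directly for distributive and algebraic aggregates like SUM; holistic aggregates such as quantiles need the more specialized techniques developed in the paper.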
** Caching Queues in Memory Buffers
Motivated by the need for maintaining multiple, large queues of data in
modern high-performance systems, we study the problem of caching queues
in memory under the following simple, but widely applicable, model.
At each clock-tick, any number of data items may enter
the various queues, while data-items are consumed from the heads of the
queues. Since the number of unconsumed items may exceed memory buffer size,
some items in the queues need to be spilled to secondary storage and later
moved back into memory for consumption. We provide online queue-caching
algorithms under a number of interesting cost models.
Motwani, R.; Thomas, D. Caching Queues in Memory Buffers. In Proc. of
SODA 2004.
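A toy version of the model helps fix intuition (spill-from-the-longest-queue is one plausible heuristic for the sketch, not necessarily a policy from the paper). Queue heads stay in memory for consumption; when the buffer overflows, tail items are spilled to "disk" and faulted back when the head reaches them, with FIFO order preserved.

```python
from collections import deque

class QueueCache:
    """Multiple FIFO queues sharing one memory buffer of size `budget`."""
    def __init__(self, budget):
        self.budget = budget
        self.mem = {}    # queue id -> in-memory items (queue front)
        self.disk = {}   # queue id -> spilled items (queue back)
        self.io = 0      # disk transfers, the cost being minimized

    def enqueue(self, q, item):
        self.mem.setdefault(q, deque())
        self.disk.setdefault(q, deque())
        if self.disk[q]:
            self.disk[q].append(item)  # must go behind spilled items
            self.io += 1
        else:
            self.mem[q].append(item)
            while sum(len(m) for m in self.mem.values()) > self.budget:
                # heuristic: spill from the tail of the longest queue
                victim = max(self.mem, key=lambda k: len(self.mem[k]))
                self.disk[victim].appendleft(self.mem[victim].pop())
                self.io += 1

    def dequeue(self, q):
        if not self.mem[q] and self.disk[q]:
            self.mem[q].append(self.disk[q].popleft())  # fault back
            self.io += 1
        return self.mem[q].popleft()
```

The interesting algorithmic question, studied in the paper under several cost models, is which items to spill so that total I/O stays competitive with an offline optimum.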
** Operator Scheduling in Data Stream Systems
In many applications involving continuous data streams, data arrival
is bursty and data rate fluctuates over time. Systems that seek to
give rapid or real-time query responses in such an environment must
be prepared to deal gracefully with bursts in data arrival without
compromising system performance. We studied one strategy for
processing bursty streams: adaptive, load-aware scheduling of query
operators to minimize resource consumption during times of peak load.
We showed that the choice of an operator scheduling strategy can have
significant impact on the runtime system memory usage as well as
output latency. Our aim was to design a scheduling strategy that
minimizes the maximum runtime system memory while maintaining the
output latency within prespecified bounds. We first developed Chain
scheduling, an operator scheduling strategy for data stream systems
that is near-optimal in minimizing runtime memory usage for any
collection of single-stream queries involving selections,
projections, and foreign-key joins with stored relations. Chain
scheduling also performs well for queries with sliding-window joins
over multiple streams and multiple queries of the above types.
However, during bursts in input streams, when there is a buildup of
unprocessed tuples, Chain scheduling may lead to high output latency.
We studied the online problem of minimizing maximum runtime memory,
subject to a constraint on maximum latency. We have derived
preliminary observations, negative results, and heuristics for this
problem. A thorough experimental evaluation has been conducted where
we demonstrate the potential benefits of Chain scheduling and its
different variants, compare it with competing scheduling strategies,
and validate our analytical conclusions.
Babcock, B.; Babu, S.; Datar, M.; Motwani, R.; Thomas, D.
Operator Scheduling in Data Stream Systems. VLDB Journal, 2004.
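The greedy priority rule behind Chain can be sketched as follows. This is a simplification (the full Chain algorithm groups operators into chains along the lower envelope of the progress chart, and real queues hold tuples, not fractional counts); selectivities and costs here are hypothetical. Each scheduling step runs the operator that sheds the most memory per unit time among those with queued input.

```python
def chain_step(ops, queues):
    """One scheduling decision of a simplified Chain-style scheduler.
    ops: list of (selectivity, per-tuple cost) per pipeline operator.
    queues: tuples waiting before each operator (mutated and returned)."""
    runnable = [i for i, q in enumerate(queues) if q >= 1]
    if not runnable:
        return queues
    # steepest slope: memory freed per unit of processing time
    i = max(runnable, key=lambda j: (1 - ops[j][0]) / ops[j][1])
    sel = ops[i][0]
    queues[i] -= 1
    if i + 1 < len(queues):
        queues[i + 1] += sel     # selective operators shrink the data
    return queues
```

During a burst, this rule drains the queues feeding the most selective cheap operators first, which is why it keeps peak memory low; the latency issue discussed above arises because tuples parked behind less urgent operators can wait a long time.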
** A Distributed Architecture for Secure Database Services
Recent trends towards outsourcing have led to interest in the
database-as-a-service model, where an organization uses an external
service provider to store and manage its data. However, concerns,
legal and otherwise, over data privacy have led to the requirement
that the service provider not be able to obtain any private
information despite storing all of the organization's data. The
traditional approach to building such a "secure" database service has
been to make the organization encrypt its data before storing it in
the untrusted external database. Queries to the database are executed
over the encrypted data, and the results are brought over to the
"client" side where they are decrypted. The major difficulty with
this approach is that it is very hard to efficiently execute queries
over encrypted data and, often, one is left with no alternative but
to transport a large fraction of the database back to the client,
decrypt it on the client side, and then execute the query against
this decrypted database. Such an approach is clearly very expensive
and eliminates the potential advantages of data outsourcing.
We developed a new, distributed architecture that allows an
organization to outsource its data management to two untrusted but
non-colluding servers while preserving data privacy. The presence of
two servers enables efficient partitioning of data so that the
contents at any one server are guaranteed not to breach data privacy.
For example, it may be a breach of privacy to reveal the name of any
customer together with his/her social security number (SSN). However,
if we were to store customer names at one site, and SSNs at another,
neither site is capable of breaching privacy merely from the data
available to it. Moreover, since both the names and the SSNs are
stored in the clear without any encryption, it is possible to
efficiently execute queries using distributed query processing
techniques. We studied the different challenges involved in enabling
such a distributed architecture for secure database services,
including problems of distributed-schema design, query processing,
optimization and execution.
Aggarwal, Bawa, Ganesan, Garcia-Molina, Kenthapadi, Motwani,
Srivastava, Thomas, and Xu, Two Can Keep a Secret: A Distributed
Architecture for Secure Database Services. Proceedings of the Second
Biennial Conference on Innovative Data Systems Research (CIDR), pages
186--199, Asilomar, CA, January 2005.
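The names-and-SSNs example above can be made concrete with a minimal sketch (column names and the synthetic tuple-id scheme are illustrative, not the paper's schema-design machinery). Each tuple's attributes are split across two fragments linked only by a tuple id, so neither server alone holds the privacy-breaching combination, yet a client can rejoin the cleartext fragments with an ordinary distributed join.

```python
from itertools import count

def vertical_partition(rows, cols1, cols2):
    """Split rows into two fragments, one per untrusted server."""
    tid = count()
    frag1, frag2 = [], []
    for row in rows:
        t = next(tid)   # synthetic join key; reveals nothing by itself
        frag1.append({"tid": t, **{c: row[c] for c in cols1}})
        frag2.append({"tid": t, **{c: row[c] for c in cols2}})
    return frag1, frag2

def rejoin(frag1, frag2):
    """Client-side distributed join of the two fragments on tid."""
    by_tid = {r["tid"]: r for r in frag2}
    return [{**r, **by_tid[r["tid"]]} for r in frag1]
```

Because both fragments are stored in the clear, selections and joins can run at the servers with standard distributed query processing, avoiding the ship-and-decrypt cost of the encryption-based approach.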
--------------------------------------
Training:
---------
YEAR 2 UPDATE
-------------
Two students completed their PhD theses in the DataMotion
area:
** Adaptive Processing in Data Stream Management Systems
Arvind Arasu, PhD Thesis
Many modern applications deal with data that arrives continuously and
must be processed in real-time: network performance monitoring,
financial analysis over stock tickers, sensor data processing, and
others. Data Stream Management Systems (DSMSs) have recently been
developed for this class of applications. DSMSs support continuous
long-running queries, so they face a fundamental challenge: stream data
and arrival characteristics, as well as system conditions, may vary
over time. To address this problem, we developed adaptive approaches
to DSMS query processing and optimization. We provide algorithms,
performance metrics, guarantees, and empirical results for two
important classes of continuous queries: pipelined stream filters and
multiway stream joins. Our techniques have been implemented in
StreaMon, the adaptive query processing engine we developed for
Stanford's STREAM prototype DSMS. We also outline general research
directions for adaptive processing as a key component of future data
management systems.
** Continuous Queries over Data Streams
Shivnath Babu, PhD Thesis
Continuous queries (CQs) represent a new paradigm for interacting
with dynamically-changing data. Unlike traditional one-time queries,
a CQ is registered with a data management system and provides
continuous results as data and updates stream into the system.
Applications include tracking real-time trends in stock market data,
monitoring the health of a computer network, or online processing of
sensor data.
As part of this thesis, we have addressed several challenges in
building a system for processing declaratively-specified continuous
queries. We modified and extended a traditional database query
language for the data-stream and CQ context. We implemented this
language in STREAM, a complete prototype for data stream management.
Because CQs are long-running, we studied their memory requirements,
providing a precise characterization of memory required by any query
in our language. We also developed techniques for sharing resources
such as computation and state across multiple CQs, enabling
scalability to a very large number of concurrent CQs.