Impact:
-------
YEAR 2 UPDATE
-------------
Broader Impact:
---------------
We conducted two technology transfer workshops.
As mentioned earlier, such events help disseminate
our results while also teaching us about relevant
problems in industry.
(1) We organized a workshop with IBM Almaden (January 26, 2005),
focusing on the topic of privacy and security (one of the four
DataMotion thrusts).
(2) We organized a workshop at Google (April 28, 2005),
where we reported ongoing work in our group.
Research:
---------
During our second year, we have continued our DataMotion research,
in the four thrust areas. In what follows we describe
our results.
** Deciding Placement of Operators in a DataMotion Network
In sensor networks, data acquisition frequently takes place at
low-capability devices. The acquired data is then transmitted through
a hierarchy of nodes having progressively increasing network bandwidth
and computational power. We have considered the problem of executing queries
over these data streams, posed at the root of the hierarchy. To
minimize data transmission, it is desirable to perform "in-network"
query processing: do some part of the work at intermediate nodes as
the data travels to the root. Most previous work on in-network query
processing has focused on aggregation and inexpensive filters. In our
work, we addressed in-network processing for queries involving possibly
expensive conjunctive filters and joins. We considered the problem of
placing operators along the nodes of the hierarchy so that the overall
cost of computation and data transmission is minimized. We showed that
the problem is tractable by giving an optimal algorithm, and demonstrated
that a simpler greedy operator-placement algorithm can fail to find
the optimal solution. Finally, we defined a number of interesting
variations of the basic operator placement problem and demonstrated
their hardness.
Srivastava, Utkarsh; Munagala, Kamesh; Widom, Jennifer. Operator
Placement for In-Network Stream Query Processing. In Proc. of PODS 2005,
June 2005.
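As a rough illustration of the cost model (not the algorithm from the paper, which finds the optimum in polynomial time), the sketch below assigns an ordered sequence of filters to nodes of a chain from leaf to root. Each filter has a hypothetical per-tuple cost and selectivity, each node a computation-cost scale (low-capability leaves are expensive), and each link a per-tuple transmission cost; placement is exhaustive for clarity only.

```python
from itertools import combinations_with_replacement

def plan_cost(placement, filters, scale, trans, rate=1.0):
    """Cost of running `filters` (cost, selectivity) at the nodes given by
    `placement`, a non-decreasing tuple of node indices (leaf = 0)."""
    cost, r, f = 0.0, rate, 0
    for node in range(len(scale)):
        # apply every filter assigned to this node
        while f < len(filters) and placement[f] == node:
            c, s = filters[f]
            cost += c * scale[node] * r   # computation cost at this node
            r *= s                        # surviving fraction of tuples
            f += 1
        if node < len(scale) - 1:
            cost += trans[node] * r       # ship survivors up one link
    return cost

def optimal_placement(filters, scale, trans):
    """Exhaustive search over non-decreasing placements (illustration only)."""
    best = None
    for p in combinations_with_replacement(range(len(scale)), len(filters)):
        c = plan_cost(p, filters, scale, trans)
        if best is None or c < best[1]:
            best = (p, c)
    return best
```

With an expensive leaf (scale 4.0) and a cheap root (scale 1.0), the search pushes even a selective filter toward the root, which is the kind of trade-off the placement problem captures.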
** Adaptive Caching for Continuous Queries
We have addressed the problem of executing continuous multiway join
queries in unpredictable and volatile environments. Our query class
captures windowed join queries in data stream systems as well as
conventional maintenance of materialized join views. Our adaptive
approach handles streams of updates whose rates and data
characteristics may change over time, as well as changes in system
conditions such as memory availability. We focused specifically on
the problem of adaptive placement and removal of caches to optimize
join performance. Our approach automatically considers conventional
tree-shaped join plans with materialized subresults at every
intermediate node, subresult-free MJoins, and the entire spectrum
between them. We have developed a family of algorithms for selecting
candidate caches, monitoring their cost and benefits in current
conditions, allocating memory to caches, and adapting as conditions
change. All of our algorithms were implemented in the STREAM prototype
Data Stream Management System and a thorough experimental evaluation
was performed.
Babu, Shivnath; Munagala, Kamesh; Widom, Jennifer; Motwani, Rajeev.
Adaptive Caching for Continuous Queries. In Proc. of ICDE 2005,
April 2005.
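The bookkeeping behind adaptive cache placement can be sketched as follows. This is an illustrative toy, not the StreaMon/STREAM implementation: a cache for one join subresult records the recomputation work its hits have saved, so an adaptive executor can compare benefit per unit of memory across candidate caches and drop the weak ones as conditions change. The class and cost model are assumptions for the sketch.

```python
class SubresultCache:
    """Hypothetical cache for one intermediate join subresult."""
    def __init__(self, probe_cost=1.0):
        self.store = {}
        self.hits = self.misses = 0
        self.probe_cost = probe_cost  # assumed cost of recomputing one probe

    def probe(self, key, recompute):
        """Answer a probe from the cache, recomputing (and caching) on a miss."""
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        value = recompute(key)
        self.store[key] = value
        return value

    def benefit_per_entry(self):
        """Recomputation work saved so far, per unit of memory used."""
        if not self.store:
            return 0.0
        return self.hits * self.probe_cost / len(self.store)
```

An adaptive policy would periodically rank candidate caches by `benefit_per_entry()` and keep only those that fit the current memory budget.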
** The Pipelined Set Cover Problem
The pipelined filters problem in databases and data stream management
systems is the problem of finding the optimal ordering of a set of
dependent selection or join operators in a query processor. This
problem can be abstracted as a generalization of the classical set
cover problem. In this generalization, called pipelined set cover,
the sets are applied sequentially to the universe and the covered
elements are discarded. The cost of including a set in pipelined set
cover is directly proportional to the number of uncovered elements at
the point of applying the set. The cost therefore depends on the
ordering of the sets, in addition to the sets themselves. We have
shown that several natural heuristics for this NP-hard problem (such
as greedy set cover and local search) can be analyzed using a common
linear-programming framework. This approach bounds the approximation
ratio as well as the running time of the corresponding algorithms. In
particular, we showed that the greedy and local search algorithms are
4-approximations for both uniform and nonuniform processing costs,
which is in contrast to the logarithmic performance guarantees
achievable for classical set cover. We have extended our analysis to
minimize the Lp-norm of the costs paid by the sets, where p >= 2 is an
integer, to examine the improvement in performance when the total cost
has increasing contribution from the initial sets in the
pipeline. Using a novel Lagrangian-relaxation analysis, we showed that
the greedy algorithm is a 9^(1/p)-approximation for the uniform
processing cost model, while the local search algorithm is a
4^(1/p)-approximation when the costs are nonuniform. We also
addressed the online version of pipelined set cover and developed a
competitive algorithm with a logarithmic performance guarantee.
Munagala, Kamesh; Babu, Shivnath; Motwani, Rajeev; Widom, Jennifer.
The Pipelined Set Cover Problem. In Proc. of ICDT 2005, Jan. 2005.
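The greedy heuristic analyzed above can be sketched in a few lines: repeatedly apply the set that covers the most still-uncovered elements per unit processing cost, paying for every element still uncovered when the set runs. The example sets and uniform costs are hypothetical; this is the heuristic, not the LP analysis.

```python
def greedy_pipelined_order(sets, universe, cost=lambda s: 1.0):
    """Greedy ordering for pipelined set cover (sketch)."""
    remaining = set(universe)
    unused = list(sets)
    order, total_paid = [], 0.0
    while remaining and unused:
        # pick the set covering the most uncovered elements per unit cost
        best = max(unused, key=lambda s: len(remaining & set(s)) / cost(s))
        # the set's cost scales with the elements still "in the pipeline"
        total_paid += cost(best) * len(remaining)
        remaining -= set(best)
        unused.remove(best)
        order.append(best)
    return order, total_paid
```

Because the paid cost depends on how many elements survive to each stage, reordering the same sets changes the total, which is exactly what distinguishes pipelined set cover from classical set cover.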
** Survey on Adaptive Query Processing
A great deal of work on adaptive query processing has been done over
the last few years: Adaptive query processing has been used to detect
and correct optimizer errors due to incorrect statistics or simplified
cost metrics; it has been applied to long-running continuous queries
over data streams whose characteristics change over time; and
routing-based adaptive query processing does away with the optimizer
altogether. Despite this large body of interrelated work, no unifying
comparison of adaptive query processing techniques or systems has been
attempted; we have tackled this problem. Our studies have identified
three families of systems (plan-based, CQ-based, and routing-based),
and we have compared them in detail with respect to the most important
aspects of adaptive query processing: plan quality, statistics
monitoring and re-optimization, plan migration, and scalability. We
have proposed two new approaches to adaptive query processing that
address some of the shortcomings revealed by our in-depth analysis:
(1) "Proactive re-optimization," in which the optimizer chooses
initial query plans with the expectation of re-optimization; and (2)
"Plan logging," in which optimizer decisions under different
conditions are logged over time, enabling plan re-use as well as
analysis of relevant statistics and benefits of adaptivity.
Babu, Shivnath; Bizarro, Pedro. Adaptive Query Processing in the
Looking Glass. In Proc. of the Second Biennial Conference on
Innovative Data Systems Research (CIDR), Jan. 2005
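The plan-logging idea can be illustrated with a small sketch (hypothetical, not an implemented system component): optimizer decisions are keyed by bucketed statistics, so that when similar conditions recur the logged plan is reused instead of re-optimizing. The bucketing scheme and interface are assumptions for the sketch.

```python
class PlanLog:
    """Hypothetical plan log: maps bucketed statistics to a chosen plan."""
    def __init__(self, bucket=0.1):
        self.bucket = bucket
        self.log = {}   # bucketed selectivity vector -> plan

    def _key(self, stats):
        # coarse bucketing so nearby statistics share an entry
        return tuple(round(s / self.bucket) for s in stats)

    def lookup(self, stats):
        """Return a previously logged plan for similar statistics, if any."""
        return self.log.get(self._key(stats))

    def record(self, stats, plan):
        self.log[self._key(stats)] = plan
```

The accumulated log also supports the offline analysis mentioned above: one can replay it to see how often conditions changed enough to warrant a different plan.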
** Approximate Counts and Quantiles over Sliding Windows
We considered the problem of maintaining two important statistics,
approximate frequency counts and quantiles, over sliding windows on
data streams using limited memory. Frequency counts are used to compute
frequent elements in a dataset and quantiles to understand the
distribution of values. For many practical stream rates and window
sizes, it is infeasible to store the entire contents of the current
window, so computing approximate statistics using limited memory is
an important problem.
We developed various deterministic and randomized algorithms for this
problem. All of our algorithms require space that is logarithmic or
less in the window size. Further, the algorithms admit an error
parameter that can be used to control the approximation. The statistics
maintained by the algorithms can be made more accurate by using a
smaller error parameter, but this increases the space required by the
algorithm.
Arasu, Arvind; Manku, Gurmeet Singh. Approximate Counts and Quantiles
over Sliding Windows. In Proc. of PODS 2004, June 2004.
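The accuracy/space trade-off controlled by the error parameter can be illustrated with a much-simplified basic-window scheme (an assumption-laden sketch: it keeps full per-sub-window counts and is therefore not space-logarithmic like our actual algorithms). The window is cut into sub-windows of roughly eps * W items; whole sub-windows expire at once, so counts are off by at most one sub-window's contents.

```python
from collections import Counter, deque

class WindowedCounts:
    """Simplified approximate frequency counts over a count-based window."""
    def __init__(self, window, eps):
        self.sub = max(1, int(window * eps))   # items per basic window
        self.max_subs = window // self.sub
        self.buckets = deque([Counter()])      # oldest ... current
        self.in_current = 0

    def add(self, item):
        self.buckets[-1][item] += 1
        self.in_current += 1
        if self.in_current == self.sub:        # close the basic window
            self.buckets.append(Counter())
            self.in_current = 0
            if len(self.buckets) > self.max_subs + 1:
                self.buckets.popleft()         # expire the oldest one

    def estimate(self, item):
        return sum(b[item] for b in self.buckets)
```

Shrinking eps shrinks the sub-windows and hence the error, at the price of keeping more of them, which mirrors the space/accuracy knob described above.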
** Resource Sharing in Continuous Sliding Window Aggregates
We considered the problem of resource sharing when processing large
numbers of continuous queries. We specifically addressed sliding-window
aggregates over data streams, an important class of continuous
operators for which sharing has not been addressed. We developed a
suite of sharing techniques that cover a wide range of possible
scenarios: different classes of aggregation functions (algebraic,
distributive, holistic), different window types (time-based,
tuple-based, suffix, historical), and different input models (single
stream, multiple substreams). We developed precise theoretical
performance guarantees for our techniques, and showed their practical
effectiveness through a thorough experimental study.
Arasu, Arvind; Widom, Jennifer. Resource Sharing in Continuous Sliding
Window Aggregates. In Proc. of VLDB 2004, Sept. 2004.
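One form the sharing can take is illustrated below with shared per-pane partial sums (a deliberate simplification of our techniques; pane sizes and window lengths are assumed to align). The stream is cut into panes, each pane is aggregated once, and every query's window result is assembled from the shared pane aggregates rather than rescanning the data per query.

```python
from collections import deque

def shared_window_sums(stream, pane, window_panes_list):
    """Compute sliding-window sums for several window lengths (in panes),
    sharing one per-pane partial sum across all of them."""
    panes = deque()
    pane_buf = []
    results = {w: [] for w in window_panes_list}
    for x in stream:
        pane_buf.append(x)
        if len(pane_buf) == pane:
            panes.append(sum(pane_buf))     # one shared partial aggregate
            pane_buf = []
            for w in window_panes_list:     # each query reuses the pane sums
                if len(panes) >= w:
                    results[w].append(sum(list(panes)[-w:]))
    return results
```

This works directly for distributive and algebraic aggregates like SUM; holistic aggregates such as quantiles need the more specialized techniques developed in the paper.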
** Caching Queues in Memory Buffers
Motivated by the need for maintaining multiple, large queues of data in
modern high-performance systems, we study the problem of caching queues
in memory under the following simple, but widely applicable, model.
At each clock-tick, any number of data items may enter
the various queues, while data-items are consumed from the heads of the
queues. Since the number of unconsumed items may exceed memory buffer size,
some items in the queues need to be spilled to secondary storage and later
moved back into memory for consumption. We provide online queue-caching
algorithms under a number of interesting cost models.
Motwani, R.; Thomas, D. Caching Queues in Memory Buffers. In Proc. of
SODA 2004.
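A toy version of the model helps fix intuition (spill-from-the-longest-queue is one plausible heuristic for the sketch, not necessarily a policy from the paper). Queue heads stay in memory for consumption; when the buffer overflows, tail items are spilled to "disk" and faulted back when the head reaches them, with FIFO order preserved.

```python
from collections import deque

class QueueCache:
    """Multiple FIFO queues sharing one memory buffer of size `budget`."""
    def __init__(self, budget):
        self.budget = budget
        self.mem = {}    # queue id -> in-memory items (queue front)
        self.disk = {}   # queue id -> spilled items (queue back)
        self.io = 0      # disk transfers, the cost being minimized

    def enqueue(self, q, item):
        self.mem.setdefault(q, deque())
        self.disk.setdefault(q, deque())
        if self.disk[q]:
            self.disk[q].append(item)  # must go behind spilled items
            self.io += 1
        else:
            self.mem[q].append(item)
            while sum(len(m) for m in self.mem.values()) > self.budget:
                # heuristic: spill from the tail of the longest queue
                victim = max(self.mem, key=lambda k: len(self.mem[k]))
                self.disk[victim].appendleft(self.mem[victim].pop())
                self.io += 1

    def dequeue(self, q):
        if not self.mem[q] and self.disk[q]:
            self.mem[q].append(self.disk[q].popleft())  # fault back
            self.io += 1
        return self.mem[q].popleft()
```

The interesting algorithmic question, studied in the paper under several cost models, is which items to spill so that total I/O stays competitive with an offline optimum.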
** Operator Scheduling in Data Stream Systems
In many applications involving continuous data streams, data arrival
is bursty and data rate fluctuates over time. Systems that seek to
give rapid or real-time query responses in such an environment must
be prepared to deal gracefully with bursts in data arrival without
compromising system performance. We studied one strategy for
processing bursty streams: adaptive, load-aware scheduling of query
operators to minimize resource consumption during times of peak load.
We showed that the choice of an operator scheduling strategy can have
significant impact on the runtime system memory usage as well as
output latency. Our aim was to design a scheduling strategy that
minimizes the maximum runtime system memory while maintaining the
output latency within prespecified bounds. We first developed Chain
scheduling, an operator scheduling strategy for data stream systems
that is near-optimal in minimizing runtime memory usage for any
collection of single-stream queries involving selections,
projections, and foreign-key joins with stored relations. Chain
scheduling also performs well for queries with sliding-window joins
over multiple streams and multiple queries of the above types.
However, during bursts in input streams, when there is a buildup of
unprocessed tuples, Chain scheduling may lead to high output latency.
We studied the online problem of minimizing maximum runtime memory,
subject to a constraint on maximum latency. We have derived
preliminary observations, negative results, and heuristics for this
problem. A thorough experimental evaluation has been conducted where
we demonstrate the potential benefits of Chain scheduling and its
different variants, compare it with competing scheduling strategies,
and validate our analytical conclusions.
Babcock, B.; Babu, S.; Datar, M.; Motwani, R.; Thomas, D.
Operator Scheduling in Data Stream Systems. VLDB Journal, 2004.
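The greedy priority rule behind Chain can be sketched as follows. This is a simplification (the full Chain algorithm groups operators into chains along the lower envelope of the progress chart, and real queues hold tuples, not fractional counts); selectivities and costs here are hypothetical. Each scheduling step runs the operator that sheds the most memory per unit time among those with queued input.

```python
def chain_step(ops, queues):
    """One scheduling decision of a simplified Chain-style scheduler.
    ops: list of (selectivity, per-tuple cost) per pipeline operator.
    queues: tuples waiting before each operator (mutated and returned)."""
    runnable = [i for i, q in enumerate(queues) if q >= 1]
    if not runnable:
        return queues
    # steepest slope: memory freed per unit of processing time
    i = max(runnable, key=lambda j: (1 - ops[j][0]) / ops[j][1])
    sel = ops[i][0]
    queues[i] -= 1
    if i + 1 < len(queues):
        queues[i + 1] += sel     # selective operators shrink the data
    return queues
```

During a burst, this rule drains the queues feeding the most selective cheap operators first, which is why it keeps peak memory low; the latency issue discussed above arises because tuples parked behind less urgent operators can wait a long time.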
** A Distributed Architecture for Secure Database Services
Recent trends towards outsourcing have led to interest in the
database-as-a-service model, where an organization uses an external
service provider to store and manage its data. However, concerns,
legal and otherwise, over data privacy have led to the requirement
that the service provider not be able to obtain any private
information despite storing all of the organization's data. The
traditional approach to building such a "secure" database service has
been to make the organization encrypt its data before storing it in
the untrusted external database. Queries to the database are executed
over the encrypted data, and the results are brought over to the
"client" side where they are decrypted. The major difficulty with
this approach is that it is very hard to efficiently execute queries
over encrypted data and, often, one is left with no alternative but
to transport a large fraction of the database back to the client,
decrypt it on the client side, and then execute the query against
this decrypted database. Such an approach is clearly very expensive
and eliminates the potential advantages of data outsourcing.
We developed a new, distributed architecture that allows an
organization to outsource its data management to two untrusted but
non-colluding servers while preserving data privacy. The presence of
two servers enables efficient partitioning of data so that the
contents at any one server are guaranteed not to breach data privacy.
For example, it may be a breach of privacy to reveal the name of any
customer together with his/her social security number (SSN). However,
if we were to store customer names at one site, and SSNs at another,
neither site is capable of breaching privacy merely from the data
available to it. Moreover, since both the names and the SSNs are
stored in the clear without any encryption, it is possible to
efficiently execute queries using distributed query processing
techniques. We studied the different challenges involved in enabling
such a distributed architecture for secure database services,
including problems of distributed-schema design, query processing,
optimization and execution.
Aggarwal, Bawa, Ganesan, Garcia-Molina, Kenthapadi, Motwani,
Srivastava, Thomas, and Xu, Two Can Keep a Secret: A Distributed
Architecture for Secure Database Services. Proceedings of the Second
Biennial Conference on Innovative Data Systems Research (CIDR), pages
186--199, Asilomar, CA, January 2005.
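The names-and-SSNs example above can be made concrete with a minimal sketch (column names and the synthetic tuple-id scheme are illustrative, not the paper's schema-design machinery). Each tuple's attributes are split across two fragments linked only by a tuple id, so neither server alone holds the privacy-breaching combination, yet a client can rejoin the cleartext fragments with an ordinary distributed join.

```python
from itertools import count

def vertical_partition(rows, cols1, cols2):
    """Split rows into two fragments, one per untrusted server."""
    tid = count()
    frag1, frag2 = [], []
    for row in rows:
        t = next(tid)   # synthetic join key; reveals nothing by itself
        frag1.append({"tid": t, **{c: row[c] for c in cols1}})
        frag2.append({"tid": t, **{c: row[c] for c in cols2}})
    return frag1, frag2

def rejoin(frag1, frag2):
    """Client-side distributed join of the two fragments on tid."""
    by_tid = {r["tid"]: r for r in frag2}
    return [{**r, **by_tid[r["tid"]]} for r in frag1]
```

Because both fragments are stored in the clear, selections and joins can run at the servers with standard distributed query processing, avoiding the ship-and-decrypt cost of the encryption-based approach.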
--------------------------------------
Training:
---------
YEAR 2 UPDATE
-------------
Two students completed their PhD theses in the DataMotion
area:
** Adaptive Processing in Data Stream Management Systems
Arvind Arasu, PhD Thesis
Many modern applications deal with data that arrives continuously and
must be processed in real-time: network performance monitoring,
financial analysis over stock tickers, sensor data processing, and
others. Data Stream Management Systems (DSMSs) have recently been
developed for this class of applications. DSMSs support continuous
long-running queries, so they face a fundamental challenge: stream data
and arrival characteristics, as well as system conditions, may vary
over time. To address this problem, we developed adaptive approaches
to DSMS query processing and optimization. We provide algorithms,
performance metrics, guarantees, and empirical results for two
important classes of continuous queries: pipelined stream filters and
multiway stream joins. Our techniques have been implemented in
StreaMon, the adaptive query processing engine we developed for
Stanford's STREAM prototype DSMS. We also outline general research
directions for adaptive processing as a key component of future data
management systems.
** Continuous Queries over Data Streams
Shivnath Babu, PhD Thesis
Continuous queries (CQs) represent a new paradigm for interacting
with dynamically-changing data. Unlike traditional one-time queries,
a CQ is registered with a data management system and provides
continuous results as data and updates stream into the system.
Applications include tracking real-time trends in stock market data,
monitoring the health of a computer network, or online processing of
sensor data.
As part of this thesis, we have addressed several challenges in
building a system for processing declaratively-specified continuous
queries. We modified and extended a traditional database query
language for the data-stream and CQ context. We implemented this
language in STREAM, a complete prototype for data stream management.
Because CQs are long-running, we studied their memory requirements,
providing a precise characterization of memory required by any query
in our language. We also developed techniques for sharing resources
such as computation and state across multiple CQs, enabling
scalability to a very large number of concurrent CQs.