Impact:

-------

 

YEAR 2 UPDATE

-------------

 

Broader Impact:

---------------

 

We conducted two technology transfer workshops.

As mentioned earlier, this type of event helps disseminate

our results and, at the same time, helps

us learn about relevant problems in industry.

 

(1) We organized a workshop with IBM Almaden (January 26, 2005),

focusing on the topic of privacy and security (one of the four

DataMotion thrusts).

 

(2) We organized a workshop at Google (April 28, 2005),

where we reported ongoing work in our group.

 

 

Research:

---------

 

During our second year, we have continued our DataMotion research

in the four thrust areas. In what follows, we describe

our results.

 

** Deciding Placement of Operators in a DataMotion Network

 

In sensor networks, data acquisition frequently takes place at

low-capability devices. The acquired data is then transmitted through

a hierarchy of nodes having progressively increasing network bandwidth

and computational power. We have considered the problem of executing queries

over these data streams, posed at the root of the hierarchy. To

minimize data transmission, it is desirable to perform ``in-network''

query processing: do some part of the work at intermediate nodes as

the data travels to the root. Most previous work on in-network query

processing has focused on aggregation and inexpensive filters. In our work,

we address in-network processing for queries involving possibly

expensive conjunctive filters and joins. We considered the problem of

placing operators along the nodes of the hierarchy so that the overall

cost of computation and data transmission is minimized. We showed that

the problem is tractable, giving an optimal algorithm, and demonstrating

that a simpler greedy operator placement algorithm can fail to find

the optimal solution. Finally, we defined a number of interesting

variations of the basic operator placement problem and demonstrated

their hardness.

 

Srivastava, Utkarsh; Munagala, Kamesh; Widom, Jennifer. Operator

Placement for In-Network Stream Query Processing. In Proc. of PODS 2005,

June 2005.
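
To make the cost trade-off concrete, the following Python sketch

brute-forces the placement of a fixed sequence of filters on a small

chain of nodes. The node capabilities, filter costs, and selectivities

are hypothetical, and the exhaustive search merely stands in for the

efficient optimal algorithm described above.

    from itertools import combinations_with_replacement

    # Hypothetical chain of nodes, leaf -> root.  compute_cost[i] is the
    # per-unit CPU cost of running filter work at node i (leaves are weaker,
    # so costlier); transmit_cost[i] is the cost of shipping one tuple from
    # node i to node i+1.
    compute_cost  = [8.0, 4.0, 2.0, 1.0]
    transmit_cost = [1.0, 1.0, 1.0]

    # Ordered conjunctive filters: (cpu_work_per_tuple, selectivity).
    filters = [(2.0, 0.5), (5.0, 0.4), (20.0, 0.8)]

    def plan_cost(placement, n_tuples=1000):
        """Cost of running filters[k] at node placement[k] (non-decreasing)."""
        cost, remaining, f = 0.0, float(n_tuples), 0
        for node in range(len(compute_cost)):
            while f < len(filters) and placement[f] == node:
                work, sel = filters[f]
                cost += work * compute_cost[node] * remaining
                remaining *= sel
                f += 1
            if node < len(transmit_cost):          # ship survivors upward
                cost += transmit_cost[node] * remaining
        return cost

    # Exhaustive search over non-decreasing placements (fine for tiny inputs;
    # the efficient optimal algorithm is what the paper actually gives).
    best = min(combinations_with_replacement(range(len(compute_cost)), len(filters)),
               key=plan_cost)
    print("best placement:", best, "cost:", plan_cost(best))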

 

 

** Adaptive Caching for Continuous Queries

 

We have addressed the problem of executing continuous multiway join

queries in unpredictable and volatile environments. Our query class

captures windowed join queries in data stream systems as well as

conventional maintenance of materialized join views. Our adaptive

approach handles streams of updates whose rates and data

characteristics may change over time, as well as changes in system

conditions such as memory availability. We focused specifically on

the problem of adaptive placement and removal of caches to optimize

join performance. Our approach automatically considers conventional

tree-shaped join plans with materialized subresults at every

intermediate node, subresult-free MJoins, and the entire spectrum

between them. We have developed a family of algorithms for selecting

candidate caches, monitoring their costs and benefits under current

conditions, allocating memory to caches, and adapting as conditions

change. All of our algorithms were implemented in the STREAM prototype

Data Stream Management System, and a thorough experimental evaluation

was performed.

 

Babu, Shivnath; Munagala, Kamesh; Widom, Jennifer; Motwani, Rajeev.

Adaptive Caching for Continuous Queries. In Proc. of ICDE 2005,

April 2005.
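
The adaptivity decision can be pictured as simple cost/benefit

bookkeeping: keep a cache on a join subresult only while the probe

work it saves exceeds the work needed to keep it up to date. The

sketch below is an illustration with hypothetical names and costs,

not the STREAM implementation.

    class SubresultCache:
        """Bookkeeping for one candidate cache on a join subresult."""

        def __init__(self, name, probe_cost_saved, maintenance_cost):
            self.name = name
            self.probe_cost_saved = probe_cost_saved  # cost units saved per cache hit
            self.maintenance_cost = maintenance_cost  # cost units per update to cached inputs
            self.hits = 0
            self.updates = 0

        def record_hit(self):
            self.hits += 1

        def record_update(self):
            self.updates += 1

        def net_benefit(self):
            """Benefit of keeping the cache over the current monitoring window."""
            return self.hits * self.probe_cost_saved - self.updates * self.maintenance_cost

        def reset_window(self):
            self.hits = self.updates = 0

    # Example: a cache on the A-join-B subresult of an MJoin over streams A, B, C.
    cache = SubresultCache("A_join_B", probe_cost_saved=3.0, maintenance_cost=1.0)
    for _ in range(200):
        cache.record_hit()        # many C-tuples reuse the cached subresult
    for _ in range(150):
        cache.record_update()     # A/B arrivals force the subresult to be refreshed
    print("keep cache" if cache.net_benefit() > 0 else "drop cache")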

 

** The Pipelined Set Cover Problem

 

The pipelined filters problem in databases and data stream management

systems is the problem of finding the optimal ordering of a set of

dependent selection or join operators in a query processor. This

problem can be abstracted as a generalization of the classical set

cover problem. In this generalization, called pipelined set cover,

the sets are applied sequentially to the universe and the covered

elements are discarded. The cost of including a set in pipelined set

cover is directly proportional to the number of uncovered elements at

the point of applying the set. The cost therefore depends on the

ordering of the sets, in addition to the sets themselves. We have

shown that several natural heuristics for this NP-hard problem (such

as greedy set cover and local search) can be analyzed using a common

linear-programming framework. This approach bounds the approximation

ratio as well as the running time of the corresponding algorithms. In

particular, we showed that the greedy and local search algorithms are

4-approximations for both uniform and nonuniform processing costs,

which is in contrast to the logarithmic performance guarantees

achievable for classical set cover. We have extended our analysis to

minimize the Lp-norm of the costs paid by the sets, where p >= 2 is an

integer, to examine the improvement in performance when the total cost

has increasing contribution from the initial sets in the

pipeline. Using a novel Lagrangian-relaxation analysis, we showed that

the greedy algorithm is a 9^(1/p)-approximation for the uniform

processing cost model, while the local search algorithm is a

4^(1/p)-approximation when the costs are nonuniform. We also

addressed the online version of pipelined set cover and developed a

competitive algorithm with a logarithmic performance guarantee.

 

Munagala, Kamesh; Babu, Shivnath; Motwani, Rajeev; Widom,

Jennifer. The Pipelined Set Cover Problem. In Proc. of ICDT 2005, Jan. 2005.
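
The cost model and the greedy heuristic are easy to state in code.

In the sketch below (with made-up sets and costs), applying a set

costs its per-element processing cost times the number of elements

still uncovered, and greedy repeatedly picks the set with the best

coverage-per-cost ratio.

    def pipelined_cost(order, sets, costs, universe):
        """Total cost of applying the sets in the given order."""
        total, uncovered = 0.0, set(universe)
        for name in order:
            total += costs[name] * len(uncovered)  # pay for every surviving element
            uncovered -= sets[name]                # covered elements are discarded
        return total

    def greedy_order(sets, costs, universe):
        """Greedy: next pick the set maximizing (newly covered elements) / cost."""
        order, uncovered, remaining = [], set(universe), set(sets)
        while remaining and uncovered:
            best = max(remaining, key=lambda s: len(sets[s] & uncovered) / costs[s])
            order.append(best)
            uncovered -= sets[best]
            remaining.discard(best)
        return order + sorted(remaining)           # exhausted sets go last, in any order

    universe = range(10)
    sets  = {"S1": {0, 1, 2, 3, 4, 5}, "S2": {4, 5, 6, 7}, "S3": {7, 8, 9}}
    costs = {"S1": 2.0, "S2": 1.0, "S3": 1.0}
    order = greedy_order(sets, costs, universe)
    print(order, pipelined_cost(order, sets, costs, universe))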

 

** Survey on Adaptive Query Processing

 

A great deal of work on adaptive query processing has been done over

the last few years: Adaptive query processing has been used to detect

and correct optimizer errors due to incorrect statistics or simplified

cost metrics; it has been applied to long-running continuous queries

over data streams whose characteristics change over time; and

routing-based adaptive query processing does away with the optimizer

altogether. Despite this large body of interrelated work, no unifying

comparison of adaptive query processing techniques or systems has been

attempted; we have tackled this problem. Our studies have identified

three families of systems (plan-based, CQ-based, and routing-based),

and we have compared them in detail with respect to the most important

aspects of adaptive query processing: plan quality, statistics

monitoring and re-optimization, plan migration, and scalability. We

have proposed two new approaches to adaptive query processing that

address some of the shortcomings revealed by our in-depth analysis:

(1) "Proactive re-optimization," in which the optimizer chooses

initial query plans with the expectation of re-optimization; and (2)

"Plan logging," in which optimizer decisions under different

conditions are logged over time, enabling plan re-use as well as

analysis of relevant statistics and benefits of adaptivity.

 

Babu, Shivnath; Bizarro, Pedro. Adaptive Query Processing in the

Looking Glass. In Proc. of the Second Biennial Conference on

Innovative Data Systems Research (CIDR), Jan. 2005.

 

 

** Approximate Counts and Quantiles over Sliding Windows

 

We considered the problem of maintaining two important statistics -

approximate frequency counts and quantiles - over stream sliding

windows using limited memory.  Frequency counts are used to compute

frequent elements in a dataset and quantiles to understand the

distribution of values.  For many practical stream rates and window

sizes, it is infeasible to store the entire contents of the current

window, so computing approximate statistics using limited memory is

an important problem.

 

We developed various deterministic and randomized algorithms for this

problem.  All of our algorithms require space that is logarithmic or

less in the window size.  Further, the algorithms admit an error

parameter that can be used to control the approximation.  The statistics

maintained by the algorithms can be made more accurate by using a

smaller error parameter, but this increases the space required by the

algorithm.

 

Arvind Arasu, Gurmeet Singh Manku.  Approximate Counts and Quantiles

over Sliding Windows.  PODS 2004.  June 2004.
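
As a rough illustration of the space/accuracy trade-off (not the

specific algorithms of the paper), the sketch below maintains an

exponential-histogram-style approximate count of 1s over the last W

items: it keeps roughly (1/eps) log W buckets, and its answer has

relative error at most about eps.

    import math

    class WindowCounter:
        """Approximate count of 1s among the last `window` items
        (simplified exponential-histogram sketch)."""

        def __init__(self, window, eps):
            self.window = window
            self.k = math.ceil(1.0 / eps)   # buckets allowed per size
            self.time = 0
            self.buckets = []               # (arrival_time, size), oldest first

        def _expire(self):
            while self.buckets and self.buckets[0][0] <= self.time - self.window:
                self.buckets.pop(0)

        def add(self, bit):
            self.time += 1
            self._expire()
            if bit != 1:
                return
            self.buckets.append((self.time, 1))
            size = 1
            while True:
                same = [i for i, (_, s) in enumerate(self.buckets) if s == size]
                if len(same) <= self.k + 1:
                    break
                old, newer = same[0], same[1]          # two oldest buckets of this size
                self.buckets[newer] = (self.buckets[newer][0], 2 * size)
                del self.buckets[old]
                size *= 2

        def estimate(self):
            self._expire()
            if not self.buckets:
                return 0
            total = sum(s for _, s in self.buckets)
            return total - self.buckets[0][1] // 2     # oldest bucket counted by half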

 

** Resource Sharing in Continuous Sliding Window Aggregates

 

We considered the problem of resource sharing when processing large

numbers of continuous queries. We specifically addressed sliding-window

aggregates over data streams, an important class of continuous

operators for which sharing has not been addressed.  We developed a

suite of sharing techniques that cover a wide range of possible

scenarios: different classes of aggregation functions (algebraic,

distributive, holistic), different window types (time-based,

tuple-based, suffix, historical), and different input models (single

stream, multiple substreams). We developed precise theoretical

performance guarantees for our techniques, and showed their practical

effectiveness through a thorough experimental study.

 

Arvind Arasu, Jennifer Widom.  Resource Sharing in Continuous Sliding

Window Aggregates.  VLDB 2004.  Sept. 2004
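
One simple form of sharing is pane-based partial aggregation: the

system keeps one partial sum per fixed-size pane of the stream, and

queries with different window lengths combine the same panes. The

sketch below illustrates only this flavor of sharing, with

hypothetical names and sizes; it is not the techniques of the paper.

    from collections import deque

    class SharedSlidingSums:
        """One shared pane structure serving many sliding-window SUM queries."""

        def __init__(self, pane_size, max_window):
            self.pane_size = pane_size
            self.panes = deque(maxlen=max_window // pane_size)  # per-pane sums, newest last
            self.current_sum = 0.0
            self.current_count = 0

        def add(self, value):
            self.current_sum += value
            self.current_count += 1
            if self.current_count == self.pane_size:   # close the pane, start a new one
                self.panes.append(self.current_sum)
                self.current_sum, self.current_count = 0.0, 0

        def window_sum(self, window):
            """Sum over the last `window` items, at pane granularity
            (window must be a multiple of pane_size and at most max_window)."""
            n = window // self.pane_size
            return sum(list(self.panes)[-n:])

    shared = SharedSlidingSums(pane_size=10, max_window=1000)
    for v in range(1, 501):
        shared.add(float(v))
    # Two continuous queries with different windows share the same panes.
    print(shared.window_sum(100), shared.window_sum(300))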

 

** Caching Queues in Memory Buffers

 

Motivated by the need for maintaining multiple, large queues of data in

modern high-performance systems, we study the problem of caching queues

in memory under the following simple, but widely applicable, model.

At each clock-tick, any number of data items may enter

the various queues, while data items are consumed from the heads of the

queues. Since the number of unconsumed items may exceed memory buffer size,

some items in the queues need to be spilled to secondary storage and later

moved back into memory for consumption. We provide online queue-caching

algorithms under a number of interesting cost models.

 

R. Motwani, D. Thomas. Caching Queues in Memory Buffers. SODA 2004
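
The model itself is easy to simulate. The sketch below keeps several

FIFO queues in a shared memory buffer and, when the buffer overflows,

spills the in-memory item farthest from the head of its queue; this

eviction rule is only an illustrative heuristic, not one of the

algorithms from the paper.

    from collections import deque

    class BufferedQueues:
        """Several FIFO queues sharing a fixed number of in-memory slots.
        Each queue entry is True (in memory) or False (spilled to disk)."""

        def __init__(self, memory_slots, n_queues):
            self.memory_slots = memory_slots
            self.queues = [deque() for _ in range(n_queues)]
            self.in_memory = 0
            self.spills = self.fetches = 0

        def enqueue(self, q):
            self.queues[q].append(True)
            self.in_memory += 1
            while self.in_memory > self.memory_slots:
                self._spill_one()

        def _spill_one(self):
            # Evict the in-memory item farthest from the head of its queue.
            best = None
            for queue in self.queues:
                for depth in range(len(queue) - 1, -1, -1):
                    if queue[depth]:
                        if best is None or depth > best[1]:
                            best = (queue, depth)
                        break
            queue, depth = best
            queue[depth] = False
            self.in_memory -= 1
            self.spills += 1

        def dequeue(self, q):
            item = self.queues[q].popleft()
            if item:
                self.in_memory -= 1
            else:
                self.fetches += 1   # item was spilled; must be fetched before consumption
            return item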

 

 

** Operator Scheduling in Data Stream Systems


 

In many applications involving continuous data streams, data arrival

is bursty and data rate fluctuates over time. Systems that seek to

give rapid or real-time query responses in such an environment must

be prepared to deal gracefully with bursts in data arrival without

compromising system performance. We studied one strategy for

processing bursty streams – adaptive, load-aware scheduling of query

operators to minimize resource consumption during times of peak load.

We showed that the choice of an operator scheduling strategy can have

significant impact on the runtime system memory usage as well as

output latency. Our aim was to design a scheduling strategy that

minimizes the maximum runtime system memory while maintaining the

output latency within prespecified bounds. We first developed Chain

scheduling, an operator scheduling strategy for data stream systems

that is near-optimal in minimizing runtime memory usage for any

collection of single-stream queries involving selections,

projections, and foreign-key joins with stored relations. Chain

scheduling also performs well for queries with sliding-window joins

over multiple streams and multiple queries of the above types.

However, during bursts in input streams, when there is a buildup of

unprocessed tuples, Chain scheduling may lead to high output latency.

We studied the online problem of minimizing maximum runtime memory,

subject to a constraint on maximum latency. We have derived

preliminary observations, negative results, and heuristics for this

problem. A thorough experimental evaluation was conducted, in which

we demonstrated the potential benefits of Chain scheduling and its

variants, compared it with competing scheduling strategies, and

validated our analytical conclusions.

 

B. Babcock, S. Babu, M. Datar, R. Motwani, D. Thomas.

Operator Scheduling in Data Stream Systems. VLDB Journal 2004
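
The priorities that drive Chain scheduling can be sketched from each

operator's per-tuple time and selectivity: plot the progress chart of

(cumulative time, remaining tuple size), take its lower envelope, and

give operators on steeper envelope segments higher priority. The

numbers below are made up, and the sketch omits the latency-aware

variants discussed above.

    def chain_priorities(operators):
        """operators: list of (time_per_tuple, selectivity), each time > 0.
        Returns one priority per operator; higher means schedule first."""
        # Progress chart: point i is the state after operators[0..i-1].
        points = [(0.0, 1.0)]
        for t, sel in operators:
            pt, ps = points[-1]
            points.append((pt + t, ps * sel))

        priorities = [0.0] * len(operators)
        i = 0
        while i < len(points) - 1:
            # The steepest drop reachable from point i defines the next
            # segment of the lower envelope.
            best_j, best_slope = None, None
            for j in range(i + 1, len(points)):
                slope = (points[i][1] - points[j][1]) / (points[j][0] - points[i][0])
                if best_slope is None or slope > best_slope:
                    best_j, best_slope = j, slope
            for op in range(i, best_j):            # operators on this segment
                priorities[op] = best_slope
            i = best_j
        return priorities

    # A cheap selective filter, then an expensive unselective one, then another filter.
    print(chain_priorities([(1.0, 0.2), (5.0, 0.9), (1.0, 0.5)]))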

 

** A Distributed Architecture for Secure Database Services

 

Recent trends towards outsourcing have led to interest in the

database-as-a-service model, where an organization uses an external

service provider to store and manage its data. However, concerns,

legal and otherwise, over data privacy have led to the requirement

that the service provider not be able to obtain any private

information even though it stores all the organization's data. The

traditional approach to building such a "secure" database service has

been to make the organization encrypt its data before storing it in

the untrusted external database. Queries to the database are executed

over the encrypted data, and the results are brought over to the

"client" side where they are decrypted. The major difficulty with

this approach is that it is very hard to efficiently execute queries

over encrypted data and, often, one is left with no alternative but

to transport a large fraction of the database back to the client,

decrypt it on the client side, and then execute the query against

this decrypted database. Such an approach is clearly very expensive

and eliminates the potential advantages of data outsourcing.

 

We developed a new, distributed architecture that allows an

organization to outsource its data management to two untrusted but

non-colluding servers while preserving data privacy. The presence of

two servers enables efficient partitioning of data so that the

contents at any one server are guaranteed not to breach data privacy.

For example, it may be a breach of privacy to reveal the name of any

customer together with his/her social security number (SSN). However,

if we were to store customer names at one site, and SSNs at another,

neither site is capable of breaching privacy merely from the data

available to it. Moreover, since both the names and the SSNs are

stored in the clear without any encryption, it is possible to

efficiently execute queries using distributed query processing

techniques. We studied the different challenges involved in enabling

such a distributed architecture for secure database services,

including problems of distributed-schema design, query processing,

optimization and execution.

 

Aggarwal, Bawa, Ganesan, Garcia-Molina, Kenthapadi, Motwani,

Srivastava, Thomas, and Xu, Two Can Keep a Secret: A Distributed

Architecture for Secure Database Services.  Proceedings of the Second

Biennial Conference on Innovative Data Systems Research (CIDR), pages

186--199, Asilomar, CA, January 2005.
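
The partitioning idea can be shown with a toy example: customer names

live at one server, SSNs at another, and only the client joins them

by a shared tuple id. The schema and data below are hypothetical, and

a real deployment must also handle distributed-schema design, query

optimization, and richer privacy constraints.

    # Two untrusted, non-colluding servers; neither ever sees a (name, SSN) pair.
    customers = [("alice", "123-45-6789"), ("bob", "987-65-4321")]

    server_a = {tid: name for tid, (name, _) in enumerate(customers)}  # names only
    server_b = {tid: ssn for tid, (_, ssn) in enumerate(customers)}    # SSNs only

    def client_query(pred_on_name):
        """Client-side distributed join: filter names at server A, then
        fetch the matching SSNs from server B by tuple id."""
        matching_ids = [tid for tid, name in server_a.items() if pred_on_name(name)]
        return [(server_a[tid], server_b[tid]) for tid in matching_ids]

    print(client_query(lambda name: name.startswith("a")))  # [('alice', '123-45-6789')]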

 

 

 

--------------------------------------

Training:

---------

 

YEAR 2 UPDATE

-------------

 

Two students completed their PhD theses in the DataMotion

area:

 

** Adaptive Processing in Data Stream Management Systems

   Arvind Arasu, PhD Thesis

 

Many modern applications deal with data that arrives continuously and

must be processed in real time: network performance monitoring,

financial analysis over stock tickers, sensor data processing, and

others.  Data Stream Management Systems (DSMSs) have recently been

developed for this class of applications. DSMSs support continuous,

long-running queries, so they face a fundamental challenge: stream

data and arrival characteristics, as well as system conditions, may

vary over time. To address this problem, we developed adaptive approaches

to DSMS query processing and optimization.  We provide algorithms,

performance metrics, guarantees, and empirical results for two

important classes of continuous queries: pipelined stream filters and

multiway stream joins.  Our techniques have been implemented in

StreaMon, the adaptive query processing engine we developed for

Stanford's STREAM prototype DSMS.  We also outline general research

directions for adaptive processing as a key component of future data

management systems.

 

** Continuous Queries over Data Streams

   Shivnath Babu, PhD Thesis

 

Continuous queries (CQs) represent a new paradigm for interacting

with dynamically-changing data. Unlike traditional one-time queries,

a CQ is registered with a data management system and provides

continuous results as data and updates stream into the system.

Applications include tracking real-time trends in stock market data,

monitoring the health of a computer network, and online processing of

sensor data.

 

As part of this thesis, we have addressed several challenges in

building a system for processing declaratively-specified continuous

queries.  We modified and extended a traditional database query

language for the data streams and CQ context. We implemented this

language in STREAM, a complete prototype for data stream management.

Because CQs are long-running, we studied their memory requirements,

providing a precise characterization of memory required by any query

in our language. We also developed techniques for sharing resources

such as computation and state across multiple CQs, enabling

scalability to a very large number of concurrent CQs.