DataMotion Report May 2006
--------------------------
Broader Impact:
---------------
We conducted a technology transfer workshop on March 22, 2006, at
Stanford. At the workshop we reported on our DataMotion results of
the past two years, as well as on other work done in our group.
About 150 people participated, mainly from industry. The workshop
program can be found at
http://www-db.stanford.edu/datamotion/InfoLabWorkshop.htm
and photos from the event can be seen at
http://www-db.stanford.edu/~hector/photos/2006/infoLabWkshp/album.htm
We believe that this type of event helps disseminate
our results, and at the same time, helps
us learn about relevant problems in industry.
Research:
---------
During our third year we have continued our DataMotion research
in the four thrust areas. In what follows we describe
our results.
** Optimizing DataMotion Queries with Expensive Filters
DataMotion queries frequently need to perform expensive filtering
(e.g., image recognition, text analysis) on rapidly arriving
data. Furthermore, numerous continuous queries may be present and may
have some expensive filters in common. We consider the problem of
optimizing and executing multiple continuous queries, where each query
is a conjunction of filters and each filter may occur in multiple
queries. When filters are expensive, significant performance gains are
achieved by sharing filter evaluations across queries.
A shared execution strategy in our scenario can either be fixed, in
which filters are evaluated in the same predetermined order for all
input (suitable for relatively stable DataMotion scenarios), or
adaptive, in which the next filter to be evaluated is chosen at
runtime based on the results of the filters evaluated so far (suitable
for volatile scenarios). We present a greedy execution strategy and
show that it approximates the best strategy to within a factor
polylogarithmic in the number of queries and filters. We also show how
the execution overhead can be reduced by appropriate
precomputation. Finally, we present an experimental evaluation
demonstrating the effectiveness of our techniques.
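The flavor of adaptive shared execution can be illustrated with a small Python sketch. This is our simplification, not the report's exact algorithm: for a single input item, the next filter to evaluate is the one with the best ratio of expected query-rejection benefit to cost, and each filter's result is shared by every query containing it. All predicate, cost, and drop-probability values a caller supplies are hypothetical.

```python
def greedy_shared_eval(item, queries, predicates, cost, drop_prob):
    """Evaluate each expensive filter at most once, sharing its outcome
    across all queries.  queries: {query_id: set of filter ids}."""
    outcome = {}                                  # filter id -> bool, shared
    live = {q: set(fs) for q, fs in queries.items()}  # unresolved queries
    while live:
        candidates = {f for fs in live.values() for f in fs}
        # Greedy score: expected live queries rejected per unit of cost.
        f = max(candidates,
                key=lambda f: drop_prob[f]
                * sum(f in fs for fs in live.values()) / cost[f])
        outcome[f] = predicates[f](item)
        for q in list(live):
            if f in live[q]:
                if outcome[f]:
                    live[q].remove(f)
                    if not live[q]:
                        live.pop(q)               # all filters passed
                else:
                    live.pop(q)                   # query rejected
    return {q for q, fs in queries.items()
            if all(outcome.get(f, False) for f in fs)}
```

The scoring rule above is one plausible instantiation; the cited report proves its approximation guarantee for the specific greedy rule analyzed there.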
K. Munagala, U. Srivastava, and J. Widom. Optimization of Continuous
Queries with Shared Expensive Filters. Technical Report, November
2005.
http://dbpubs.stanford.edu:8090/pub/2005-36
** Queries in DataMotion Networks with Web Services
Web services are becoming a standard method of sharing data and
functionality among loosely-coupled systems, and they are likely to
become a prevalent method of integrating DataMotion nodes. We propose
a general-purpose Web Service Query Processor that enables querying
multiple web services in a DataMotion network in a transparent and
integrated fashion.
We have first tackled the problem of query optimization for
Select-Project-Join queries spanning multiple web services. Our main
result is an algorithm for arranging a query's web service calls into
a pipelined execution plan that optimally exploits parallelism among
web services to minimize the query's total running time. Surprisingly,
the optimal plan can be found in polynomial time even in the presence
of arbitrary precedence constraints among web services, in contrast to
traditional query optimization where the analogous problem is
NP-hard. We also give an algorithm for determining the optimal
granularity of data "chunks" to be used for each web service
call. Experiments with an initial prototype indicate that our
algorithms can lead to significant performance improvement over more
straightforward techniques.
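The objective behind the pipelined plans can be sketched as follows. Assuming, for illustration only, a known per-call cost and selectivity for each service, the per-tuple bottleneck cost of a linear order is the maximum over services of the service's cost scaled by the fraction of tuples surviving the services placed before it. The exhaustive search below merely illustrates that objective; the report's algorithm finds the optimum in polynomial time and also handles precedence constraints.

```python
from itertools import permutations

def bottleneck(order, cost, sel):
    """Per-tuple bottleneck cost of a linear pipelined plan: each service
    only sees the tuples that survived the services placed before it."""
    worst, surviving = 0.0, 1.0
    for s in order:
        worst = max(worst, surviving * cost[s])
        surviving *= sel[s]
    return worst

def best_plan(services, cost, sel):
    """Exhaustive search over linear orders, for illustration only."""
    return min(permutations(services),
               key=lambda o: bottleneck(o, cost, sel))
```

For example, a cheap, moderately selective service placed ahead of an expensive one can halve the bottleneck cost, which is why plan order matters even though all services run in parallel.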
U. Srivastava, J. Widom, K. Munagala, and R. Motwani. Query
Optimization over Web Services. Technical Report, October 2005.
http://dbpubs.stanford.edu:8090/pub/2005-30
** Cleaning of Sensor Data in a DataMotion Network
Many DataMotion applications rely on data captured from the physical
world through sensor devices. Data provided by these devices, however,
tends to be unreliable. The data must therefore be cleaned before it
is propagated into the DataMotion network. This work presents
"Extensible Sensor stream Processing" (ESP), a framework for building
sensor data cleaning infrastructures for use in DataMotion
networks. ESP is designed as a pipeline using declarative cleaning
mechanisms based on spatial and temporal characteristics of sensor
data. ESP's effectiveness has been demonstrated by experiments on
three real-world DataMotion scenarios.
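A cleaning pipeline in the spirit of ESP can be sketched as a chain of window-level stages that compose left to right. The stage bodies below are simplified stand-ins of our own devising (a range-based outlier filter and a temporal average), not ESP's actual declarative operators.

```python
def point(window):
    """Drop obvious outliers: readings outside a plausible range.
    The 0..100 bounds are hypothetical, e.g. a temperature sensor."""
    return [r for r in window if 0.0 <= r <= 100.0]

def smooth(window):
    """Temporal smoothing: replace the window with its average reading."""
    return [sum(window) / len(window)] if window else []

def pipeline(stages, window):
    """Apply cleaning stages in order, each consuming the previous output."""
    for stage in stages:
        window = stage(window)
    return window
```

For instance, `pipeline([point, smooth], readings)` first discards a stuck-sensor value like -999.0 and then averages what remains, so one faulty reading does not corrupt the smoothed output.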
S.R. Jeffery, G. Alonso, M.J. Franklin, W. Hong, and J. Widom. A
Pipelined Framework for Online Cleaning of Sensor Data Streams (short
paper). Proceedings of the Twenty-Second International Conference on
Data Engineering, Atlanta, Georgia, April 2006.
http://www.cs.berkeley.edu/~jeffery/pubs/esp-icde.pdf
S.R. Jeffery, G. Alonso, M.J. Franklin, W. Hong, and
J. Widom. Declarative Support for Sensor Data Cleaning. To appear in
Proceedings of the Fourth International Conference on Pervasive
Computing, Dublin, Ireland, May 2006.
http://www.cs.berkeley.edu/~jeffery/pubs/esp-pervasive.pdf
** Estimating the Quality of DataMotion Streams
Many DataMotion applications rely on streams of data gathered from
sensors, RFID readers, and image recognition systems, among others.
These raw data streams tend to be noisy, including both false
positives (erroneous readings) and false negatives (missed readings).
Techniques exist for general-purpose cleaning of these types of data
streams, based on temporal and/or spatial correlations, as well as
properties of the physical world. Cleaning is effective at improving
the quality of the data; however, no cleaning procedure can eliminate
all errors. We have identified and addressed the problem of "quality
estimation" as object-detection data streams are cleaned. We provide
techniques for estimating both "confidence" and "coverage" as streams
are processed by cleaning modules. Detailed experimental results
based on an RFID application demonstrate the accuracy and
effectiveness of our approach.
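As a purely illustrative example of window-level quality estimation (not the paper's estimators): if an object was detected in k of the n epochs of a smoothing window, the empirical per-epoch read rate, and from it the chance of reporting the object at least once somewhere in such a window, can be estimated as follows.

```python
def read_rate(k, n):
    """Empirical per-epoch detection probability: detected in k of n epochs."""
    return k / n

def window_coverage(k, n):
    """Estimated probability that a present object is reported at least
    once in an n-epoch window, given the empirical read rate."""
    p = read_rate(k, n)
    return 1.0 - (1.0 - p) ** n
```

This treats epochs as independent trials, which is a strong simplifying assumption; the cited report develops estimators that account for how the cleaning modules themselves transform the stream.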
A. Das Sarma, S.R. Jeffery, M.J. Franklin, and J. Widom. Estimating
Data Stream Quality for Object-Detection Applications. Technical
Report, December 2005.
http://dbpubs.stanford.edu:8090/pub/2005-37
** Security Configuration Management
Security configurations are introduced as a new model for the
description and analysis of secure data systems, including systems
that handle data streams. Both the longevity and privacy of sensitive
data are considered. The model uses two basic operators: copy, which
replicates data for longevity, and split, which decomposes data
(e.g., into ciphertext and a key) for privacy. The operators can be
recursively composed to describe how data and their associated
``keys'' are managed. Various classes of configurations are defined
that have desirable properties with respect to physical realizability
and semantic correctness. Formal techniques are provided to verify
these properties for a given configuration.
B. Mungamuru and H. Garcia-Molina. Security Configuration Management.
Technical Report, available at
http://dbpubs.stanford.edu/pub/2005-41
** Safeguarding Sensitive Data
As a follow-on to our work on security configurations (see above), we
have studied how to design "good" configurations. That is,
in order to safeguard a sensitive database, we must ensure both its
privacy and its longevity. However, privacy and longevity tend to be
competing objectives. We studied the question of how to design a
system that provides both good privacy and good longevity. We
proposed metrics with which to evaluate the privacy, longevity and
performance offered by a security configuration. The search for the
"best" configuration under these metrics was then formulated as a
constrained optimization problem. Exact solution of the problem
turns out to be intractable, so we developed techniques for
efficiently finding an approximate solution.
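As a toy illustration of the formulation only, with made-up scores and exhaustive search standing in for the approximation techniques: among candidate configurations with known privacy, longevity, and cost metrics, pick the cheapest one that meets both thresholds.

```python
def best_config(candidates, min_privacy, min_longevity):
    """Cheapest candidate meeting both constraints, or None if infeasible.
    Each candidate is a dict with hypothetical 'privacy', 'longevity',
    and 'cost' scores; real configurations would be scored by the
    metrics developed in the report."""
    feasible = [c for c in candidates
                if c["privacy"] >= min_privacy
                and c["longevity"] >= min_longevity]
    return min(feasible, key=lambda c: c["cost"]) if feasible else None
```

The real search space grows combinatorially with the depth of copy/split nesting, which is why exact optimization is intractable and approximation is needed.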
B. Mungamuru, H. Garcia-Molina, and S. Mitra.
How To Safeguard Your Sensitive Data.
Technical Report, available at
http://dbpubs.stanford.edu/pub/2006-9
** Parameterized Subscriptions
In our DataMotion project, we are studying how to effectively
implement a publish/subscribe system for distributed data streams.
Traditional publish/subscribe systems commonly deal with static
subscriptions, whose event filtering criteria stay fixed once
defined. Although systems with static subscriptions are simpler to
implement, there are cases where the subscription criterion involves
state that changes frequently over time. Rather than having the user
re-submit his/her subscription repeatedly, we propose parameterized
subscriptions as a systematic solution for adaptive subscriptions.
Parameterized subscriptions depend on one or more parameters, which
are state variables stored and maintained automatically by the
publish/subscribe servers. We have modified traditional
publish/subscribe protocols in order to deal with parameterized
subscriptions. We have also looked at certain optimizations that
improve efficiency by controlling where and how much state is
allocated in the system. Finally, we developed a simple evaluation
framework to illustrate the fundamental operating differences between
several schemes.
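A minimal sketch of the idea, with hypothetical names: the subscription's predicate refers to a parameter held at the publish/subscribe server (here a stock watchlist), so updating the parameter changes what matches without the user resubmitting the subscription.

```python
class ParamSubscription:
    """A subscription whose filter depends on server-maintained state."""

    def __init__(self, params):
        self.params = params            # kept at the pub/sub server

    def update(self, key, value):
        """Parameter update: takes effect without resubmission."""
        self.params[key] = value

    def matches(self, event):
        """Filter criterion reads the current parameter value."""
        return event["symbol"] in self.params["watchlist"]
```

A static subscription would instead bake the watchlist into the predicate, forcing a full unsubscribe/resubscribe cycle on every portfolio change.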
Y. Huang and H. Garcia-Molina.
Parameterized Subscriptions in Publish/Subscribe Systems.
Data & Knowledge Engineering Journal, Elsevier, to appear 2006.
** Vehicular Networks
Although not in our original proposal, we have also explored a new
direction related to data streams: a vehicular network where cars
communicate with each other using ad-hoc communication. Such
networks will soon be a reality as cars become equipped with wireless
communication systems. One use of an inter-vehicle network is to
propagate alerts, such as accident and road-condition warnings,
within a region. Unlike previous work in the area that focuses on
instantaneous delivery of an alert to all reachable cars, our work
studied the problem where an alert needs to be maintained for a
duration of time. In other words, we must also notify cars that
become reachable after the alert begins. Maintaining an alert for a
duration is important because other cars can then take precautions or
change their travel path to avoid the condition. Moreover, we do not
require the original initiator of an alert to be stationary and
constantly repeating the alert. We formally defined the problem and
its correctness criteria. We provided an efficient protocol that minimizes the
number of broadcasts needed for maintaining a regional alert over a
period of time, and we evaluated our protocol through simulation.
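A minimal sketch of alert maintenance under our own simplifying assumptions: a car holding an unexpired alert forwards it when it comes into range of a car that lacks it, so the alert persists in the region without continuous flooding. The report's actual protocol additionally minimizes the number of broadcasts; this sketch does not attempt that.

```python
def exchange(car_a, car_b, now):
    """Called when two cars come into radio range.  Each car is a dict
    with an 'alerts' map from alert id to expiry time (hypothetical
    representation).  Returns the number of alert broadcasts made."""
    broadcasts = 0
    for src, dst in ((car_a, car_b), (car_b, car_a)):
        for alert, expiry in src["alerts"].items():
            if expiry > now and alert not in dst["alerts"]:
                dst["alerts"][alert] = expiry   # forward the live alert
                broadcasts += 1
    return broadcasts
```

Because expired alerts are never forwarded, the alert naturally dies out of the region once its maintenance duration ends, with no explicit revocation message.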
Q. Sun and H. Garcia-Molina.
Using Ad-Hoc Inter-Vehicle Networks for Regional Alerts. The Second
ACM International Workshop on Vehicular Ad Hoc Networks (VANET 2005),
Cologne, Germany, September 2, 2005.
Also available at http://dbpubs.stanford.edu/pub/2004-51
---------------------------------------
PhD Thesis:
** Efficient Query Processing for Modern Data Management
Utkarsh Srivastava, PhD Thesis
Efficient query processing in any data management system typically
relies on: (a) A profiling component that gathers statistics used to
evaluate possible query execution plans, and (b) A planning component
that picks the plan with the best predicted performance. For query
processing in a range of new data management scenarios, e.g., query
processing over data streams, and web services, traditional profiling
and planning techniques developed for conventional relational database
management systems are inadequate. We develop several novel profiling
and planning techniques to enable efficient query processing in these
new scenarios.
When data is arriving rapidly in the form of streams, and many
registered queries must be continuously executed over this data,
system resources such as memory and processing power may be stretched
to their limit. First, for a class of computation-intensive queries,
we describe how system throughput can be increased by exploiting
sharing of computation among the registered queries. Then, for a class
of memory-intensive queries, we consider the case when system memory
is insufficient for obtaining exact answers, and give techniques for
maximizing result accuracy under the given memory constraints. We then
consider a distributed setting such as that of a sensor network, and
give techniques for deciding the placement of query operators at
network nodes in order to minimize system-wide consumption of
resources.
We then consider the scenario of web services, which have been
emerging as a popular standard for sharing data and functionality
among loosely-coupled systems. For queries involving multiple web
services, we give algorithms for finding the optimal execution
plan. Finally, we turn to the profiling component, and describe new
techniques for gathering statistics by observing only query results,
not the underlying data. Such techniques are useful when data
access for collecting statistics is infeasible, as for web
services, or undesirable, as for traditional databases.