DataMotion Report May 2006
--------------------------
Broader Impact:
---------------
We conducted a technology transfer workshop on March 22, 2006, at
Stanford. At the workshop we reported on our DataMotion results of
the past two years, as well as on other work done in our group.
About 150 people participated, mainly from industry. The workshop
program can be found at
http://www-db.stanford.edu/datamotion/InfoLabWorkshop.htm
and photos from the event can be seen at
http://www-db.stanford.edu/~hector/photos/2006/infoLabWkshp/album.htm
We believe that this type of event helps disseminate
our results, and at the same time, helps
us learn about relevant problems in industry.
Research:
---------
During our third year we have continued our DataMotion research
in the four thrust areas. In what follows we describe
our results.
** Optimizing DataMotion Queries with Expensive Filters
DataMotion queries frequently need to perform expensive filtering
(e.g., image recognition, text analysis) on rapidly arriving
data. Furthermore, numerous continuous queries may be present and may
have some expensive filters in common. We consider the problem of
optimizing and executing multiple continuous queries, where each query
is a conjunction of filters and each filter may occur in multiple
queries. When filters are expensive, significant performance gains are
achieved by sharing filter evaluations across queries.
A shared execution strategy in our scenario can either be fixed, in
which filters are evaluated in the same predetermined order for all
input (suitable for relatively stable DataMotion scenarios), or
adaptive, in which the next filter to be evaluated is chosen at
runtime based on the results of the filters evaluated so far (suitable
for volatile scenarios). We present a greedy execution strategy and
show that it approximates the best strategy to within a factor
polylogarithmic in the number of queries and filters. We also show how
the execution overhead can be reduced by appropriate
precomputation. Finally, we present an experimental evaluation
demonstrating the effectiveness of our techniques.
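The flavor of adaptive shared execution can be illustrated with a small Python sketch. This is our simplification, not the report's exact algorithm: for a single input item, the next filter to evaluate is the one with the best ratio of expected query-rejection benefit to cost, and each filter's result is shared by every query containing it. All predicate, cost, and drop-probability values a caller supplies are hypothetical.

```python
def greedy_shared_eval(item, queries, predicates, cost, drop_prob):
    """Evaluate each expensive filter at most once, sharing its outcome
    across all queries.  queries: {query_id: set of filter ids}."""
    outcome = {}                                  # filter id -> bool, shared
    live = {q: set(fs) for q, fs in queries.items()}  # unresolved queries
    while live:
        candidates = {f for fs in live.values() for f in fs}
        # Greedy score: expected live queries rejected per unit of cost.
        f = max(candidates,
                key=lambda f: drop_prob[f]
                * sum(f in fs for fs in live.values()) / cost[f])
        outcome[f] = predicates[f](item)
        for q in list(live):
            if f in live[q]:
                if outcome[f]:
                    live[q].remove(f)
                    if not live[q]:
                        live.pop(q)               # all filters passed
                else:
                    live.pop(q)                   # query rejected
    return {q for q, fs in queries.items()
            if all(outcome.get(f, False) for f in fs)}
```

The scoring rule above is one plausible instantiation; the cited report proves its approximation guarantee for the specific greedy rule analyzed there.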
K. Munagala, U. Srivastava, and J. Widom. Optimization of Continuous
Queries with Shared Expensive Filters. Technical Report, November
2005.
http://dbpubs.stanford.edu:8090/pub/2005-36
** Queries in DataMotion Networks with Web Services
Web services are becoming a standard method of sharing data and
functionality among loosely-coupled systems, and they are likely to
become a prevalent method of integrating DataMotion nodes. We propose
a general-purpose Web Service Query Processor that enables querying
multiple web services in a DataMotion network in a transparent and
integrated fashion.
We have first tackled the problem of query optimization for
Select-Project-Join queries spanning multiple web services. Our main
result is an algorithm for arranging a query's web service calls into
a pipelined execution plan that optimally exploits parallelism among
web services to minimize the query's total running time. Surprisingly,
the optimal plan can be found in polynomial time even in the presence
of arbitrary precedence constraints among web services, in contrast to
traditional query optimization where the analogous problem is
NP-hard. We also give an algorithm for determining the optimal
granularity of data "chunks" to be used for each web service
call. Experiments with an initial prototype indicate that our
algorithms can lead to significant performance improvement over more
straightforward techniques.
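The objective behind the pipelined plans can be sketched as follows. Assuming, for illustration only, a known per-call cost and selectivity for each service, the per-tuple bottleneck cost of a linear order is the maximum over services of the service's cost scaled by the fraction of tuples surviving the services placed before it. The exhaustive search below merely illustrates that objective; the report's algorithm finds the optimum in polynomial time and also handles precedence constraints.

```python
from itertools import permutations

def bottleneck(order, cost, sel):
    """Per-tuple bottleneck cost of a linear pipelined plan: each service
    only sees the tuples that survived the services placed before it."""
    worst, surviving = 0.0, 1.0
    for s in order:
        worst = max(worst, surviving * cost[s])
        surviving *= sel[s]
    return worst

def best_plan(services, cost, sel):
    """Exhaustive search over linear orders, for illustration only."""
    return min(permutations(services),
               key=lambda o: bottleneck(o, cost, sel))
```

For example, a cheap, moderately selective service placed ahead of an expensive one can halve the bottleneck cost, which is why plan order matters even though all services run in parallel.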
U. Srivastava, J. Widom, K. Munagala, and R. Motwani. Query
Optimization over Web Services. Technical Report, October 2005.
http://dbpubs.stanford.edu:8090/pub/2005-30
** Cleaning of Sensor Data in a DataMotion Network
Many DataMotion applications rely on data captured from the physical
world through sensor devices. Data provided by these devices, however,
tends to be unreliable. The data must therefore be cleaned before it
is propagated into the DataMotion network. This work presents
"Extensible Sensor stream Processing" (ESP), a framework for building
sensor data cleaning infrastructures for use in DataMotion
networks. ESP is designed as a pipeline using declarative cleaning
mechanisms based on spatial and temporal characteristics of sensor
data. ESP's effectiveness has been demonstrated by experiments on
three real-world DataMotion scenarios.
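A cleaning pipeline in the spirit of ESP can be sketched as a chain of window-level stages that compose left to right. The stage bodies below are simplified stand-ins of our own devising (a range-based outlier filter and a temporal average), not ESP's actual declarative operators.

```python
def point(window):
    """Drop obvious outliers: readings outside a plausible range.
    The 0..100 bounds are hypothetical, e.g. a temperature sensor."""
    return [r for r in window if 0.0 <= r <= 100.0]

def smooth(window):
    """Temporal smoothing: replace the window with its average reading."""
    return [sum(window) / len(window)] if window else []

def pipeline(stages, window):
    """Apply cleaning stages in order, each consuming the previous output."""
    for stage in stages:
        window = stage(window)
    return window
```

For instance, `pipeline([point, smooth], readings)` first discards a stuck-sensor value like -999.0 and then averages what remains, so one faulty reading does not corrupt the smoothed output.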
S.R. Jeffery, G. Alonso, M.J. Franklin, W. Hong, and J. Widom. A
Pipelined Framework for Online Cleaning of Sensor Data Streams (short
paper). Proceedings of the Twenty-Second International Conference on
Data Engineering, Atlanta, Georgia, April 2006.
http://www.cs.berkeley.edu/~jeffery/pubs/esp-icde.pdf
S.R. Jeffery, G. Alonso, M.J. Franklin, W. Hong, and
J. Widom. Declarative Support for Sensor Data Cleaning. To appear in
Proceedings of the Fourth International Conference on Pervasive
Computing, Dublin, Ireland, May 2006.
http://www.cs.berkeley.edu/~jeffery/pubs/esp-pervasive.pdf
** Estimating the Quality of DataMotion Streams
Many DataMotion applications rely on streams of data gathered from
sensors, RFID readers, and image recognition systems, among others.
These raw data streams tend to be noisy, including both false
positives (erroneous readings) and false negatives (missed readings).
Techniques exist for general-purpose cleaning of these types of data
streams, based on temporal and/or spatial correlations, as well as
properties of the physical world. Cleaning is effective at improving
the quality of the data; however, no cleaning procedure can eliminate
all errors. We have identified and addressed the problem of "quality
estimation" as object-detection data streams are cleaned. We provide
techniques for estimating both "confidence" and "coverage" as streams
are processed by cleaning modules. Detailed experimental results
based on an RFID application demonstrate the accuracy and
effectiveness of our approach.
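As a purely illustrative example of window-level quality estimation (not the paper's estimators): if an object was detected in k of the n epochs of a smoothing window, the empirical per-epoch read rate, and from it the chance of reporting the object at least once somewhere in such a window, can be estimated as follows.

```python
def read_rate(k, n):
    """Empirical per-epoch detection probability: detected in k of n epochs."""
    return k / n

def window_coverage(k, n):
    """Estimated probability that a present object is reported at least
    once in an n-epoch window, given the empirical read rate."""
    p = read_rate(k, n)
    return 1.0 - (1.0 - p) ** n
```

This treats epochs as independent trials, which is a strong simplifying assumption; the cited report develops estimators that account for how the cleaning modules themselves transform the stream.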
A. Das Sarma, S.R. Jeffery, M.J. Franklin, and J. Widom. Estimating
Data Stream Quality for Object-Detection Applications. Technical
Report, December 2005.
http://dbpubs.stanford.edu:8090/pub/2005-37
** Security Configuration Management
Security configurations are introduced as a new model for the
description and analysis of secure data systems, including systems
that handle data streams. Both the longevity and privacy of sensitive
data are considered. The model uses two basic operators: copy, which
replicates data for longevity, and split, which decomposes data
(e.g., into ciphertext and a key) for privacy. The operators can be
recursively composed to describe how data and their associated
``keys'' are managed. Various classes of configurations are defined
that have desirable properties with respect to physical realizability
and semantic correctness. Formal techniques are provided to verify
these properties for a given configuration.
B. Mungamuru and H. Garcia-Molina. Security Configuration Management.
Technical Report, available at
http://dbpubs.stanford.edu/pub/2005-41
** Safeguarding Sensitive Data
As a follow-on to our work on security configurations (see above), we
have studied how to design "good" configurations. That is,
in order to safeguard a sensitive database, we must ensure both its
privacy and its longevity. However, privacy and longevity tend to be
competing objectives. We studied the question of how to design a
system that provides both good privacy and good longevity. We
proposed metrics with which to evaluate the privacy, longevity and
performance offered by a security configuration. The search for the
"best" configuration under these metrics was then formulated as a
constrained optimization problem. Exact solution of the problem
turns out to be intractable, so we developed techniques for
efficiently finding an approximate solution.
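As a toy illustration of the formulation only, with made-up scores and exhaustive search standing in for the approximation techniques: among candidate configurations with known privacy, longevity, and cost metrics, pick the cheapest one that meets both thresholds.

```python
def best_config(candidates, min_privacy, min_longevity):
    """Cheapest candidate meeting both constraints, or None if infeasible.
    Each candidate is a dict with hypothetical 'privacy', 'longevity',
    and 'cost' scores; real configurations would be scored by the
    metrics developed in the report."""
    feasible = [c for c in candidates
                if c["privacy"] >= min_privacy
                and c["longevity"] >= min_longevity]
    return min(feasible, key=lambda c: c["cost"]) if feasible else None
```

The real search space grows combinatorially with the depth of copy/split nesting, which is why exact optimization is intractable and approximation is needed.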
B. Mungamuru, H. Garcia-Molina, and S. Mitra.
How To Safeguard Your Sensitive Data.
Technical Report, available at
http://dbpubs.stanford.edu/pub/2006-9
** Parameterized Subscriptions
In our DataMotion project, we are studying how to effectively
implement a publish/subscribe system for distributed data streams.
Traditional publish/subscribe systems commonly deal with static
subscriptions, whose event filtering criteria stay fixed once
defined. Although systems with static subscriptions are simpler to
implement, there are cases where the subscription criterion involves
state that changes frequently over time. Rather than having the user
re-submit his/her subscription repeatedly, we propose parameterized
subscriptions as a systematic solution for adaptive subscriptions.
Parameterized subscriptions depend on one or more parameters, which
are state variables stored and maintained automatically by the
publish/subscribe servers. We have modified traditional
publish/subscribe protocols in order to deal with parameterized
subscriptions. We have also looked at certain optimizations that
improve efficiency by controlling where and how much state is
allocated in the system. Finally, we developed a simple evaluation
framework to illustrate the fundamental operating differences between
several schemes.
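A minimal sketch of the idea, with hypothetical names: the subscription's predicate refers to a parameter held at the publish/subscribe server (here a stock watchlist), so updating the parameter changes what matches without the user resubmitting the subscription.

```python
class ParamSubscription:
    """A subscription whose filter depends on server-maintained state."""

    def __init__(self, params):
        self.params = params            # kept at the pub/sub server

    def update(self, key, value):
        """Parameter update: takes effect without resubmission."""
        self.params[key] = value

    def matches(self, event):
        """Filter criterion reads the current parameter value."""
        return event["symbol"] in self.params["watchlist"]
```

A static subscription would instead bake the watchlist into the predicate, forcing a full unsubscribe/resubscribe cycle on every portfolio change.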
Y. Huang and H. Garcia-Molina.
Parameterized Subscriptions in Publish/Subscribe Systems.
Data & Knowledge Engineering Journal, Elsevier, to appear 2006.
** Vehicular Networks
Although not in our original proposal, we have also explored a new
direction related to data streams: a vehicular network where cars
communicate with each other using ad-hoc communication. Such
networks will soon be a reality as cars become equipped with wireless
communication systems. One use of an inter-vehicle network is to
propagate alerts, such as accident and road-condition warnings,
within a region. Unlike previous work in the area that focuses on
instantaneous delivery of an alert to all reachable cars, our work
studied the problem where an alert needs to be maintained for a
duration of time. In other words, we must also notify cars that
become reachable after the alert begins. Maintaining an alert for a
duration is important because other cars can then take precautions or
change their travel path to avoid the condition. Moreover, we do not
require the original initiator of an alert to be stationary and
constantly repeating the alert. We formally defined the problem and
its correctness criteria. We provided an efficient protocol that minimizes the
number of broadcasts needed for maintaining a regional alert over a
period of time, and we evaluated our protocol through simulation.
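A minimal sketch of alert maintenance under our own simplifying assumptions: a car holding an unexpired alert forwards it when it comes into range of a car that lacks it, so the alert persists in the region without continuous flooding. The report's actual protocol additionally minimizes the number of broadcasts; this sketch does not attempt that.

```python
def exchange(car_a, car_b, now):
    """Called when two cars come into radio range.  Each car is a dict
    with an 'alerts' map from alert id to expiry time (hypothetical
    representation).  Returns the number of alert broadcasts made."""
    broadcasts = 0
    for src, dst in ((car_a, car_b), (car_b, car_a)):
        for alert, expiry in src["alerts"].items():
            if expiry > now and alert not in dst["alerts"]:
                dst["alerts"][alert] = expiry   # forward the live alert
                broadcasts += 1
    return broadcasts
```

Because expired alerts are never forwarded, the alert naturally dies out of the region once its maintenance duration ends, with no explicit revocation message.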
Q. Sun and H. Garcia-Molina.
Using Ad-Hoc Inter-Vehicle Networks for Regional Alerts. The Second
ACM International Workshop on Vehicular Ad Hoc Networks (VANET 2005),
Cologne, Germany, September 2, 2005.
Also available at http://dbpubs.stanford.edu/pub/2004-51
---------------------------------------
PhD Thesis:
** Efficient Query Processing for Modern Data Management
Utkarsh Srivastava, PhD Thesis
Efficient query processing in any data management system typically
relies on: (a) A profiling component that gathers statistics used to
evaluate possible query execution plans, and (b) A planning component
that picks the plan with the best predicted performance. For query
processing in a range of new data management scenarios, e.g., query
processing over data streams, and web services, traditional profiling
and planning techniques developed for conventional relational database
management systems are inadequate. We develop several novel profiling
and planning techniques to enable efficient query processing in these
new scenarios.
When data is arriving rapidly in the form of streams, and many
registered queries must be continuously executed over this data,
system resources such as memory and processing power may be stretched
to their limit. First, for a class of computation-intensive queries,
we describe how system throughput can be increased by exploiting
sharing of computation among the registered queries. Then, for a class
of memory-intensive queries, we consider the case when system memory
is insufficient for obtaining exact answers, and give techniques for
maximizing result accuracy under the given memory constraints. We then
consider a distributed setting such as that of a sensor network, and
give techniques for deciding the placement of query operators at
network nodes in order to minimize system-wide consumption of
resources.
We then consider the scenario of web services, which have been
emerging as a popular standard for sharing data and functionality
among loosely-coupled systems. For queries involving multiple web
services, we give algorithms for finding the optimal execution
plan. Finally, we turn to the profiling component, and describe new
techniques for gathering statistics by observing only query results,
not the underlying data. Such techniques are useful when data
access for collecting statistics is infeasible, as for web
services, or undesirable, as for traditional databases.