Hector Garcia-Molina (PI), Chris Manning (co-PI), Jeff Ullman (co-PI), Jennifer Widom (co-PI)
Department of Computer Science, Stanford University
Hector Garcia-Molina, Chris Manning, Jeff Ullman, Jennifer Widom
Dept. of Computer Science, Gates 4A
Stanford University
Stanford, CA 94305-9040
Phone: (650) 723-0872
Fax: (650) 725-2588
Email: {hector,manning,ullman,widom}@cs.stanford.edu
URLs:
http://www-db.stanford.edu/people/hector.html
http://www-nlp.stanford.edu/~manning
http://www-db.stanford.edu/~ullman
http://www-db.stanford.edu/~widom
Sites relevant to the grant include: DB Group home page, InfoLab home page, NLP Group home page, Digital Libraries project home page.
Project Award Information
Keywords
World-Wide Web, information retrieval, database, natural-language processing, data mining
Project Summary
Our proposed work is driven by the vision of a Global InfoBase (GIB): a ubiquitous and universal information resource, simple to use, up to date, and comprehensive. The project consists of four interrelated thrusts: (i) Combining Technologies: integrating technologies for information retrieval, database management, and hypertext navigation to achieve a "universal" information model; (ii) Personalization: developing tools for personalizing information management; (iii) Semantics: using natural-language processing and structural techniques for analyzing the semantics of Web pages; and (iv) Data Mining: designing new algorithms for mining information in order to synthesize new knowledge.
Publications and Products
Project Impact
Goals, Objectives, and Targeted Activities
· Combining Technologies: We cannot achieve the challenging goals of our Global Information Base without (a) extending existing technologies to work in vast information spaces and (b) developing new technologies where existing ones fail to scale. We investigated several directions. Multi-Model Queries: In [1] we examine how relational and full-text queries may be combined to yield results over multiple sources with radically different data models (e.g., "Find all Web pages that contain the phrase 'National Science Foundation' and that are linked to by at least 10 other Web pages."). For our testbed we use WebBase, a 120-million-page repository developed as part of our Digital Library sister project.
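For concreteness, the sketch below evaluates such a mixed query over a toy in-memory collection; the page table and its fields are invented stand-ins, not the WebBase API or the query operators of [1].

# Sketch: combine a full-text predicate with a link-structure predicate,
# as in "pages containing the phrase X with at least 10 in-links".
# The in-memory 'pages' dict is a toy stand-in for a repository like WebBase.

def multi_model_query(pages, phrase, min_inlinks):
    """pages: {url: {"text": str, "outlinks": [url, ...]}}"""
    # Pass 1 (structural): count in-links for every page.
    inlinks = {url: 0 for url in pages}
    for page in pages.values():
        for target in page["outlinks"]:
            if target in inlinks:
                inlinks[target] += 1
    # Pass 2 (full-text): phrase containment, combined with the link predicate.
    return [url for url, page in pages.items()
            if phrase.lower() in page["text"].lower()
            and inlinks[url] >= min_inlinks]

pages = {
    "a": {"text": "The National Science Foundation funds research.", "outlinks": ["b"]},
    "b": {"text": "Unrelated page.", "outlinks": ["a"]},
    "c": {"text": "NSF home.", "outlinks": ["a"]},
}
print(multi_model_query(pages, "National Science Foundation", 2))  # -> ['a']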
Search Over New Sources: We have been integrating new classes of information into the Global InfoBase: music and Internet chat rooms. Sound analysis is a notoriously difficult problem, especially in the realm of music, where acoustically very different signals can nevertheless represent identical material at some semantic level. In [2] we document our new system for retrieving similar music pieces from an audio database without metadata or other symbolic information.
The real-time, conversational nature of Internet Relay Chat (IRC) poses a number of interesting problems for indexing archives to support effective search; we present our preliminary results in [3].
Finally, in our technology-integration thrust, we developed new algorithms for scalable caches that serve vast numbers of cooperating sources. In [4], we present a best-effort synchronization scheduling policy that exploits cooperation between the data sources and the cache.
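A minimal sketch of the flavor of such a policy follows (the change-rate proxy and budget model are illustrative assumptions, not the policy of [4]): with a fixed refresh budget per cycle, the cache refreshes the sources with the largest expected number of unseen changes.

# Sketch: each cooperating source reports its change rate (changes/hour);
# the cache refreshes the most stale-prone sources under a per-cycle budget.

def choose_refreshes(sources, budget):
    """sources: {name: (change_rate, hours_since_sync)}; returns names to refresh."""
    # Expected number of unseen changes ~ rate * elapsed time (a simple proxy
    # for staleness; a real policy would model freshness probabilistically).
    staleness = {name: rate * age for name, (rate, age) in sources.items()}
    ranked = sorted(staleness, key=staleness.get, reverse=True)
    return ranked[:budget]

sources = {
    "news": (10.0, 1.0),    # hot source, synced recently
    "blog": (0.5, 24.0),    # slow source, stale for a day
    "wiki": (2.0, 8.0),
}
print(choose_refreshes(sources, budget=2))  # -> ['wiki', 'blog']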
· Personalization: If our Global Information Base harbors a dark side, it is an escalation of the well-documented 'information overload' problem. We have therefore focused on personalizing the interaction with information sources. Context-Sensitive Search: Many Web search engines compute absolute rankings for Web pages. While highly effective, such a ranking does not take the user's context into account. In [9], we developed and implemented a topic-sensitive, link-based ranking measure for Web search that exploits search context, including query context (e.g., query history) and user context (e.g., bookmarks and browsing history), to enhance the precision of search engines.
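For concreteness, here is a minimal sketch of a topic-biased link ranking in the spirit of [9]; the tiny graph, damping value, and the encoding of context as a teleport set are illustrative assumptions:

import numpy as np

def topic_sensitive_rank(adj, context_pages, damping=0.85, iters=50):
    """Power iteration where random jumps land only on context pages."""
    n = len(adj)
    # Column-stochastic transition matrix from the adjacency matrix.
    out = adj.sum(axis=1, keepdims=True)
    M = np.where(out > 0, adj / np.maximum(out, 1), 1.0 / n).T
    # Teleport vector biased toward the user's context (e.g., bookmarks).
    v = np.zeros(n)
    v[context_pages] = 1.0 / len(context_pages)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = damping * M @ r + (1 - damping) * v
    return r

# 4-page toy graph; the user's context says pages 2 and 3 are relevant.
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]], dtype=float)
print(topic_sensitive_rank(adj, context_pages=[2, 3]).round(3))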
Structure-Based Similarity: In [13] and [14] we document our efforts to deduce Web-page similarity through analysis of Web structure; for example, pages with common parentage might be related.
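A simple instance of this idea (a sketch, not the measures developed in [13,14]) scores two pages by the overlap of the sets of pages that link to them:

def parent_similarity(links, page_a, page_b):
    """Jaccard overlap of the pages that link to page_a and page_b.

    links: {parent_url: [child_url, ...]} -- a toy link table.
    """
    parents_a = {p for p, kids in links.items() if page_a in kids}
    parents_b = {p for p, kids in links.items() if page_b in kids}
    if not parents_a and not parents_b:
        return 0.0
    return len(parents_a & parents_b) / len(parents_a | parents_b)

links = {"hub1": ["x", "y"], "hub2": ["x", "y", "z"], "hub3": ["z"]}
print(parent_similarity(links, "x", "y"))  # 1.0: always co-cited
print(parent_similarity(links, "x", "z"))  # ~0.33: shares only hub2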
Personalized-Precision Retrieval: Building on our scalable caching work [4], we were able to personalize querying in a novel way: users specify standing queries augmented by a specification of how up to date the results need to be. This quality-of-service measure is important to our Global Information Base, because standing queries are only feasible if the system has some 'wiggle room' to optimize its operation [5,6].
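The mechanism can be sketched as follows (the real system of [5,6] schedules across many queries and sources; the class and names here are invented for illustration): a standing query is answered from cache while the cached answer's age stays within the user's staleness bound, and only beyond that bound must the system pay to re-evaluate.

class StandingQuery:
    def __init__(self, query, max_staleness):
        self.query = query
        self.max_staleness = max_staleness  # seconds of tolerated staleness
        self.cached_result, self.cached_at = None, None

    def answer(self, now, evaluate_at_sources):
        fresh_enough = (self.cached_at is not None
                        and now - self.cached_at <= self.max_staleness)
        if not fresh_enough:  # wiggle room exhausted: refresh from sources
            self.cached_result = evaluate_at_sources(self.query)
            self.cached_at = now
        return self.cached_result

q = StandingQuery("new digital-library papers", max_staleness=3600)
print(q.answer(now=0, evaluate_at_sources=lambda s: ["paper1"]))    # refreshes
print(q.answer(now=1800, evaluate_at_sources=lambda s: ["never"]))  # cached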
· Semantics: We are attacking the problem of getting more meaning out of Web pages than simply the lists of words they contain. Much work has been done on clustering documents, but little on the problem of labeling clusters. For effective human navigation, the quality of the labeling is at least as important as the quality of the underlying clustering technique; [7] studies the effectiveness of current labeling techniques and devises algorithms for generating more effective labels.
In [8], we address the problem of designing a crawler capable of extracting content from the hidden Web (pages behind forms). We introduce a new Layout-based Information Extraction Technique (LITE) and demonstrate its use in automatically extracting semantic information from search forms and response pages. In ongoing unpublished work, we are continuing work on Web wrappers. Many Web sites contain large sets of pages generated from a database using a common template. We are developing an algorithm that extracts database values from such pages by using sets of words with similar occurrence patterns across the input pages to construct the template. Experiments show that the extracted values make semantic sense in most cases.
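A much-simplified sketch of the occurrence-pattern idea follows; it treats words that occur on every input page as template text, whereas our actual algorithm handles far richer patterns:

# Sketch: words sharing the same occurrence pattern across all input pages
# are grouped; the group present on every page approximates the template,
# and the per-page leftovers approximate the database values.

def split_template(pages):
    """pages: list of page texts generated from one template."""
    patterns = {}
    for i, text in enumerate(pages):
        for word in set(text.split()):
            patterns.setdefault(word, set()).add(i)
    all_pages = set(range(len(pages)))
    template = {w for w, occ in patterns.items() if occ == all_pages}
    values = [[w for w in text.split() if w not in template] for text in pages]
    return template, values

pages = ["Title: Alice Price: 10", "Title: Bob Price: 20"]
template, values = split_template(pages)
print(sorted(template))  # ['Price:', 'Title:']
print(values)            # [['Alice', '10'], ['Bob', '20']]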
· Data Mining: Work in this area applies data-mining and machine-learning techniques to automate tasks of Web analysis. In [10], we developed a methodology for evaluating various strategies for similarity search on the Web, using the Open Directory (a free Yahoo!-like hierarchy) as an external quality measure. The space of alternatives includes link structure, words in the documents themselves, words in the "anchor text" leading to a document, and words surrounding the anchor. The best results use the text surrounding the anchor, weighted by its closeness to the anchor.
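The following sketch shows the spirit of that representation (the window size and the 1/(1+distance) weighting are invented for illustration): words near an anchor pointing to a page contribute to that page's description, discounted by their distance from the anchor.

from collections import defaultdict

def anchor_context_vector(tokens, anchor_pos, target, window=4, vectors=None):
    """Add distance-weighted words around one anchor to the target's vector.

    tokens: tokenized citing page; anchor_pos: index of the anchor token.
    """
    vectors = vectors if vectors is not None else defaultdict(dict)
    lo = max(0, anchor_pos - window)
    hi = min(len(tokens), anchor_pos + window + 1)
    for i in range(lo, hi):
        if i == anchor_pos:
            continue
        weight = 1.0 / (1 + abs(i - anchor_pos))  # closer words count more
        vec = vectors[target]
        vec[tokens[i]] = vec.get(tokens[i], 0.0) + weight
    return vectors

tokens = "the premier conference on databases [LINK] accepts papers yearly".split()
vecs = anchor_context_vector(tokens, anchor_pos=5, target="conf-home")
print(sorted(vecs["conf-home"].items(), key=lambda kv: -kv[1])[:3])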
Clustering is a central problem in exploratory data mining, with strong Web applications, and we have explored two more fundamental pieces of research here. [11] presents an improved method for data clustering in the presence of sparse prior knowledge, given in the form of pairwise instance constraints. By allowing these constraints to have space-level effects, we are able to exploit them more effectively than prior work.
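One simple way to give must-link constraints a space-level effect can be sketched as follows (this is not the method of [11]): the transitive closure of the constraints collapses constrained points before an ordinary k-means step, so a single constraint reshapes the space rather than one pair of points.

import numpy as np

def must_link_components(n, constraints):
    """Transitive closure of must-link pairs via union-find."""
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for a, b in constraints:
        parent[find(a)] = find(b)
    return [find(i) for i in range(n)]

def constrained_kmeans(X, k, must_link, iters=20):
    comp = must_link_components(len(X), must_link)
    roots = sorted(set(comp))
    # Collapse each must-link component to its centroid, so the constraint
    # reshapes the space seen by k-means rather than a single pair of points.
    reps = np.array([X[[i for i in range(len(X)) if comp[i] == r]].mean(axis=0)
                     for r in roots])
    centers = reps[:k].copy()  # naive deterministic seeding for the sketch
    for _ in range(iters):
        assign = ((reps[:, None] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        centers = np.array([reps[assign == c].mean(axis=0) for c in range(k)])
    rep_label = dict(zip(roots, assign))
    return [rep_label[comp[i]] for i in range(len(X))]

X = np.array([[0.0, 0], [0.1, 0], [5.0, 5], [5.1, 5], [2.4, 2.4]])
# The ambiguous middle point is must-linked to a point in the left cluster.
print(constrained_kmeans(X, k=2, must_link=[(4, 0)]))  # point 4 joins points 0, 1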
In [12], we observe that classical hierarchical agglomerative clustering methods, while widely used, have lacked a solid theoretical foundation, and we remedy this situation by providing probabilistic generative models for these methods.
Project References.
The main references are listed under Publications and Products above.
Area Background
The World-Wide Web
has created a resource comprising much of the world's knowledge and is
incorporating a progressively larger fraction each year. Yet today our ability
to use the Web as an information resource -- whether to advance science, to
enhance our lives, or just to get the best deal on a videotape -- is in a
primitive state. Information on the Web is often hard to find and may be of
dubious quality. Although information is presented in a universal HTML format,
there are many fundamental differences across sites: words have different
meanings, information is structured differently or not at all, and query
interfaces vary widely. Our ultimate goal is not to replace the Web with a new
information resource, but rather to add functionality and tools to the Web --
to transform it into the Global InfoBase.
Area References.
Please see Publications and Products above.