FROM THE WEB TO THE GLOBAL INFOBASE

Hector Garcia-Molina (PI), Chris Manning (co-PI), Jeff Ullman (co-PI), Jennifer Widom (co-PI)
Department of Computer Science, Stanford University

Contact Information

Hector Garcia-Molina, Chris Manning, Jeff Ullman, Jennifer Widom
Dept. of Computer Science, Gates 4A
Stanford University
Stanford, CA 94305-9040
Phone: (650) 723-0872
Fax: (650) 725-2588
Email: {hector,manning,ullman,widom}@cs.stanford.edu
URLs:
http://www-db.stanford.edu/people/hector.html
http://www-nlp.stanford.edu/~manning
http://www-db.stanford.edu/~ullman
http://www-db.stanford.edu/~widom

WWW page

Our project's main Web page is at http://www-db.stanford.edu/gib/. Other sites relevant to the grant include the DB Group home page, the Infolab home page, the NLP Group home page, and the Digital Libraries project home page.

Project Award Information

Keywords

World-Wide Web, information retrieval, database, natural-language processing, data mining

Project Summary

Our proposed work is driven by the vision of a Global InfoBase (GIB): a ubiquitous and universal information resource, simple to use, up to date, and comprehensive. The project consists of four interrelated thrusts: (i) Combining Technologies: integrating technologies for information retrieval, database management, and hypertext navigation, to achieve a "universal" information model; (ii) Personalization: developing tools for personalizing information management; (iii) Semantics: using natural-language processing and structural techniques for analyzing the semantics of Web pages; and (iv) Data Mining: designing new algorithms for mining information in order to synthesize new knowledge.

Publications and Products

  1. Sriram Raghavan and Hector Garcia-Molina. Integrating Diverse Information Management Systems: A Brief Survey. IEEE Data Engineering Bulletin, December 2001.
  2. Arvind Arasu and Hector Garcia-Molina. Extracting structured data from web pages. Proceedings of the ACM SIGMOD International Conference on Management of Data, June 2003.
  3. Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram Raghavan. Searching the Web. ACM Transactions on Internet Technology, 1(1):2-43, August 2001.
  4. Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. Models and issues in data stream systems. Proceedings of the ACM Symposium on Principles of Database Systems, pages 1-16, 2002.
  5. Brian Babcock and Chris Olston. Distributed top-k monitoring. Proceedings of the ACM SIGMOD International Conference on Management of Data, June 2003.
  6. Cheng Yang. "Music Database Retrieval Based on Spectral Similarity." In International Symposium on Music Information Retrieval, October 2001.
  7. T. Haveliwala. Search Facilities for Internet Relay Chat. Proceedings of the Joint Conference on Digital Libraries (Poster session), 2002.
  8. C. Olston and J. Widom. Best-Effort Cache Synchronization with Source Cooperation. To appear in Proceedings of the ACM SIGMOD International Conference on Management of Data, 2002.
  9. C. Olston and J. Widom. Approximate Caching for Continuous Queries over Distributed Data Sources. Technical Report, Stanford University, February 2002.
  10. C. Olston, B. T. Loo, and J. Widom. Adaptive Precision Setting for Cached Approximate Values. Proceedings of the ACM SIGMOD International Conference on Management of Data, May 2001.
  11. D. Klein and T. Haveliwala. Concise Labeling of Document Clusters. Submitted. Technical Report, Stanford University, April 2002.
  12. Sriram Raghavan and Hector Garcia-Molina. Crawling the hidden Web. Proceedings of the 27th International Conf. on Very Large Databases (VLDB), pp. 129-138, September 2001.
  13. T. Haveliwala. Topic-Sensitive PageRank. Proceedings of the Eleventh International World Wide Web Conference, 2002.
  14. T. Haveliwala, A. Gionis, D. Klein, and P. Indyk. Evaluating Strategies for Similarity Search on the Web. Proceedings of the Eleventh International World Wide Web Conference, 2002.
  15. D. Klein, S. Kamvar, and C. Manning. From Instance-level Constraints to Space-level Constraints: Making the Most of Prior Knowledge in Data Clustering. Proceedings of the Nineteenth International Conference on Machine Learning, 2002.
  16. S. Kamvar, D. Klein, and C. Manning. Interpreting and Extending Classical Agglomerative Clustering Algorithms using a Model-Based Approach. Proceedings of the Nineteenth International Conference on Machine Learning, 2002.
  17. Glen Jeh and Jennifer Widom. SimRank: A Measure of Structural-Context Similarity. Technical Report, Computer Science Department, Stanford University, 2001.
  18. Glen Jeh and Jennifer Widom. Scaling Personalized Web Search. In Proceedings of the 12th International World Wide Web Conference, May 2003.
  19. Dan Klein, Joseph Smarr, Huy Nguyen, and Christopher D. Manning. Named Entity Recognition with Character-Level Models. In CoNLL 2003.
  20. Sepandar D. Kamvar, Taher H. Haveliwala, Christopher D. Manning, and Gene H. Golub. Extrapolation methods for accelerating PageRank computations. In Proceedings of the 12th International World Wide Web Conference, May 2003.
  21. Sepandar D. Kamvar, Taher H. Haveliwala, Christopher D. Manning, and Gene H. Golub. Exploiting the block structure of the web for computing PageRank. Technical Report, Computer Science Department, Stanford University, 2003.
  22. C. Olston, J. Jiang, and J. Widom. Adaptive filters for continuous queries over distributed data streams. Proceedings of the ACM SIGMOD International Conference on Management of Data, June 2003.
  23. Sriram Raghavan and Hector Garcia-Molina. Representing Web graphs. Proceedings of the IEEE International Conference on Data Engineering, March 2003.
  24. Sriram Raghavan and Hector Garcia-Molina. Complex queries over Web repositories. To appear in the Proceedings of the 29th International Conference on Very Large Databases (VLDB), September 2003.
  25. Dan Klein, Joseph Smarr, Huy Nguyen, and Christopher D. Manning. Named Entity Recognition with Character-Level Models. Proceedings of the Seventh Conference on Natural Language Learning (CoNLL), 2003, pp. 180-183.
  26. Sepandar D. Kamvar, Dan Klein, and Christopher D. Manning. Spectral Learning. Proceedings of the International Joint Conference on Artificial Intelligence, 2003.
  27. Dan Klein and Christopher D. Manning. Accurate Unlexicalized Parsing. Proceedings of the Association for Computational Linguistics, 2003.
  28. Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. Proceedings of Human Language Technology/North American Association for Computational Linguistics, 2003.
  29. Dan Klein and Christopher D. Manning. A* Parsing: Fast Exact Viterbi Parse Selection. Proceedings of Human Language Technology/North American Association for Computational Linguistics, 2003.
  30. Dan Klein and Christopher D. Manning. Fast Exact Inference with a Factored Model for Natural Language Parsing. To appear in Suzanna Becker, Sebastian Thrun, and Klaus Obermayer (eds.), Advances in Neural Information Processing Systems 15 (NIPS 2002).


Project Impact

Goals, Objectives, and Targeted Activities

·  Combining Technologies: One of the main goals of our project was to develop a “universal” data model that combines facilities for simultaneously accessing text, relational data, and hypertext links. We have developed such a model, as well as a prototype for executing and optimizing queries in that model.  In particular, we have developed: (i) a formal data model for precisely defining query semantics, (ii) a query algebra for formulating complex queries, and (iii) techniques to optimize and efficiently execute these queries over massive data sets.  Our algebra extends traditional relational operators with graph-navigation operators, and adds the notion of ranked and ordered relations.  We have applied our model and optimizer to our WebBase repository (see below), and have shown how Web pages can be searched based on their content, their properties (e.g., PageRank), or their link structure.  We have measured the running time of representative complex Web queries, and have shown that our optimizer can reduce running time by a factor of 2 or 3 in many cases.
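
To make the flavor of the combined model concrete, the following is a small illustrative sketch (in Python) of how such a query might compose a relational-style selection, a link-navigation step, and a ranking step. The operator names and the page representation used here (filter_text, navigate, rank_by, and one dictionary per page) are hypothetical conveniences for this sketch only; they are not the actual operators or implementation of our algebra and optimizer.

    # Purely illustrative: the operator names and page representation below are
    # hypothetical and are NOT the actual operators of the WebBase query algebra.

    def filter_text(pages, keyword):
        """Relational-style selection on page content."""
        return [p for p in pages if keyword in p["text"]]

    def navigate(pages, index, hops=1):
        """Graph-navigation operator: follow out-links for `hops` steps."""
        frontier = list(pages)
        for _ in range(hops):
            frontier = [index[url]
                        for page in frontier
                        for url in page["links"]
                        if url in index]
        return frontier

    def rank_by(pages, key="pagerank", k=10):
        """Ordering operator: produce a ranked relation of the top-k pages."""
        return sorted(pages, key=lambda p: p[key], reverse=True)[:k]

    # Example query: the 10 highest-PageRank pages reachable in one hop from
    # pages that mention "database".  `repo` is a list of page dicts with
    # "url", "text", "links", and "pagerank" fields; `index` maps url -> page.
    # results = rank_by(navigate(filter_text(repo, "database"), index), k=10)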

·  Personalization: Our Global Information Base makes so much information accessible that we must be very careful not to exacerbate the well-documented 'information overload' problem.  We address information overload by enabling personalized views of information sources, particularly Web sources.  Most Web search engines compute absolute rankings for Web pages, independent of the user or search context.  In [13] we developed and implemented a topic-sensitive, link-based ranking measure for Web search that exploits search context, including query context (e.g., query history) and user context (e.g., bookmarks and browsing history), to enhance the precision of search engines.  We then took a further step in personalization, enabling a fully customizable link-based ranking measure for each user; this work is described in [18].  “Personalized PageRank” lets users designate any set of Web pages as 'more important' when computing link-based importance rankings.  Once the problem was formalized, the challenge was primarily one of efficiency and scalability: in practice, Personalized PageRank may need to compute 2^N different rankings, where N is the number of pages on the Web!  Our work focused on efficient algorithms that combine offline and online computation, and that share state and computation as much as possible while computing fully personalized and customized rankings.
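
The underlying computation can be illustrated with a minimal sketch of PageRank with a personalized teleport vector, written here in Python with NumPy. This toy version recomputes a ranking from scratch for a single preference set and is exactly the naive approach that does not scale; the contribution of [13] and [18] lies in the offline/online decomposition and sharing of computation that this sketch omits.

    import numpy as np

    def personalized_pagerank(adj, preferred, alpha=0.85, tol=1e-8, max_iter=100):
        """Power iteration for PageRank with a personalized teleport vector.

        adj: n x n adjacency matrix (adj[i, j] = 1 if page i links to page j).
        preferred: indices of the pages this user marks as 'more important';
                   teleportation is restricted to these pages.
        Illustrative sketch only, not the algorithms of [13] or [18].
        """
        n = adj.shape[0]
        out_deg = adj.sum(axis=1, keepdims=True)
        with np.errstate(divide="ignore", invalid="ignore"):
            # Row-stochastic transition matrix; dangling pages jump uniformly.
            M = np.where(out_deg > 0, adj / out_deg, 1.0 / n)
        v = np.zeros(n)
        v[list(preferred)] = 1.0 / len(preferred)   # personalization vector
        r = np.full(n, 1.0 / n)
        for _ in range(max_iter):
            r_new = alpha * (M.T @ r) + (1 - alpha) * v
            if np.abs(r_new - r).sum() < tol:
                return r_new
            r = r_new
        return r

    # Example: rank 4 pages, personalizing toward page 0 (e.g., a bookmark).
    # adj = np.array([[0,1,1,0],[1,0,0,1],[0,1,0,1],[1,0,0,0]], dtype=float)
    # print(personalized_pagerank(adj, preferred=[0]))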


·  Semantics: We continue to attack the problem of getting more meaning out of Web pages and other text than simply the lists of words they contain.  In [12], we address the problem of designing a crawler capable of extracting content from the hidden Web (pages behind forms). We introduce a new Layout-based Information Extraction Technique (LITE) and demonstrate its use in automatically extracting semantic information from search forms and response pages.  In [2], we present results on learning Web wrappers for sites with regular structure. Many Web sites contain large sets of pages generated from a database using a common template. We have developed an algorithm that uses sets of words with similar occurrence patterns across the input pages to reconstruct the template, and then extracts the database values from the pages. Experiments show that the extracted values make semantic sense in most cases.  Other Web pages contain unstructured text paragraphs, and [25] examines methods for recognizing person, company, and other names for indexing and extraction, particularly exploiting discriminative machine-learning techniques and character-based models, which are very robust (for instance, largely immune to the kind of obfuscation techniques used by spammers).  [26] examines the use of spectral methods (related to those used in link-analysis methods like PageRank) for the text-based tasks of document clustering and classification, showing how the eigenspace can efficiently bring out the groupings of documents in the space.  [27, 28, 29, 30] provide methods for natural-language analysis, finding parts of speech and parsing sentences, focusing on methods that are both efficient to implement and highly accurate.  This kind of analysis is a basis for more detailed semantic understanding of unstructured text passages.
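
As a small illustration of the occurrence-pattern idea behind template construction, the following Python sketch groups tokens whose per-page occurrence counts are identical and treats groups that appear on every page as likely template text rather than database values. It is a toy approximation for exposition only, not the algorithm of [2], which addresses many complications that this sketch ignores.

    from collections import defaultdict

    def occurrence_vectors(pages):
        """Map each token to its vector of per-page occurrence counts."""
        vectors = defaultdict(lambda: [0] * len(pages))
        for i, text in enumerate(pages):
            for tok in text.split():
                vectors[tok][i] += 1
        return vectors

    def candidate_template_tokens(pages):
        """Group tokens with identical occurrence patterns across pages;
        groups that occur on every page are likely part of the shared template."""
        groups = defaultdict(set)
        for tok, vec in occurrence_vectors(pages).items():
            groups[tuple(vec)].add(tok)
        return [toks for vec, toks in groups.items() if all(c > 0 for c in vec)]

    # pages = [page_text_1, page_text_2, ...]  # pages generated from one template
    # print(candidate_template_tokens(pages))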

·  Data Mining: We have continued our work on mining the Web.  A central component of this work is our WebBase repository, a testbed supporting research into many aspects of the World Wide Web, including experimentation aimed at understanding, optimally utilizing, and improving the Web. The system's highly configurable crawlers collect large numbers of Web pages and store them locally. We test novel algorithms on this collection, such as algorithms for ranking, filtering, and Web-linkage mapping. In addition, we enable other researchers around the world to request high-speed 'feeds' of all or part of our collection. Such a feed allows these researchers to focus on their investigations rather than on constructing crawlers. In fact, even with our present, small system, we have been able to help colleges around the country, supporting them in both their research and teaching.

Using WebBase, we have been studying methods for speeding up the computation of PageRank.  In addition to making a very time-consuming operation more efficient, our results make search personalization (i.e., giving each individual a Web-page-weighting function that reflects their personal interests; see above) feasible.  In particular, in [20], we look at the sequence of 50 or so iterations of the matrix-vector multiplication that converges to the PageRank (the principal eigenvector of the matrix).  We observe that, in a manner reminiscent of Newton's method, we can predict where the terms of the vector are heading and approximately "jump ahead" in the calculation.  The net effect is that about half as many iterations are needed for a fixed level of convergence, while the extrapolation itself adds almost nothing to the cost of a single iteration.  In [21] we take a completely different approach. Exploiting the fact that there are clearly defined and easily discoverable regions of the Web that are dense in links (e.g., a university's portion of the .edu domain, or a typical person's Web site), we propose computing PageRank by first clustering the Web into some number of these regions and then computing the local PageRank within each region.  Next, we use the local PageRanks to estimate how much importance one region will pass to another, and compute the PageRank on the graph of regions only.  Finally, the ordinary PageRank calculation is initialized with each page's estimated PageRank set to the product of its local PageRank and the PageRank of its region.
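
The extrapolation idea can be illustrated with the following simplified Python sketch, which runs the standard power iteration and periodically applies a componentwise Aitken-style "jump ahead" step. The parameter choices and the particular extrapolation formula here are illustrative assumptions; the actual algorithms, and their convergence analysis, are those developed in [20].

    import numpy as np

    def pagerank_extrapolated(M, alpha=0.85, tol=1e-8, max_iter=200,
                              extrapolate_every=10):
        """Power iteration for PageRank with periodic Aitken-style extrapolation.

        M: column-stochastic transition matrix (M[i, j] = probability of moving
        from page j to page i).  Simplified sketch of the idea in [20], not the
        algorithms or analysis developed there.
        """
        n = M.shape[0]
        v = np.full(n, 1.0 / n)              # uniform teleport vector
        x = v.copy()
        x_prev1 = x_prev2 = None
        for k in range(1, max_iter + 1):
            x_new = alpha * (M @ x) + (1 - alpha) * v
            if np.abs(x_new - x).sum() < tol:
                return x_new
            x_prev2, x_prev1, x = x_prev1, x, x_new
            if x_prev2 is not None and k % extrapolate_every == 0:
                # Componentwise Aitken delta-squared step: predict where each
                # component is heading, jump there, then renormalize.
                den = x - 2 * x_prev1 + x_prev2
                mask = np.abs(den) > 1e-12
                x_jump = x.copy()
                x_jump[mask] = x[mask] - (x[mask] - x_prev1[mask]) ** 2 / den[mask]
                x = np.abs(x_jump) / np.abs(x_jump).sum()
        return x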

Project References

The main references are listed under Publications and Products above.

Area Background

The World-Wide Web has created a resource that comprises much of the world's knowledge and incorporates a progressively larger fraction each year. Yet today our ability to use the Web as an information resource -- whether to advance science, to enhance our lives, or just to get the best deal on a videotape -- is in a primitive state. Information on the Web is often hard to find and may be of dubious quality. Although information is presented in a universal HTML format, there are many fundamental differences across sites: words have different meanings, information is structured differently or not at all, and query interfaces vary widely. Our ultimate goal is not to replace the Web with a new information resource, but rather to add functionality and tools to the Web -- to transform it into the Global InfoBase.

Area References

Please see Publications and Products above.