MG = Managing Gigabytes, by Witten, Moffat, and Bell.
MIR = Modern Information Retrieval, by Baeza-Yates and Ribeiro-Neto.
PDDS = Principles of Distributed Database Systems, by Tamer Ozsu and Patrick Valduriez.
(See Course Information for complete details of these books.)

Date Topics Notes Readings
4/4/2001 Introduction, Inverted indexes, Issues in building such indexes, Course administrivia [powerpoint]
[pdf (large)]
[pdf (small)]
MG Ch. 3, MIR Ch. 7.2
Porter's stemmer
Shakespeare plays
4/9/2001 Inverted index storage, Boolean queries, Wild-card queries, Positional/phrase queries, Evaluating IR systems [powerpoint]
[pdf (large)]
[pdf (small)]
MG Ch. 4, MIR Ch. 3
Princeton Wordnet
4/11/2001 Section: IR project, VDK software [powerpoint]
[html]
Installing VDK course page
4/16/2001 Index construction, Dynamic indices (updating), Term weighting, Vector space indices [powerpoint]
[pdf (large)]
[pdf (small)]
MG Ch. 5, MIR Ch. 2.5.3
4/18/2001 Computing cosine-based ranking, Speeding up cosine ranking (Sampling and pre-grouping, Latent semantic indexing, Random projection) [powerpoint]
[pdf (large)]
[pdf (small)]
MG Ch. 4.6, MIR Ch. 2.7.2
Random projection theorem
Faster random projection
Latent semantic indexing
4/18/2001 Section: IR project 2, VDK software [powerpoint]
[html]
none
4/23/2001 Generalized query operators, Bayesian nets for text retrieval, Structured + unstructured queries [powerpoint]
[pdf (large)]
[pdf (small)]
MIR Ch. 2.6, 2.8
Bayesian Resources
4/25/2001 Link-based ranking in web search engines [powerpoint]
[pdf (large)]
[pdf (small)]
MIR Ch. 13
Anatomy of a large-scale hypertextual web search engine
Authoritative sources in a hyperlinked environment
Hypersearching the Web
Dubhashi resource collection covering recent topics
4/30/2001 Rest of web ranking, Peer-to-peer search, Search deployment models, Review of search topics [powerpoint]
[pdf (large)]
[pdf (small)]
MIR Ch. 9
5/7/2001 Document Clustering [powerpoint]
[pdf (large)]
[pdf (small)]
Yale results clustering demo
5/9/2001 Automatic document classification [powerpoint]
[pdf (large)]
[pdf (small)]
Resources for the lecture
5/14/2001 Centroid/Nearest-neighbor classification, Bayesian classification, Link-based classification, Document summarization [powerpoint]
[pdf (large)]
[pdf (small)]
Enhanced hypertext categorization using hyperlinks
Using lexical chains for text summarization
5/16/2001 Link-based clustering, Enumerative clustering/trawling, Recommendation systems [powerpoint]
[pdf (large)]
[pdf (small)]
Hypertext clustering: Clustering hypertext with applications to Web search
Duplicate detection: Syntactic clustering of the Web
A priori algorithm: Fast algorithms for mining association rules
Trawling: Trawling emerging cyber-communities automatically
5/21/2001 Web characterization; Research problems [powerpoint]
[pdf (large)]
[pdf (small)]
Broder et al. Graph structure in the Web
Albert, Jeong, and Barabasi. Diameter of the World Wide Web
Faloutsos et al. On Power Law relationships of the Internet Topology
5/23/2001 Distributed databases - Introductory topics; Fragmentation; Allocation [powerpoint]
[pdf (large)]
[pdf (small)]
PDDS Ch. 5
5/30/2001 Query processing in distributed databases - localization, distributed query operators, optimization [powerpoint]
[pdf (large)]
[pdf (small)]
PDDS Ch. 7, 8, and 9
6/4/2001 Concurrency Control (Schedules, Serializability, Locking, Timestamp control); Reliability (Failure models, 2-phase commit) [powerpoint]
[pdf (large)]
[pdf (small)]
Concurrency Control and Recovery in Database Systems
Ch. 9 of CS245 textbook (Database System Implementation)
6/6/2001 Reliability (3-phase commit, Majority 3PC); Network partitions [powerpoint]
[pdf (large)]
[pdf (small)]
Concurrency Control and Recovery in Database Systems
6/6/2001 Review session
IR Part II: [ppt] [pdf]
Dist. DB: [ppt] [pdf]
You are also responsible for the material covered before the midterm.
The midterm review slides (from the end of lecture 7) are:
IR Part I: [ppt] [pdf]