Large-Scale Copy Detection

My thesis work focuses on large-scale copy detection of digital objects such as textual documents, audio, and video on the World Wide Web. The web is a content publisher's nightmare come true: any small-time cyber-pirate can currently make copies of music CDs and books available on the web in digital format to a large audience at virtually no cost. In my thesis, I focus on building a copy detection system (CDS) into which content publishers register their valuable digital content. The CDS then crawls the web, compares the web content to the registered content, and notifies the content owners of illegal copies. To address the key challenges in building such a system, I have developed a core architecture that can be used to build a CDS for a variety of data types. As a proof of concept, I have built two prototype CDSs: (1) SCAM (Stanford Copy Analysis Mechanism), for finding textual copies on the web, and (2) FRAUD (Finding Replicas of AUDio), for finding audio copies on the web.
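The register-and-compare steps described above can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the SCAM or FRAUD implementation: the class name `ToyCDS`, the helper `shingles`, the use of overlapping three-word chunks, and the 0.5 threshold are all hypothetical choices made for illustration. Crawling and owner notification are left out.

```python
import re
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of overlapping k-word chunks in text (lowercased)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


class ToyCDS:
    """Hypothetical registry for illustration; not the actual SCAM design."""

    def __init__(self, k=3):
        self.k = k
        self.index = defaultdict(set)  # chunk -> ids of registered documents
        self.sizes = {}                # document id -> number of chunks

    def register(self, doc_id, text):
        """Index a publisher's document by its word chunks."""
        chunks = shingles(text, self.k)
        self.sizes[doc_id] = len(chunks)
        for chunk in chunks:
            self.index[chunk].add(doc_id)

    def check(self, page_text, threshold=0.5):
        """Return {doc_id: fraction of that document's chunks found in the page}
        for every registered document at or above the threshold."""
        hits = defaultdict(int)
        for chunk in shingles(page_text, self.k):
            for doc_id in self.index.get(chunk, ()):
                hits[doc_id] += 1
        return {d: n / self.sizes[d] for d, n in hits.items()
                if n / self.sizes[d] >= threshold}
```

In this sketch a publisher would call `register` once per document, and each crawled page would be passed to `check`. Any document ids returned are candidates for notifying the owner.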

SCAM was successfully used in May 1995 to find several instances of plagiarism in conference papers and journal articles. Click here for details.

Here is a little blurb that gives a 2-page overview of my thesis research. The following papers are detailed technical notes on various problems we attacked as part of my thesis.

    Invited papers

  1. Safeguarding and Charging for Information on the Internet
    H. Garcia-Molina, S. P. Ketchpel, N. Shivakumar
    International Conference on Data Engineering (ICDE'98)

  2. The SCAM Approach To Copy Detection in Digital Libraries
    N. Shivakumar, H. Garcia-Molina
    D-Lib Magazine, November 1995.

    Conference Publications

  3. Computing Iceberg Queries Efficiently
    M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, J.D. Ullman
    Proceedings of 1998 International Conference on Very Large Databases (VLDB'98), New York, August 1998.

  4. Filtering with Approximate Predicates
    N. Shivakumar, H. Garcia-Molina, C.S. Chekuri
    Proceedings of 1998 International Conference on Very Large Databases (VLDB'98), New York, August 1998.

  5. Finding near-replicas of documents on the web
    N. Shivakumar, H. Garcia-Molina
    Proceedings of Workshop on Web Databases (WebDB'98), held in conjunction with EDBT'98, Mar 1998.

  6. Wave Indices: Indexing Evolving Databases
    N. Shivakumar, H. Garcia-Molina
    Proceedings of 1997 ACM International Conference on Management of Data (SIGMOD'97), Tucson, Arizona, May '97.

  7. dSCAM: Finding Document Copies Across Multiple Databases
    H. Garcia-Molina, L. Gravano, N. Shivakumar
    Proceedings of 4th International Conference on Parallel and Distributed Information Systems (PDIS'96), Miami Beach, Dec '96.

  8. Building a Scalable and Accurate Copy Detection Mechanism
    N. Shivakumar, H. Garcia-Molina
    Proceedings of 1st ACM Conference on Digital Libraries (DL'96), Bethesda, Maryland, Mar '96.

  9. SCAM: A Copy Detection Mechanism for Digital Documents
    N. Shivakumar, H. Garcia-Molina
    Proceedings of 2nd International Conference on Theory and Practice of Digital Libraries (DL'95), Austin, Texas, June '95.


In Print

Check out these articles written about my SCAM work.