Extracting Structured Data from Web Pages: Experiments

Introduction

Many web sites contain large collections of pages generated using a common template or layout. For example, Amazon lays out the author, title, comments, etc. in the same way on all of its book pages. The values used to generate the pages (e.g., the author, title, ...) typically come from a database. We have studied the problem of automatically extracting the database values from such a collection of web pages, without any human input. Please follow this link for the paper discussing the techniques that we have developed for this problem. This page contains the experimental results of applying our techniques to real web page collections. Some of the collections that we used in our experiments were obtained from the RoadRunner project, which tries to solve a similar problem. The other collections were manually crawled from well-known data-rich sites like eBay and Netflix.

Organization of Experimental Results

We briefly describe how the experimental results are organized. For each input collection of web pages that we used, we present the following information:

Source Pages: The source pages in the collection.
Extracted Template: The template deduced by our system for the collection.
Extracted Schema: The schema deduced by our system.
Extracted Data: The data encoded in each page that is extracted by our system.
Equivalence Classes: Equivalence classes are sets of words that are used by our system to construct the template. Please refer to the paper for the definition of equivalence classes.
Manual Schema: The schema that we deduced manually using the semantics of the information in the pages. This is used for evaluating the system.

Formats

The extracted schema, values and template are output by our system in XML.

Schema

The following text illustrates how we encode a schema in XML.

<schema id="1">
  <tuple id="2" order="2">
    <basic id="3"/>
    <set id="4">
      <basic id="5"/>
    </set>
  </tuple>
</schema>

The schema represented by the above XML text is a tuple with two attributes: the first attribute is of basic type (string), and the second attribute is a set of basic type. Each element has a unique id attribute.
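As a rough illustration of how such a schema could be read programmatically, the following Python sketch parses the XML above with the standard library and prints a compact description of the type. The function names (describe, ids_by_type) and the bracket notation used in the description are assumptions made for this example; they are not part of our system.

```python
import xml.etree.ElementTree as ET

# The example schema from the text above.
SCHEMA_XML = """
<schema id="1">
  <tuple id="2" order="2">
    <basic id="3"/>
    <set id="4">
      <basic id="5"/>
    </set>
  </tuple>
</schema>
"""


def describe(node):
    """Return a compact string for the type rooted at `node` (hypothetical notation)."""
    children = ", ".join(describe(child) for child in node)
    if node.tag == "basic":
        return "string"
    if node.tag == "set":
        return "{" + children + "}"
    if node.tag == "tuple":
        return "<" + children + ">"
    return children  # the <schema> wrapper simply contains the root type


def ids_by_type(root):
    """Map every id attribute to the tag of the element that carries it."""
    return {el.get("id"): el.tag for el in root.iter()}


schema_root = ET.fromstring(SCHEMA_XML)
schema_description = describe(schema_root)
schema_ids = ids_by_type(schema_root)
print(schema_description)
print(schema_ids)
```

The id-to-tag mapping is the piece that the value and template encodings rely on, since both refer to schema types by id.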

Value

The following example value, which is an instance of the schema above, illustrates how we encode a value in XML.

<value instanceof="1">
  <value instanceof="2">
    <value instanceof="3">
      <![CDATA[What is Mathematics]]>
    </value>
    <value instanceof="4">
      <value instanceof="5">
        <![CDATA[Courant]]>
      </value>
      <value instanceof="5">
        <![CDATA[Robbins]]>
      </value>      
    </value>
  </value>
</value>

The instanceof attribute of a <value> element holds the id of the schema type of which that value is an instance.
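The following Python sketch shows one way a reader might turn such a value into a nested native structure by looking up each instanceof id in the schema. It is only an illustration under stated assumptions: the decode function is hypothetical, tuples are mapped to Python tuples, and sets are mapped to lists so that the order in which the elements appear is preserved.

```python
import xml.etree.ElementTree as ET

# The example schema and value from the text above.
SCHEMA_XML = """
<schema id="1">
  <tuple id="2" order="2">
    <basic id="3"/>
    <set id="4">
      <basic id="5"/>
    </set>
  </tuple>
</schema>
"""

VALUE_XML = """
<value instanceof="1">
  <value instanceof="2">
    <value instanceof="3">
      <![CDATA[What is Mathematics]]>
    </value>
    <value instanceof="4">
      <value instanceof="5">
        <![CDATA[Courant]]>
      </value>
      <value instanceof="5">
        <![CDATA[Robbins]]>
      </value>
    </value>
  </value>
</value>
"""


def decode(value_node, type_of):
    """Convert a <value> element into nested Python data (hypothetical helper)."""
    kind = type_of[value_node.get("instanceof")]
    if kind == "basic":
        # The parser merges CDATA into ordinary text; strip the indentation.
        return (value_node.text or "").strip()
    children = [decode(child, type_of) for child in value_node]
    if kind == "set":
        return children  # assumption: a list, to keep the element order
    if kind == "tuple":
        return tuple(children)
    return children[0]  # the <schema> wrapper holds a single root value


# Map each schema id to the kind of type it identifies.
type_of = {el.get("id"): el.tag for el in ET.fromstring(SCHEMA_XML).iter()}
decoded = decode(ET.fromstring(VALUE_XML), type_of)
print(decoded)
```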

Template

The following example template for the schema above illustrates how we represent a template in XML.

<template schema="1">
  <start-string context="2">
    <![CDATA[<html> <body> Book:]]>
  </start-string>
  <start-string context="5">
    <![CDATA[Author:]]>
  </start-string>
  <end-string context="2">
    <![CDATA[</body> </html>]]>
  </end-string>
</template>

The encoding of the value above using the template above results in the following page:

<html>
  <body>
    Book: What is Mathematics
    Author: Courant
    Author: Robbins
  </body>
</html>

A template is just a set of optional start-strings and end-strings associated with each type in the schema. The context attribute in the <start-string> and <end-string> elements identifies the type in the schema that the element is associated with. In an encoded page, the "start-string" occurs before the encoding of a sub-value of the type that it is associated with, and the "end-string" occurs after it. The above representation of the template is equivalent to our definition of a template in the paper.
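The encoding rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by our system: the helper names (load_template, encode) are hypothetical, and the sketch assumes that the pieces of a page are joined by single spaces, so its output matches the example page only up to whitespace.

```python
import xml.etree.ElementTree as ET

# The example template and value from the text above.
TEMPLATE_XML = """
<template schema="1">
  <start-string context="2">
    <![CDATA[<html> <body> Book:]]>
  </start-string>
  <start-string context="5">
    <![CDATA[Author:]]>
  </start-string>
  <end-string context="2">
    <![CDATA[</body> </html>]]>
  </end-string>
</template>
"""

VALUE_XML = """
<value instanceof="1">
  <value instanceof="2">
    <value instanceof="3"><![CDATA[What is Mathematics]]></value>
    <value instanceof="4">
      <value instanceof="5"><![CDATA[Courant]]></value>
      <value instanceof="5"><![CDATA[Robbins]]></value>
    </value>
  </value>
</value>
"""


def load_template(xml_text):
    """Return (start, end) dictionaries keyed by the schema type id."""
    start, end = {}, {}
    for el in ET.fromstring(xml_text):
        text = (el.text or "").strip()
        if el.tag == "start-string":
            start[el.get("context")] = text
        elif el.tag == "end-string":
            end[el.get("context")] = text
    return start, end


def encode(value_node, start, end):
    """Emit start-string, then the sub-value encodings, then end-string."""
    type_id = value_node.get("instanceof")
    parts = [start.get(type_id, "")]
    if len(value_node) == 0:
        # A leaf value contributes its own text.
        parts.append((value_node.text or "").strip())
    for child in value_node:
        parts.extend(encode(child, start, end))
    parts.append(end.get(type_id, ""))
    return [p for p in parts if p]  # drop the optional strings that are absent


start, end = load_template(TEMPLATE_XML)
page = " ".join(encode(ET.fromstring(VALUE_XML), start, end))
print(page)
```

Under these assumptions, the generated string contains the same sequence of tokens as the example page shown above; only the line breaks and indentation differ.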

Experimental Results

The following are the links to the experimental results on the various collections.