Data Mining Techniques for Structured and Semistructured Data

Abstract

Data mining is the application of sophisticated analysis to large amounts of data in order to discover new knowledge in the form of patterns, trends, and associations. With the advent of the World Wide Web, the amount of data stored and accessible electronically has grown tremendously, and knowledge discovery (data mining) from this data has become very important for the business and scientific-research communities alike.

In my talk, I will present Query Flocks, a general framework over relational data that enables the declarative formulation, systematic optimization, and efficient processing of a large class of mining queries. In Query Flocks, each mining problem is expressed as a datalog query with parameters and a filter condition. In the optimization phase, a query flock is transformed into a sequence of simpler queries that can be executed efficiently. As a proof of concept, I have integrated Query Flocks with a conventional database system and will report on the performance results.
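To make the idea of a parameterized query with a filter more concrete, here is a minimal sketch in Python rather than in datalog or SQL. It assumes the familiar market-basket problem as the mining query, written as the flock "answer(B) :- baskets(B, $1) AND baskets(B, $2)" with the filter "COUNT(answer.B) >= support". The relation contents, the function names, and the choice of an a-priori-style decomposition (first restrict each parameter to frequent single items, then count pairs) are assumptions made for illustration; they are not the system or the optimizer reported on in the talk.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy relation baskets(basket_id, item).
BASKETS = [
    (1, "beer"), (1, "diapers"), (1, "milk"),
    (2, "beer"), (2, "diapers"),
    (3, "beer"), (3, "milk"),
    (4, "diapers"), (4, "milk"),
    (5, "beer"), (5, "diapers"),
]


def frequent_items(baskets, support):
    """Simpler one-parameter query: answer(B) :- baskets(B, $1),
    keeping values of $1 that occur in at least `support` baskets."""
    baskets_per_item = defaultdict(set)
    for basket_id, item in baskets:
        baskets_per_item[item].add(basket_id)
    return {item for item, ids in baskets_per_item.items() if len(ids) >= support}


def frequent_pairs(baskets, support):
    """Two-parameter flock: answer(B) :- baskets(B, $1) AND baskets(B, $2),
    with filter COUNT(answer.B) >= support.

    The flock is evaluated as a sequence of simpler queries: items that
    fail the filter on their own are pruned before any pair is counted.
    """
    keep = frequent_items(baskets, support)

    # Group the surviving items by basket.
    items_per_basket = defaultdict(set)
    for basket_id, item in baskets:
        if item in keep:
            items_per_basket[basket_id].add(item)

    # Count the baskets that contain each candidate pair.
    baskets_per_pair = defaultdict(set)
    for basket_id, items in items_per_basket.items():
        for pair in combinations(sorted(items), 2):
            baskets_per_pair[pair].add(basket_id)

    return {pair: len(ids) for pair, ids in baskets_per_pair.items()
            if len(ids) >= support}


if __name__ == "__main__":
    print(frequent_pairs(BASKETS, support=3))
```

The pruning step is safe for this filter because a pair cannot appear in at least `support` baskets unless each of its items does, so the two-step evaluation returns the same pairs as counting every pair directly.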

While the Query Flocks framework is well suited for relational data, it has limited use for semistructured data, i.e., nested data with implicit and/or irregular structure, such as web pages. The lack of an explicit fixed schema makes semistructured data easy to generate or extract but hard to browse and query. In my talk, I will present methods for structure discovery in semistructured data that alleviate this problem. The discovered structure can be of varying precision and complexity. In particular, I will present an algorithm for deriving a schema-by-example and an algorithm for extracting an approximate schema in the form of a datalog program.


Svetlozar Nestorov
Last modified: Fri Sep 3 16:17:43 PDT