Paper Abstracts

Query Flocks: A Generalization of Association-Rule Mining
Association-rule mining has proved a highly successful technique for extracting useful information from very large databases. This success is attributed not only to the appropriateness of the objectives, but also to the fact that a number of new query-optimization ideas, such as the "a-priori" trick, make association-rule mining run much faster than might be expected. In this paper we see that the same tricks can be extended to a much more general context, allowing efficient mining of very large databases for many different kinds of patterns. The general idea, called "query flocks," is a generate-and-test model for data-mining problems. We show how the idea can be used either in a general-purpose mining system or in a next generation of conventional query optimizers.
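
As an illustration of the "a-priori" trick mentioned in the abstract above, here is a minimal Python sketch for the classic market-basket case. It is not taken from the paper: the function name `frequent_pairs`, the `baskets` input format, and the `min_support` parameter are assumptions made for this example. The idea it shows is that a pair of items can only be frequent if both of its items are frequent, so a cheap first pass can prune most candidate pairs before the expensive second pass.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return {(item_a, item_b): count} for pairs in at least min_support baskets.

    Hypothetical helper for illustration only; `baskets` is assumed to be
    an iterable of item lists.
    """
    baskets = [set(basket) for basket in baskets]

    # Pass 1: count single items and keep only the frequent ones.
    item_counts = Counter(item for basket in baskets for item in basket)
    frequent_items = {i for i, c in item_counts.items() if c >= min_support}

    # Pass 2: count only pairs whose items both survived pass 1.
    # This is the a-priori pruning step: any pair containing an
    # infrequent item cannot itself be frequent.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(basket & frequent_items)
        pair_counts.update(combinations(kept, 2))

    return {p: c for p, c in pair_counts.items() if c >= min_support}
```

For example, with the four baskets `["beer", "diapers", "milk"]`, `["beer", "diapers"]`, `["milk", "bread"]`, and `["beer", "diapers", "bread"]` and a support threshold of 3, only `beer` and `diapers` survive the first pass, so the second pass counts a single candidate pair.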

Integrating Data Mining with Relational DBMS: A Tightly-Coupled Approach
Data mining is rapidly finding its way into mainstream computing. The development of generic methods such as itemset counting has opened the area to academic inquiry and has resulted in a large harvest of research results. While the mined datasets are often in relational format, most mining systems do not use a relational DBMS. Thus, they miss the opportunity to leverage the database technology developed in the last couple of decades. In this paper, we propose a data mining architecture, based on the query flock framework, that is tightly coupled with an RDBMS. To achieve optimal performance we transform a complex data mining query into a sequence of simpler queries that can be executed efficiently at the DBMS. We present a class of levelwise algorithms that generate such transformations for a large class of data mining queries. We also present some experimental results that validate the viability of our approach.
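
The transformation of one complex mining query into a sequence of simpler queries can be sketched with Python's built-in `sqlite3` module. This is only an assumed illustration of the general idea, not the algorithms from the paper: the table names `baskets` and `ok_items`, the column names, and the function `frequent_pairs_sql` are all hypothetical. The first query finds the frequent single items and stores them; the second query then joins only over those items.

```python
import sqlite3


def frequent_pairs_sql(rows, min_support):
    """Find item pairs that occur together in at least min_support baskets.

    `rows` is assumed to be a list of (basket_id, item) tuples.
    All table and column names here are hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE baskets (bid INTEGER, item TEXT)")
    conn.executemany("INSERT INTO baskets VALUES (?, ?)", rows)

    # Step 1 (simple query): materialize the items that are frequent
    # on their own.
    conn.execute("CREATE TABLE ok_items (item TEXT PRIMARY KEY)")
    conn.execute(
        "INSERT INTO ok_items "
        "SELECT item FROM baskets GROUP BY item "
        "HAVING COUNT(DISTINCT bid) >= ?",
        (min_support,),
    )

    # Step 2: run the pair query, restricted to the items kept in step 1,
    # so the self-join touches far fewer rows.
    cursor = conn.execute(
        "SELECT a.item, b.item, COUNT(DISTINCT a.bid) "
        "FROM baskets a JOIN baskets b "
        "  ON a.bid = b.bid AND a.item < b.item "
        "WHERE a.item IN (SELECT item FROM ok_items) "
        "  AND b.item IN (SELECT item FROM ok_items) "
        "GROUP BY a.item, b.item "
        "HAVING COUNT(DISTINCT a.bid) >= ? "
        "ORDER BY a.item, b.item",
        (min_support,),
    )
    result = cursor.fetchall()
    conn.close()
    return result
```

Each step is an ordinary SQL query, so in this sketch the database engine does all of the counting and joining and the driver code only sequences the steps.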

Representative Objects: Concise Representations of Semistructured, Hierarchical Data
In this paper we introduce the representative object, which uncovers the inherent schema(s) in semistructured, hierarchical data sources and provides a concise description of the structure of the data. Semistructured data, unlike data stored in typical relational or object-oriented databases, does not have a fixed schema that is known in advance and stored separately from the data. With the rapid growth of the World Wide Web, semistructured hierarchical data sources are becoming widely available to the casual user. The lack of external schema information currently makes browsing and querying these data sources inefficient at best, and impossible at worst. We show how representative objects make schema discovery efficient and facilitate the generation of meaningful queries over the data.

Inferring Structure in Semistructured Data
When dealing with semistructured data such as that available on the Web, it becomes important to infer the inherent structure, both for the user (e.g., to facilitate querying) and for the system (e.g., to optimize access). In this paper, we consider the problem of identifying some underlying structure in large collections of semistructured data. Since we expect the data to be fairly irregular, this structure consists of an approximate classification of objects into a hierarchical collection of types. We propose a notion of a type hierarchy for such data and outline a method for deriving the type hierarchy, along with rules for assigning types to data elements.

Extracting Schema from Semistructured Data
Semistructured data is characterized by the lack of any fixed and rigid schema, although typically the data has some implicit structure. While the lack of a fixed schema makes extracting semistructured data fairly easy and an attractive goal, presenting and querying such data is greatly impaired. Thus, a critical problem is the discovery of the structure implicit in semistructured data and, subsequently, the recasting of the raw data in terms of this structure. In this paper, we consider a very general form of semistructured data based on labeled, directed graphs. We show that such data can be typed using the greatest fixpoint semantics of monadic datalog programs. We present an algorithm for approximate typing of semistructured data. We establish that the general problem of finding an optimal such typing is NP-hard, but present some heuristics and techniques based on clustering that allow efficient and near-optimal treatment of the problem. We also present some preliminary experimental results.
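
To make the phrase "greatest fixpoint semantics" concrete, here is a small Python sketch under simplifying assumptions. It is not the algorithm from the paper and does not cover approximate typing or clustering. The rule format is an assumption made for this example: each type is defined by a list of `(label, target_type)` requirements, meaning an object belongs to the type only if, for every requirement, it has an outgoing edge with that label to some object of the target type. The function name `greatest_fixpoint_typing` is hypothetical.

```python
def greatest_fixpoint_typing(objects, edges, type_rules):
    """Compute the largest assignment of objects to types consistent with the rules.

    objects:    iterable of object ids
    edges:      set of (source, label, destination) triples (a labeled, directed graph)
    type_rules: {type_name: [(label, target_type), ...]}  (assumed rule format)
    """
    # Start from the largest candidate: every object belongs to every type.
    extent = {t: set(objects) for t in type_rules}

    changed = True
    while changed:
        changed = False
        for type_name, requirements in type_rules.items():
            for obj in list(extent[type_name]):
                satisfied = all(
                    any(
                        src == obj and lab == label and dst in extent[target]
                        for (src, lab, dst) in edges
                    )
                    for (label, target) in requirements
                )
                if not satisfied:
                    # Shrink the extent; repeat until nothing more is removed.
                    extent[type_name].discard(obj)
                    changed = True
    return extent
```

Starting from the full assignment and only removing objects is what lets mutually recursive types work. With a `person` type that requires a `worksfor` edge to a `company` and a `company` type that requires an `employs` edge to a `person`, a bottom-up (least fixpoint) evaluation would never admit any object to either type, while the greatest fixpoint keeps every object whose requirements are mutually supported.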

Svetlozar Nestorov

Last modified: Wed Sep 1 14:29:57 PDT