Integrating Data Mining with Relational DBMS: A Tightly-Coupled Approach
Data mining is rapidly finding its way into
mainstream computing. The development of generic methods such as
itemset counting has opened the area to academic inquiry and has
resulted in a large harvest of research results. While the mined
datasets are often in relational format, most mining systems do not
use a relational DBMS. Thus, they miss the opportunity to leverage the
database technology developed in the last couple of decades. In this
paper, we propose a data mining architecture, based on the query flock
framework, that is tightly coupled with an RDBMS. To achieve optimal
performance, we transform a complex data mining query into a sequence
of simpler queries that can be executed efficiently at the DBMS. We
present a class of levelwise algorithms that generate such
transformations for a large class of data mining queries. We also
present some experimental results that validate the viability of our
approach.
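
The idea of replacing one complex mining query with a sequence of simpler
queries can be illustrated with a minimal sketch. It is not the algorithm
from the paper: it assumes a hypothetical table baskets(bid, item), uses
SQLite as a stand-in DBMS, and handles only the pair-counting case. The
first query keeps the items that are frequent on their own; the second
counts pairs built from those items only.

```python
import sqlite3


def frequent_pairs(baskets, min_support):
    """Return (item_a, item_b, count) for pairs found in >= min_support baskets.

    Table and column names are hypothetical, chosen for illustration only.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE baskets (bid INTEGER, item TEXT)")
    conn.executemany(
        "INSERT INTO baskets VALUES (?, ?)",
        [(bid, item) for bid, items in enumerate(baskets) for item in set(items)],
    )

    # Step 1 (simpler query): items that reach the support threshold alone.
    conn.execute("CREATE TABLE frequent_items (item TEXT)")
    conn.execute(
        "INSERT INTO frequent_items "
        "SELECT item FROM baskets GROUP BY item HAVING COUNT(*) >= ?",
        (min_support,),
    )

    # Step 2: count pairs, but only among items that survived step 1.
    rows = conn.execute(
        "SELECT a.item, b.item, COUNT(*) "
        "FROM baskets a JOIN baskets b ON a.bid = b.bid AND a.item < b.item "
        "WHERE a.item IN (SELECT item FROM frequent_items) "
        "  AND b.item IN (SELECT item FROM frequent_items) "
        "GROUP BY a.item, b.item HAVING COUNT(*) >= ? "
        "ORDER BY a.item, b.item",
        (min_support,),
    ).fetchall()
    conn.close()
    return rows
```

The pre-filter is safe because a pair cannot be frequent unless both of its
items are frequent, so step 2 returns the same pairs as an unfiltered join
while the join itself touches fewer rows.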
Representative Objects: Concise Representations of Semistructured, Hierarchical Data
In this paper, we introduce the representative object, which uncovers the
inherent schema(s) in semistructured, hierarchical data sources and
provides a concise description of the structure of the data.
Semistructured data, unlike data stored in typical relational or
object-oriented databases, does not have a fixed schema that is known in
advance and stored separately from the data. With the rapid growth of
the World Wide Web, semistructured hierarchical data sources are
becoming widely available to the casual user. The lack of external
schema information currently makes browsing and querying these data
sources inefficient at best, and impossible at worst. We show how
representative objects make schema discovery efficient and
facilitate the generation of meaningful queries over the data.
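
To give a rough sense of what a concise structural description might look
like, the sketch below collects the distinct label paths that occur in a
nested object. This is only an assumed simplification for illustration, not
the representative object construction from the paper; the function name
label_paths and the use of nested dicts and lists as the data model are
hypothetical.

```python
def label_paths(obj, prefix=()):
    """Return the set of distinct label paths found in nested dicts and lists."""
    paths = set()
    if isinstance(obj, dict):
        for label, child in obj.items():
            path = prefix + (label,)
            paths.add(path)
            paths |= label_paths(child, path)
    elif isinstance(obj, list):
        # List elements share the label of their parent, so the prefix is unchanged.
        for child in obj:
            paths |= label_paths(child, prefix)
    return paths
```

A user browsing an unfamiliar source could consult such a summary to see
which labels can follow a given path before writing a query.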
Inferring Structure in Semistructured Data
When dealing with semistructured data such as that available on the
Web, it becomes important to infer the inherent structure, both for
the user (e.g., to facilitate querying) and for the system (e.g., to
optimize access). In this paper, we consider the problem of
identifying some underlying structure in large collections of
semistructured data. Since we expect the data to be fairly irregular,
this structure consists of an approximate classification of objects
into a hierarchical collection of types. We propose a notion of a
type hierarchy for such data and outline a method for deriving the
type hierarchy, together with rules for assigning types to data elements.
Extracting Schema from Semistructured Data
Semistructured data is characterized by the lack of any
fixed and rigid schema, although typically the data has some implicit
structure. While the lack of a fixed schema makes extracting semistructured
data fairly easy and an attractive goal, presenting and querying such
data is greatly impaired. Thus, a critical problem is the discovery of
the structure implicit in semistructured data and, subsequently, the
recasting of the raw data in terms of this structure.
In this paper, we consider a very general form of semistructured data
based on labeled, directed graphs. We show that such data can be typed
using the greatest fixpoint semantics of monadic datalog programs.
We present an algorithm for approximate typing of semistructured data.
We establish that the general problem of finding such an optimal typing
is NP-hard, but present some heuristics and techniques based on clustering
that allow efficient and near-optimal treatment of the problem. We also
present some preliminary experimental results.
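
The greatest fixpoint idea can be sketched on a small labeled, directed
graph. The rule format below is an assumed simplification, not the paper's
formalism: each type lists (label, target_type) requirements, where a
target_type of None accepts any target. Every node starts in every type, and
nodes are removed until no rule is violated.

```python
def greatest_fixpoint_typing(edges, rules):
    """Compute type extents under greatest fixpoint semantics.

    edges: list of (source, label, target) triples.
    rules: dict mapping a type name to a list of (label, target_type)
           requirements; target_type None means any target is accepted.
    The rule format is a hypothetical simplification for illustration.
    """
    nodes = {n for src, _, dst in edges for n in (src, dst)}
    # Start from the largest candidate: every node belongs to every type.
    extent = {t: set(nodes) for t in rules}
    changed = True
    while changed:
        changed = False
        for t, body in rules.items():
            for node in list(extent[t]):
                satisfied = all(
                    any(
                        src == node
                        and lab == label
                        and (target is None or dst in extent[target])
                        for src, lab, dst in edges
                    )
                    for label, target in body
                )
                if not satisfied:
                    extent[t].discard(node)
                    changed = True
    return extent
```

With a rule such as "a person has a name and a friend who is a person", two
objects that reference each other as friends both keep the type here,
whereas a least fixpoint computation would start from the empty set and
never admit either of them.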