Parallel Massive Clustering of Discrete Distributions
Yu Zhang, James Z. Wang, and Jia Li
The Pennsylvania State University, University Park, PA 16802
Abstract:
The trend of analyzing big data in artificial intelligence demands
highly scalable machine learning algorithms, among which clustering is
a fundamental and arguably the most widely applied method. To extend
the applications of regular vector-based clustering algorithms, the
Discrete Distribution (D2) clustering algorithm has been developed,
aiming at clustering data represented by bags of weighted vectors,
which are widely adopted data signatures in many emerging information
retrieval and multimedia learning applications. However, the high
computational complexity of D2-clustering limits its impact in solving
massive learning problems. Here we present the parallel D2-clustering
(PD2-clustering) algorithm with substantially improved scalability. We
develop a hierarchical multi-pass algorithm structure for parallel
computing in order to achieve a balance between the individual-node
computation and the integration process of the algorithm. Experiments
and extensive comparisons between PD2-clustering and other clustering
algorithms are conducted on synthetic datasets. The results show that
the proposed parallel algorithm achieves significant speed-up with
minor accuracy loss. We apply PD2-clustering to image concept
learning. In addition, by extending D2-clustering to symbolic data, we
apply PD2-clustering to protein sequence clustering. For both
applications, we demonstrate the high competitiveness of our new
algorithm in comparison with other state-of-the-art methods.
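
As a rough illustration of the hierarchical multi-pass idea mentioned in
the abstract, the sketch below splits the data into chunks, clusters each
chunk independently (the per-node work), pools the resulting
representatives (the integration step), and repeats until the data fit
in a single chunk. This is only one possible reading of that structure,
not the authors' implementation. The names multi_pass_cluster, chunks,
and toy_cluster are hypothetical, and toy_cluster is a trivial stand-in
on scalars rather than D2-clustering on bags of weighted vectors.

```python
from concurrent.futures import ThreadPoolExecutor


def chunks(items, size):
    """Split items into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def multi_pass_cluster(items, cluster_fn, chunk_size, k):
    """Hypothetical driver: reduce `items` pass by pass until one chunk remains.

    cluster_fn(chunk, k) is assumed to return at most min(k, len(chunk))
    representatives of the chunk.
    """
    if chunk_size <= k:
        raise ValueError("chunk_size must exceed k so each pass shrinks the data")
    level = list(items)
    with ThreadPoolExecutor() as pool:
        while len(level) > chunk_size:
            parts = chunks(level, chunk_size)
            # Per-node work: every chunk is clustered independently.
            results = pool.map(lambda part: cluster_fn(part, k), parts)
            # Integration: pool the local representatives for the next pass.
            level = [rep for reps in results for rep in reps]
    # Final pass over the remaining representatives.
    return cluster_fn(level, k)


def toy_cluster(values, k):
    """Toy stand-in (NOT D2-clustering): sort scalars, cut into at most k
    contiguous groups, and return the mean of each group."""
    if not values:
        return []
    values = sorted(values)
    k = min(k, len(values))
    size = -(-len(values) // k)  # ceiling division
    return [sum(group) / len(group) for group in chunks(values, size)]


if __name__ == "__main__":
    data = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2, 0.05, 10.05]
    print(multi_pass_cluster(data, toy_cluster, chunk_size=4, k=2))
```

The chunk size is the knob that trades per-chunk computation against the
number of integration passes: larger chunks mean fewer passes but more
work inside each call to cluster_fn.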
Full Paper in Color (PDF, 3.4MB)
Full Paper from the ACM (link)
On-line Info (more to be made available)
Citation:
Yu Zhang, James Z. Wang and Jia Li, ``Parallel Massive Clustering of
Discrete Distributions,'' ACM Transactions on Multimedia Computing,
Communications and Applications, vol. 11, no. 4, article 49,
pp. 49:1-24 and appendix:1-6, April 2015.
Copyright 2015 ACM. Personal use of this material is permitted.
However, permission to reprint/republish this
material for advertising or promotional purposes or for creating new
collective works for resale or redistribution to servers or lists, or
to reuse any copyrighted component of this work in other works, must
be obtained from the ACM.
Last Modified: June 5, 2015