Parallel Massive Clustering of Discrete Distributions

Yu Zhang, James Z. Wang, and Jia Li
The Pennsylvania State University, University Park, PA 16802
Abstract:

The trend of analyzing big data in artificial intelligence demands highly scalable machine learning algorithms, among which clustering is a fundamental and arguably the most widely applied method. To extend the applications of regular vector-based clustering algorithms, the Discrete Distribution (D2) clustering algorithm has been developed, aiming at clustering data represented by bags of weighted vectors, which are widely adopted data signatures in many emerging information retrieval and multimedia learning applications. However, the high computational complexity of D2-clustering limits its impact in solving massive learning problems. Here we present the parallel D2-clustering (PD2-clustering) algorithm with substantially improved scalability. We developed a hierarchical multi-pass algorithm structure for parallel computing in order to balance the individual-node computation against the integration process of the algorithm. Experiments and extensive comparisons between PD2-clustering and other clustering algorithms are conducted on synthetic datasets. The results show that the proposed parallel algorithm achieves significant speed-up with minor accuracy loss. We apply PD2-clustering to image concept learning. In addition, by extending D2-clustering to symbolic data, we apply PD2-clustering to protein sequence clustering. For both applications, we demonstrate that our new algorithm is highly competitive with other state-of-the-art methods.
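
The hierarchical multi-pass structure mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain, unweighted vectors and uses ordinary Lloyd's k-means as a stand-in for the per-node D2-clustering step, and the names `local_kmeans`, `multipass_cluster`, and `chunk_size` are hypothetical. It only shows the general idea of clustering chunks independently and then feeding the resulting centroids into the next pass.

```python
from concurrent.futures import ThreadPoolExecutor


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def local_kmeans(points, k, iters=20):
    """Plain Lloyd's k-means. ASSUMPTION: stand-in for the per-node step;
    the paper's D2-clustering works on bags of weighted vectors instead."""
    k = min(k, len(points))
    centroids = list(points[:k])  # deterministic init, for illustration only
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            groups[j].append(p)
        # Keep the old centroid when a group ends up empty.
        centroids = [mean(g) if g else centroids[i] for i, g in enumerate(groups)]
    return centroids


def multipass_cluster(points, k, chunk_size=50, workers=4):
    """Cluster chunks independently, then re-cluster their centroids,
    repeating until the remaining set fits in a single chunk."""
    if k >= chunk_size:
        raise ValueError("k must be smaller than chunk_size so each pass shrinks the data")
    level = list(points)
    # Threads keep the sketch short; CPU-bound work would need processes
    # or a cluster framework to actually run in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while len(level) > chunk_size:
            chunks = [level[i:i + chunk_size] for i in range(0, len(level), chunk_size)]
            results = pool.map(lambda chunk: local_kmeans(chunk, k), chunks)
            level = [c for centroids in results for c in centroids]
    return local_kmeans(level, k)


if __name__ == "__main__":
    # Two synthetic groups of points, interleaved so every chunk sees both.
    data = [(float(i % 2) * 10 + (i % 5) / 10, float(i % 2) * 10) for i in range(200)]
    print(multipass_cluster(data, k=2))
```

The chunk size controls the trade-off the abstract describes: smaller chunks make each node's job cheaper but add more passes of integration, while larger chunks do the opposite. This sketch also discards cluster sizes between passes, which a more faithful version would carry along as weights.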


Full Paper in Color
(PDF, 3.4MB)

Full Paper from the ACM
(link)

On-line Info (more to be made available)


Citation: Yu Zhang, James Z. Wang and Jia Li, "Parallel Massive Clustering of Discrete Distributions," ACM Transactions on Multimedia Computing, Communications and Applications, vol. 11, no. 4, article 49, pp. 49:1-24 and appendix:1-6, April 2015.

Copyright 2015 ACM. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the ACM.

Last Modified: June 5, 2015
© 2015