Consensus clustering

Abstract

We address the consensus clustering problem of combining multiple partitions of a set of objects into a single consolidated partition. The input is a set of cluster labelings only; we have no access to the original data or to the clustering algorithms that produced these partitions. After introducing a distribution-based view of partitions, we propose a series of entropy-based distance functions for comparing partitions. Given a set of candidate partitions, consensus clustering is then formalized as an optimization problem: search for a centroid partition with the smallest distance to that set. In addition to directly selecting the local centroid among the candidates, we present two combining methods based on similarity-based graph partitioning. Under certain conditions, the centroid partition is likely to rank near the top or the middle of the candidates in terms of closeness to the true partition. Finally, we evaluate the approach on both artificial and real datasets, with candidates drawn from either the full space or a subspace.
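To make the centroid formulation concrete, the following is a minimal sketch of selecting a local centroid from a set of candidate labelings. It assumes the variation of information as the entropy-based distance, which is one standard choice and not necessarily one of the distance functions proposed here. The names `entropy`, `variation_of_information`, `local_centroid` and the toy `candidates` are hypothetical and introduced only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of the cluster-size distribution of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def variation_of_information(a, b):
    """Assumed distance: VI(a, b) = 2 H(a, b) - H(a) - H(b).

    The joint entropy H(a, b) is computed from the pairs of labels that
    each object receives under the two partitions.
    """
    joint = list(zip(a, b))
    return 2 * entropy(joint) - entropy(a) - entropy(b)


def local_centroid(partitions):
    """Index of the candidate with the smallest summed distance to all candidates."""
    def total_distance(i):
        return sum(variation_of_information(partitions[i], q) for q in partitions)
    return min(range(len(partitions)), key=total_distance)


# Toy candidate set: four labelings of six objects (illustrative data only).
candidates = [
    [0, 0, 1, 1, 2, 2],
    [1, 1, 0, 0, 2, 2],  # same partition as the first, up to relabeling
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
]
best = local_centroid(candidates)
print("local centroid index:", best)
```

Because the distance depends only on the label distributions, two labelings that differ by a renaming of clusters are at distance zero, so no label alignment step is needed before comparing candidates.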

Keywords

Cluster analysis, centroid clustering, consensus clustering, entropy, distance function