Journal of Machine Learning Research 3 (2002) 583-617    Submitted 4/02; Published 12/02

Cluster Ensembles – A Knowledge Reuse Framework for Combining Multiple Partitions

Alexander Strehl    [email protected]
Joydeep Ghosh    [email protected]
Department of Electrical and Computer Engineering
The University of Texas at Austin
Austin, TX 78712, USA

Editor: Claire Cardie

Abstract

This paper introduces the problem of combining multiple partitionings of a set of objects into a single consolidated clustering without accessing the features or algorithms that determined these partitionings. We first identify several application scenarios for the resultant 'knowledge reuse' framework that we call cluster ensembles. The cluster ensemble problem is then formalized as a combinatorial optimization problem in terms of shared mutual information. In addition to a direct maximization approach, we propose three effective and efficient techniques for obtaining high-quality combiners (consensus functions). The first combiner induces a similarity measure from the partitionings and then reclusters the objects. The second combiner is based on hypergraph partitioning. The third one collapses groups of clusters into meta-clusters, which then compete for each object to determine the combined clustering. Due to the low computational costs of our techniques, it is quite feasible to use a supra-consensus function that evaluates all three approaches against the objective function and picks the best solution for a given situation. We evaluate the effectiveness of cluster ensembles in three qualitatively different application scenarios: (i) where the original clusters were formed based on non-identical sets of features, (ii) where the original clustering algorithms worked on non-identical sets of objects, and (iii) where a common data-set is used and the main purpose of combining multiple clusterings is to improve the quality and robustness of the solution. Promising results are obtained in all three situations for synthetic as well as real data-sets.

Keywords: cluster analysis, clustering, partitioning, unsupervised learning, multi-learner systems, ensemble, mutual information, consensus functions, knowledge reuse

© 2002 Alexander Strehl and Joydeep Ghosh.

1. Introduction

The notion of integrating multiple data sources and/or learned models is found in several disciplines, for example, the combining of estimators in econometrics (Granger, 1989), evidences in rule-based systems (Barnett, 1981) and multi-sensor data fusion (Dasarathy, 1994). A simple but effective type of multi-learner system is an ensemble in which each component learner (typically a regressor or classifier) tries to solve the same task. While early studies on combining multiple rankings, such as the works by Borda and Condorcet, pre-date the French Revolution (Ghosh, 2002a), this area noticeably came to life in the past
decade, and now even boasts its own series of dedicated workshops (Kittler and Roli, 2002).
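The first consensus function described in the abstract, inducing a pairwise similarity from the partitionings and then reclustering, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper reclusters the induced similarity via graph partitioning, whereas this sketch substitutes a simple thresholded connected-components step for readability. All function names and the threshold value are hypothetical.

```python
def coassociation(labelings):
    """Co-association similarity: the fraction of input partitionings
    in which each pair of objects is assigned to the same cluster."""
    n = len(labelings[0])
    r = len(labelings)
    sim = [[0.0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    sim[i][j] += 1.0 / r
    return sim

def recluster(sim, threshold=0.5):
    """Toy consensus step (stand-in for the paper's graph partitioning):
    link pairs whose co-association exceeds the threshold and return the
    connected components as the combined clustering."""
    n = len(sim)
    label = [-1] * n
    next_label = 0
    for i in range(n):
        if label[i] == -1:
            # depth-first traversal of the thresholded similarity graph
            stack = [i]
            label[i] = next_label
            while stack:
                u = stack.pop()
                for v in range(n):
                    if label[v] == -1 and sim[u][v] > threshold:
                        label[v] = next_label
                        stack.append(v)
            next_label += 1
    return label
```

For example, given three partitionings of four objects, `[[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]]`, objects 0 and 1 co-occur in two of the three partitionings (similarity 2/3) while objects 2 and 3 co-occur in all three, so the combined clustering groups {0, 1} and {2, 3}. Note the quadratic cost in the number of objects, which is why the paper treats this combiner as suited to modest data-set sizes.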
