Biologists often employ clustering
techniques in the exploratory phase of microarray
data analysis to discover relevant biological groupings. Given the
availability of numerous clustering algorithms in the machine learning
literature, a user might want to select the one that
performs best on their data set or application. Various validation
measures have been proposed over the years to judge the quality of clusters
produced by a given clustering algorithm, including their biological
relevance. Unfortunately, a given clustering
algorithm can perform poorly under one validation measure while
outperforming many other algorithms under another validation measure. A
manual synthesis of results from multiple validation measures is nearly
impossible in practice, especially when a large number of clustering
algorithms are to be compared using several measures. An automated and
objective way of reconciling the rankings is needed.
Using a Monte Carlo cross-entropy
algorithm, we successfully combine the ranks of a set of clustering
algorithms under consideration via a weighted aggregation that optimizes a
distance criterion. The proposed weighted rank aggregation allows for a far
more objective and automated assessment of clustering results than a simple
visual inspection. We illustrate our procedure using one simulated and
three real gene expression data sets from various platforms, where we
rank a total of eleven clustering algorithms using a combined examination
of ten different validation measures. The aggregate rankings were found for
a given number of clusters k and also for an entire range of k. Generally speaking,
UPGMA and SOTA, each with the right combination of dissimilarity measure,
emerge as the overall top performers.
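
To make the idea of weighted rank aggregation concrete, the following is a minimal sketch, not the procedure used in this work. It assumes a weighted Spearman footrule distance as the criterion and, instead of the Monte Carlo cross-entropy search, simply enumerates every candidate ordering, which is only feasible for a handful of algorithms. The function names, the labels "A" to "D", and the weights are hypothetical.

```python
from itertools import permutations


def footrule(candidate, ranking):
    """Spearman footrule distance between two orderings of the same items."""
    position = {item: i for i, item in enumerate(ranking)}
    return sum(abs(i - position[item]) for i, item in enumerate(candidate))


def aggregate(rankings, weights):
    """Return the ordering that minimizes the weighted sum of footrule
    distances to the input rankings (exhaustive search, small inputs only)."""
    items = sorted(rankings[0])

    def cost(candidate):
        return sum(w * footrule(candidate, r) for w, r in zip(weights, rankings))

    return list(min(permutations(items), key=cost))


# Hypothetical example: three validation measures each rank four algorithms.
rankings = [
    ["A", "B", "C", "D"],
    ["A", "C", "B", "D"],
    ["B", "A", "C", "D"],
]

equal = aggregate(rankings, [1.0, 1.0, 1.0])
skewed = aggregate(rankings, [1.0, 5.0, 1.0])  # second measure dominates
print(equal, skewed)
```

With equal weights the aggregate ordering is the one closest to all three lists; raising the weight of one measure pulls the result toward that measure's ranking. Exhaustive enumeration grows factorially with the number of algorithms, which is why a stochastic search is needed at realistic sizes.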