Speeding up the Consensus Clustering methodology for microarray data analysis
Abstract. Background: The inference of the number of clusters in a dataset, a fundamental problem in Statistics, Data Analysis and Classification, is usually addressed via internal validation measures. The stated problem is quite difficult, in particular...
Main Authors: | Utro Filippo, Giancarlo Raffaele |
---|---|
Format: | Article |
Language: | English |
Published: | BMC 2011-01-01 |
Series: | Algorithms for Molecular Biology |
Online Access: | http://www.almob.org/content/6/1/1 |
id |
doaj-cbf1d67da9ea4425a3175e56603f8317 |
---|---|
record_format |
Article |
spelling |
doaj-cbf1d67da9ea4425a3175e56603f8317; 2020-11-25T01:03:49Z; eng; BMC; Algorithms for Molecular Biology; 1748-7188; 2011-01-01; 6; 1; 1; 10.1186/1748-7188-6-1; Speeding up the Consensus Clustering methodology for microarray data analysis; Utro Filippo; Giancarlo Raffaele; http://www.almob.org/content/6/1/1 |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Utro Filippo Giancarlo Raffaele |
spellingShingle |
Utro Filippo Giancarlo Raffaele Speeding up the Consensus Clustering methodology for microarray data analysis Algorithms for Molecular Biology |
author_facet |
Utro Filippo Giancarlo Raffaele |
author_sort |
Utro Filippo |
title |
Speeding up the Consensus Clustering methodology for microarray data analysis |
title_short |
Speeding up the Consensus Clustering methodology for microarray data analysis |
title_full |
Speeding up the Consensus Clustering methodology for microarray data analysis |
title_fullStr |
Speeding up the Consensus Clustering methodology for microarray data analysis |
title_full_unstemmed |
Speeding up the Consensus Clustering methodology for microarray data analysis |
title_sort |
speeding up the consensus clustering methodology for microarray data analysis |
publisher |
BMC |
series |
Algorithms for Molecular Biology |
issn |
1748-7188 |
publishDate |
2011-01-01 |
description |
Abstract. Background: The inference of the number of clusters in a dataset, a fundamental problem in Statistics, Data Analysis and Classification, is usually addressed via internal validation measures. The stated problem is quite difficult, in particular for microarrays, since the inferred prediction must be sensible enough to capture the inherent biological structure in a dataset, e.g., functionally related genes. Despite the rich literature in that area, identifying an internal validation measure that is both fast and precise has proved elusive. To partially fill this gap, we propose a speed-up of Consensus (Consensus Clustering), a methodology that predicts the number of clusters in a dataset and also provides a dissimilarity matrix (the consensus matrix) that can be used by clustering algorithms. As detailed in the remainder of the paper, Consensus is a natural candidate for a speed-up. Results: Since the time-precision performance of Consensus depends on two parameters, our first task is to show that a simple adjustment of the parameters is not enough to obtain a good precision-time trade-off. Our second task is to provide a fast approximation algorithm for Consensus: the closely related algorithm FC (Fast Consensus), designed to have the same precision as Consensus with substantially better time performance. The performance of FC has been assessed via extensive experiments on twelve benchmark datasets that summarize key features of microarray applications, such as cancer studies, gene expression with up and down patterns, and a full spectrum of dimensionality up to over a thousand.
Based on the outcome of these experiments, compared with previous benchmarking results available in the literature, FC turns out to be among the fastest internal validation methods, while retaining the same outstanding precision as Consensus. Moreover, it also provides a consensus matrix that can be used as a dissimilarity matrix, guaranteeing the same performance as the corresponding matrix produced by Consensus. We have also experimented with the use of Consensus and FC in conjunction with NMF (Nonnegative Matrix Factorization), in order to identify the correct number of clusters in a dataset. Although NMF is an increasingly popular technique for biological data mining, our results are somewhat disappointing; they complement the state of the art on NMF, shedding further light on its merits and limitations. Conclusions: In summary, FC with a parameter setting that makes it robust with respect to small and medium-sized datasets, i.e., number of items to cluster in the hundreds and number of conditions up to a thousand, seems to be the internal validation measure of choice. Moreover, the technique we have developed here can be used in other contexts, in particular for the speed-up of stability-based validation measures. |
url |
http://www.almob.org/content/6/1/1 |
work_keys_str_mv |
AT utrofilippo speedinguptheconsensusclusteringmethodologyformicroarraydataanalysis AT giancarloraffaele speedinguptheconsensusclusteringmethodologyformicroarraydataanalysis |
_version_ |
1725199343977234432 |
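
The abstract refers to a consensus matrix that doubles as a dissimilarity matrix. As a rough illustration only, the sketch below builds such a matrix from repeated clusterings of subsamples, assuming the commonly used definition (the fraction of runs in which two items, both sampled, fall in the same cluster). The record does not specify the authors' procedure, so the function names, the input format and the toy data are all hypothetical, and nothing here reproduces the FC speed-up itself.

```python
from itertools import combinations


def consensus_matrix(n_items, runs):
    """Build a consensus matrix from repeated clusterings of subsamples.

    runs: list of dicts mapping item index -> cluster label for the items
    drawn in that resampling run (a hypothetical input format).
    Entry [i][j] is the fraction of runs containing both i and j in which
    the two items received the same label. Pairs never sampled together
    are left at 0.0, which is an assumption of this sketch.
    """
    together = [[0] * n_items for _ in range(n_items)]
    sampled = [[0] * n_items for _ in range(n_items)]
    for labels in runs:
        for i, j in combinations(sorted(labels), 2):
            sampled[i][j] += 1
            sampled[j][i] += 1
            if labels[i] == labels[j]:
                together[i][j] += 1
                together[j][i] += 1
    matrix = [[0.0] * n_items for _ in range(n_items)]
    for i in range(n_items):
        matrix[i][i] = 1.0  # an item always co-clusters with itself
        for j in range(n_items):
            if i != j and sampled[i][j] > 0:
                matrix[i][j] = together[i][j] / sampled[i][j]
    return matrix


def to_dissimilarity(matrix):
    """Turn consensus values in [0, 1] into dissimilarities (1 - consensus)."""
    return [[1.0 - value for value in row] for row in matrix]


# Toy data: three resampling runs over four items (made up for illustration).
runs = [
    {0: "a", 1: "a", 2: "b"},
    {0: "x", 1: "x", 3: "y"},
    {1: "q", 2: "q", 3: "p"},
]
M = consensus_matrix(4, runs)
D = to_dissimilarity(M)
```

In this toy input, items 0 and 1 are grouped together in every run that contains both, so their consensus is 1.0 and their dissimilarity 0.0, whereas items 1 and 2 agree in only one of their two shared runs. A matrix like `D` could then be passed to any clustering routine that accepts precomputed dissimilarities.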