Pooling breast cancer datasets has a synergetic effect on classification performance and improves signature stability
Abstract. Background: Michiels et al. (Lancet 2005; 365: 488–92) employed a resampling strategy to show that the genes identified as predictors of prognosis from resamplings of a single gene expression dataset are highly variable. The...
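Main Authors: | van de Vijver Marc J, Horlings Hugo M, Reyal Fabien, van Vliet Martin H, Reinders Marcel JT, Wessels Lodewyk FA |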
---|---|
Format: | Article |
Language: | English |
Published: | BMC 2008-08-01 |
Series: | BMC Genomics |
Online Access: | http://www.biomedcentral.com/1471-2164/9/375 |
id |
doaj-052ce508bd944b2a857c853b84f87ed8 |
---|---|
record_format |
Article |
spelling |
doaj-052ce508bd944b2a857c853b84f87ed8; 2020-11-25T00:55:22Z; eng; BMC; BMC Genomics; 1471-2164; 2008-08-01; 9; 1; 375; 10.1186/1471-2164-9-375; Pooling breast cancer datasets has a synergetic effect on classification performance and improves signature stability; van de Vijver Marc J, Horlings Hugo M, Reyal Fabien, van Vliet Martin H, Reinders Marcel JT, Wessels Lodewyk FA; Abstract. Background: Michiels et al. (Lancet 2005; 365: 488–92) employed a resampling strategy to show that the genes identified as predictors of prognosis from resamplings of a single gene expression dataset are highly variable. The genes most frequently identified in the separate resamplings were put forward as a 'gold standard'. On a higher level, breast cancer datasets collected by different institutions can be considered as resamplings from the underlying breast cancer population. The limited overlap between published prognostic signatures confirms the trend of signature instability identified by the resampling strategy. Six breast cancer datasets, totaling 947 samples, all measured on the Affymetrix platform, are currently available. This provides a unique opportunity to employ a substantial dataset to investigate the effects of pooling datasets on classifier accuracy, signature stability and enrichment of functional categories. Results: We show that the resampling strategy produces a suboptimal ranking of genes, which cannot be considered a 'gold standard'. When pooling breast cancer datasets, we observed a synergetic effect on the classification performance in 73% of the cases. We also observed a significant positive correlation between the number of datasets that are pooled, the validation performance, the number of genes selected, and the enrichment of specific functional categories. In addition, we have evaluated the support for five explanations that have been postulated for the limited overlap of signatures. Conclusion: The limited overlap of current signature genes can be attributed to small sample size. Pooling datasets results in more accurate classification and a convergence of signature genes. We therefore advocate the analysis of new data within the context of a compendium, rather than analysis in isolation.; http://www.biomedcentral.com/1471-2164/9/375 |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
van de Vijver Marc J; Horlings Hugo M; Reyal Fabien; van Vliet Martin H; Reinders Marcel JT; Wessels Lodewyk FA |
title |
Pooling breast cancer datasets has a synergetic effect on classification performance and improves signature stability |
publisher |
BMC |
series |
BMC Genomics |
issn |
1471-2164 |
publishDate |
2008-08-01 |
description |
Abstract. Background: Michiels et al. (Lancet 2005; 365: 488–92) employed a resampling strategy to show that the genes identified as predictors of prognosis from resamplings of a single gene expression dataset are highly variable. The genes most frequently identified in the separate resamplings were put forward as a 'gold standard'. On a higher level, breast cancer datasets collected by different institutions can be considered as resamplings from the underlying breast cancer population. The limited overlap between published prognostic signatures confirms the trend of signature instability identified by the resampling strategy. Six breast cancer datasets, totaling 947 samples, all measured on the Affymetrix platform, are currently available. This provides a unique opportunity to employ a substantial dataset to investigate the effects of pooling datasets on classifier accuracy, signature stability and enrichment of functional categories. Results: We show that the resampling strategy produces a suboptimal ranking of genes, which cannot be considered a 'gold standard'. When pooling breast cancer datasets, we observed a synergetic effect on the classification performance in 73% of the cases. We also observed a significant positive correlation between the number of datasets that are pooled, the validation performance, the number of genes selected, and the enrichment of specific functional categories. In addition, we have evaluated the support for five explanations that have been postulated for the limited overlap of signatures. Conclusion: The limited overlap of current signature genes can be attributed to small sample size. Pooling datasets results in more accurate classification and a convergence of signature genes. We therefore advocate the analysis of new data within the context of a compendium, rather than analysis in isolation. |
url |
http://www.biomedcentral.com/1471-2164/9/375 |