Top scoring pairs for feature selection in machine learning and applications to cancer outcome prediction

Abstract. Background: The widely used k top scoring pair (k-TSP) algorithm is a simple yet powerful parameter-free classifier. It owes its success in many cancer microarray datasets to an effective feature selecti...


Bibliographic Details
Main Authors: Zhu Qifu, Ray Surajit, Shi Ping, Kon Mark A
Format: Article
Language: English
Published: BMC 2011-09-01
Series: BMC Bioinformatics
Online Access: http://www.biomedcentral.com/1471-2105/12/375
id doaj-4bcbd40e7e7f492ca8661049bd9ce52c
collection DOAJ
sources DOAJ
issn 1471-2105
description Abstract

Background: The widely used k top scoring pair (k-TSP) algorithm is a simple yet powerful parameter-free classifier. It owes its success in many cancer microarray datasets to an effective feature selection algorithm that is based on relative expression ordering of gene pairs. However, its general robustness does not extend to some difficult datasets, such as those involving cancer outcome prediction, which may be due to the relatively simple voting scheme used by the classifier. We believe that the performance can be enhanced by separating its effective feature selection component and combining it with a powerful classifier such as the support vector machine (SVM). More generally, the top scoring pairs generated by the k-TSP ranking algorithm can be used as a dimensionally reduced subspace for other machine learning classifiers.

Results: We developed an approach integrating the k-TSP ranking algorithm (TSP) with other machine learning methods, allowing combination of the computationally efficient, multivariate feature ranking of k-TSP with multivariate classifiers such as SVM. We evaluated this hybrid scheme (k-TSP+SVM) in a range of simulated datasets with known data structures. As compared with other feature selection methods, such as a univariate method similar to Fisher's discriminant criterion (Fisher), or a recursive feature elimination embedded in SVM (RFE), TSP is increasingly more effective than the other two methods as the informative genes become progressively more correlated, which is demonstrated both in terms of the classification performance and the ability to recover true informative genes. We also applied this hybrid scheme to four cancer prognosis datasets, in which k-TSP+SVM outperforms the k-TSP classifier in all datasets, and achieves either comparable or superior performance to that using SVM alone. In concurrence with what is observed in simulation, TSP appears to be a better feature selector than Fisher and RFE in some of the cancer datasets.

Conclusions: The k-TSP ranking algorithm can be used as a computationally efficient, multivariate filter method for feature selection in machine learning. SVM in combination with the k-TSP ranking algorithm outperforms k-TSP and SVM alone in simulated datasets and in some cancer prognosis datasets. Simulation studies suggest that as a feature selector, it is better tuned to certain data characteristics, i.e. correlations among informative genes, which is potentially interesting as an alternative feature ranking method in pathway analysis.
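The abstract describes ranking gene pairs by relative expression ordering and passing the selected genes to a separate classifier. A minimal sketch of that ranking step follows. This record does not state the score formula, so the score used here, |P(x_i < x_j | class 0) - P(x_i < x_j | class 1)|, is an assumption taken from the general top scoring pair literature. The function names (`tsp_scores`, `top_k_disjoint_pairs`, `selected_genes`) and the toy data are hypothetical, and the secondary tie-breaking used by published k-TSP implementations is omitted.

```python
from itertools import combinations


def tsp_scores(X, y):
    """Score each gene pair (i, j) by |P(x_i < x_j | y=0) - P(x_i < x_j | y=1)|.

    X is a list of samples (each a list of expression values);
    y is a list of 0/1 class labels. The score formula is an assumption.
    """
    n_genes = len(X[0])
    class0 = [row for row, label in zip(X, y) if label == 0]
    class1 = [row for row, label in zip(X, y) if label == 1]
    scores = {}
    for i, j in combinations(range(n_genes), 2):
        # Fraction of samples in each class where gene i is expressed below gene j
        p0 = sum(row[i] < row[j] for row in class0) / len(class0)
        p1 = sum(row[i] < row[j] for row in class1) / len(class1)
        scores[(i, j)] = abs(p0 - p1)
    return scores


def top_k_disjoint_pairs(scores, k):
    """Greedily keep the k highest-scoring pairs that share no gene."""
    chosen, used = [], set()
    for (i, j), _ in sorted(scores.items(), key=lambda kv: -kv[1]):
        if i in used or j in used:
            continue
        chosen.append((i, j))
        used.update((i, j))
        if len(chosen) == k:
            break
    return chosen


def selected_genes(pairs):
    """Flatten the chosen pairs into a sorted list of gene indices."""
    return sorted({g for pair in pairs for g in pair})


# Hypothetical toy data: 4 samples, 4 genes; genes 0 and 1 flip order between classes.
X = [
    [1.0, 2.0, 5.0, 5.1],
    [1.5, 3.0, 4.0, 4.2],
    [2.0, 1.0, 3.0, 3.5],
    [3.0, 0.5, 6.0, 6.1],
]
y = [0, 0, 1, 1]

scores = tsp_scores(X, y)
pairs = top_k_disjoint_pairs(scores, k=1)
genes = selected_genes(pairs)
```

In a hybrid scheme of the kind the abstract calls k-TSP+SVM, the columns listed in `genes` would then serve as the reduced feature set for an SVM or another classifier; that training step is not shown here.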