Optimal classifier selection and negative bias in error rate estimation: an empirical study on high-dimensional prediction

Abstract. Background: In biometric practice, researchers often apply a large number of different methods in a "trial-and-error" strategy to get as much as possible out of their data and, due to publication pressure or pressure from the consultin...

Bibliographic Details
Main Authors: Boulesteix Anne-Laure, Strobl Carolin
Format: Article
Language: English
Published: BMC 2009-12-01
Series: BMC Medical Research Methodology
Online Access: http://www.biomedcentral.com/1471-2288/9/85
id doaj-92d5b0d7c15d45faa83b0b3fa420667b
record_format Article
spelling doaj-92d5b0d7c15d45faa83b0b3fa420667b 2020-11-24T22:10:05Z eng BMC BMC Medical Research Methodology 1471-2288 2009-12-01 9 1 85 10.1186/1471-2288-9-85 Optimal classifier selection and negative bias in error rate estimation: an empirical study on high-dimensional prediction Boulesteix Anne-Laure Strobl Carolin http://www.biomedcentral.com/1471-2288/9/85
collection DOAJ
language English
format Article
sources DOAJ
author Boulesteix Anne-Laure
Strobl Carolin
title Optimal classifier selection and negative bias in error rate estimation: an empirical study on high-dimensional prediction
publisher BMC
series BMC Medical Research Methodology
issn 1471-2288
publishDate 2009-12-01
description Abstract
Background: In biometric practice, researchers often apply a large number of different methods in a "trial-and-error" strategy to get as much as possible out of their data and, due to publication pressure or pressure from the consulting customer, present only the most favorable results. This strategy may induce a substantial optimistic bias in prediction error estimation, which is quantitatively assessed in the present manuscript. The focus of our work is on class prediction based on high-dimensional data (e.g. microarray data), since such analyses are particularly exposed to this kind of bias.
Methods: In our study we consider a total of 124 variants of classifiers (possibly including variable selection or tuning steps) within a cross-validation evaluation scheme. The classifiers are applied to original and modified real microarray data sets, some of which are obtained by randomly permuting the class labels to mimic non-informative predictors while preserving their correlation structure.
Results: We assess the minimal misclassification rate over the different variants of classifiers in order to quantify the bias arising when the optimal classifier is selected a posteriori in a data-driven manner. The bias resulting from the parameter tuning (including gene selection parameters as a special case) and the bias resulting from the choice of the classification method are examined both separately and jointly.
Conclusions: The median minimal error rate over the investigated classifiers was as low as 31% and 41% based on permuted uninformative predictors from studies on colon cancer and prostate cancer, respectively. We conclude that the strategy to present only the optimal result is not acceptable because it yields a substantial bias in error rate estimation, and suggest alternative approaches for properly reporting classification accuracy.
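The selection effect described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' procedure: it assumes synthetic Gaussian noise instead of real microarray data, and a hypothetical nearest-centroid rule with top-k feature selection instead of the 124 classifier variants studied. The function names `cv_error` and `simulate` and all parameter values are illustrative assumptions. Because the labels are randomly permuted, every variant has a true error rate of 50%, yet the minimum cross-validated error over the variants tends to fall below that.

```python
import random
import statistics


def cv_error(X, y, folds, n_features):
    """Cross-validated error of a nearest-centroid rule (hypothetical variant).

    The top `n_features` features are re-selected inside every training
    fold, so the selection step itself never sees the held-out samples.
    """
    n, p = len(y), len(X[0])
    errors = 0
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        g0 = [i for i in train if y[i] == 0]
        g1 = [i for i in train if y[i] == 1]
        # Class centroids computed on the training samples only.
        m0 = [sum(X[i][j] for i in g0) / len(g0) for j in range(p)]
        m1 = [sum(X[i][j] for i in g1) / len(g1) for j in range(p)]
        # Rank features by absolute difference between class means.
        top = sorted(range(p), key=lambda j: abs(m0[j] - m1[j]),
                     reverse=True)[:n_features]
        for i in test:
            d0 = sum((X[i][j] - m0[j]) ** 2 for j in top)
            d1 = sum((X[i][j] - m1[j]) ** 2 for j in top)
            pred = 0 if d0 <= d1 else 1
            errors += pred != y[i]
    return errors / n


def simulate(n_runs=20, n=30, p=50, variants=(1, 2, 5, 10, 20, 50),
             n_folds=5, seed=1):
    """Return per-variant CV errors and the per-run minimum over variants."""
    rng = random.Random(seed)
    per_variant = {k: [] for k in variants}
    minima = []
    for _ in range(n_runs):
        # Assumption: pure-noise predictors, unrelated to the labels.
        X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
        y = [0] * (n // 2) + [1] * (n - n // 2)
        rng.shuffle(y)  # permuted labels carry no information about X
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[f::n_folds] for f in range(n_folds)]
        errs = {k: cv_error(X, y, folds, k) for k in variants}
        for k, e in errs.items():
            per_variant[k].append(e)
        # Reporting only the best variant is the step that induces the bias.
        minima.append(min(errs.values()))
    return per_variant, minima


if __name__ == "__main__":
    per_variant, minima = simulate()
    for k, errs in per_variant.items():
        print(f"top {k:>2} features: mean CV error = {statistics.mean(errs):.3f}")
    print(f"minimum over variants: mean = {statistics.mean(minima):.3f}")
```

Each variant is evaluated with a correctly nested cross-validation, so any gap between the per-variant means and the mean of the minima comes only from picking the best variant after the fact. The variants here share data and folds and are therefore correlated, as the real classifiers would be. The size of the gap in this toy setting says nothing about the 31% and 41% figures reported for the real data sets.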
url http://www.biomedcentral.com/1471-2288/9/85