Sample size and statistical power considerations in high-dimensionality data settings: a comparative study of classification algorithms

Abstract. Background: Data generated using 'omics' technologies are characterized by high dimensionality, where the number of features measured per subject vastly exceeds the number of subjects in the study. In this paper, we consider issues rel...

Bibliographic Details
Main Authors: Guo Yu, Graber Armin, McBurney Robert N, Balasubramanian Raji
Format: Article
Language: English
Published: BMC 2010-09-01
Series: BMC Bioinformatics
Online Access: http://www.biomedcentral.com/1471-2105/11/447
id doaj-fe9d5de05b5343858f741d8070c38721
record_format Article
spelling doaj-fe9d5de05b5343858f741d8070c38721 2020-11-25T02:47:36Z eng BMC BMC Bioinformatics 1471-2105 2010-09-01 11 1 447 10.1186/1471-2105-11-447 Sample size and statistical power considerations in high-dimensionality data settings: a comparative study of classification algorithms Guo Yu Graber Armin McBurney Robert N Balasubramanian Raji Abstract. Background: Data generated using 'omics' technologies are characterized by high dimensionality, where the number of features measured per subject vastly exceeds the number of subjects in the study. In this paper, we consider issues relevant in the design of biomedical studies in which the goal is the discovery of a subset of features and an associated algorithm that can predict a binary outcome, such as disease status. We compare the performance of four commonly used classifiers (K-Nearest Neighbors, Prediction Analysis for Microarrays, Random Forests and Support Vector Machines) in high-dimensionality data settings. We evaluate the effects of varying levels of signal-to-noise ratio in the dataset, imbalance in class distribution and choice of metric for quantifying performance of the classifier. To guide study design, we present a summary of the key characteristics of 'omics' data profiled in several human or animal model experiments utilizing high-content mass spectrometry and multiplexed immunoassay based techniques. Results: The analysis of data from seven 'omics' studies revealed that the average magnitude of effect size observed in human studies was markedly lower when compared to that in animal studies. The data measured in human studies were characterized by higher biological variation and the presence of outliers. The results from simulation studies indicated that the classifier Prediction Analysis for Microarrays (PAM) had the highest power when the class conditional feature distributions were Gaussian and outcome distributions were balanced. Random Forests was optimal when feature distributions were skewed and when class distributions were unbalanced. We provide a free open-source R statistical software library (MVpower) that implements the simulation strategy proposed in this paper. Conclusion: No single classifier had optimal performance under all settings. Simulation studies provide useful guidance for the design of biomedical studies involving high-dimensionality data. http://www.biomedcentral.com/1471-2105/11/447
collection DOAJ
language English
format Article
sources DOAJ
author Guo Yu
Graber Armin
McBurney Robert N
Balasubramanian Raji
title Sample size and statistical power considerations in high-dimensionality data settings: a comparative study of classification algorithms
publisher BMC
series BMC Bioinformatics
issn 1471-2105
publishDate 2010-09-01
description Abstract. Background: Data generated using 'omics' technologies are characterized by high dimensionality, where the number of features measured per subject vastly exceeds the number of subjects in the study. In this paper, we consider issues relevant in the design of biomedical studies in which the goal is the discovery of a subset of features and an associated algorithm that can predict a binary outcome, such as disease status. We compare the performance of four commonly used classifiers (K-Nearest Neighbors, Prediction Analysis for Microarrays, Random Forests and Support Vector Machines) in high-dimensionality data settings. We evaluate the effects of varying levels of signal-to-noise ratio in the dataset, imbalance in class distribution and choice of metric for quantifying performance of the classifier. To guide study design, we present a summary of the key characteristics of 'omics' data profiled in several human or animal model experiments utilizing high-content mass spectrometry and multiplexed immunoassay based techniques. Results: The analysis of data from seven 'omics' studies revealed that the average magnitude of effect size observed in human studies was markedly lower when compared to that in animal studies. The data measured in human studies were characterized by higher biological variation and the presence of outliers. The results from simulation studies indicated that the classifier Prediction Analysis for Microarrays (PAM) had the highest power when the class conditional feature distributions were Gaussian and outcome distributions were balanced. Random Forests was optimal when feature distributions were skewed and when class distributions were unbalanced. We provide a free open-source R statistical software library (MVpower) that implements the simulation strategy proposed in this paper. Conclusion: No single classifier had optimal performance under all settings. Simulation studies provide useful guidance for the design of biomedical studies involving high-dimensionality data.
url http://www.biomedcentral.com/1471-2105/11/447
_version_ 1724752651618353152