Transforming RNA-Seq data to improve the performance of prognostic gene signatures.

Gene expression measurements have successfully been used for building prognostic signatures, i.e., for identifying a short list of important genes that can predict patient outcome. Mostly microarray measurements have been considered, and there is little advice available for building multivariable risk...

Bibliographic Details
Main Authors: Isabella Zwiener, Barbara Frisch, Harald Binder
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2014-01-01
Series: PLoS ONE
Online Access: http://europepmc.org/articles/PMC3885686?pdf=render
id doaj-fccfd1bb32af4dada8032d18a07e0e32
record_format Article
spelling doaj-fccfd1bb32af4dada8032d18a07e0e32 2020-11-25T01:34:36Z eng Public Library of Science (PLoS) PLoS ONE 1932-6203 2014-01-01 9 1 e85150 10.1371/journal.pone.0085150 Transforming RNA-Seq data to improve the performance of prognostic gene signatures. Isabella Zwiener Barbara Frisch Harald Binder http://europepmc.org/articles/PMC3885686?pdf=render
collection DOAJ
language English
format Article
sources DOAJ
author Isabella Zwiener
Barbara Frisch
Harald Binder
title Transforming RNA-Seq data to improve the performance of prognostic gene signatures.
publisher Public Library of Science (PLoS)
series PLoS ONE
issn 1932-6203
publishDate 2014-01-01
description Gene expression measurements have successfully been used for building prognostic signatures, i.e., for identifying a short list of important genes that can predict patient outcome. Mostly microarray measurements have been considered, and there is little advice available for building multivariable risk prediction models from RNA-Seq data. We specifically consider penalized regression techniques, such as the lasso and componentwise boosting, which can simultaneously consider all measurements and provide both multivariable regression models for prediction and automated variable selection. However, they might be affected by the typical skewness, mean-variance dependency, or extreme values of RNA-Seq covariates and therefore could benefit from transformations of the latter. In an analytical part, we highlight preferential selection of covariates with large variances, which is problematic due to the mean-variance dependency of RNA-Seq data. In a simulation study, we compare different transformations of RNA-Seq data for potentially improving detection of important genes. Specifically, we consider standardization, the log transformation, a variance-stabilizing transformation, the Box-Cox transformation, and rank-based transformations. In addition, the prediction performance for real data from patients with kidney cancer and acute myeloid leukemia is considered. We show that signature size, identification performance, and prediction performance critically depend on the choice of a suitable transformation. Rank-based transformations perform well in all scenarios and can even outperform complex variance-stabilizing approaches. Generally, the results illustrate that the distribution and potential transformations of RNA-Seq data need to be considered as a critical step when building risk prediction models by penalized regression techniques.
url http://europepmc.org/articles/PMC3885686?pdf=render
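
Several of the transformations named in the description above (standardization, the log transformation, the Box-Cox transformation, and a rank-based transformation) can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function names, the count offset of 1, the fixed Box-Cox parameter, and the scaling of ranks to the interval (0, 1) are all assumptions made here. The variance-stabilizing transformation is left out because this record does not say which one was used. Rows are taken to be samples and columns to be genes.

```python
import numpy as np


def standardize(x):
    """Center each gene (column) to mean 0 and scale to unit sample variance.

    Assumes no column is constant; a constant column would divide by zero.
    """
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)


def log_transform(x, offset=1.0):
    """Log transformation; the offset (an assumption) avoids log(0) for zero counts."""
    return np.log(x + offset)


def box_cox(x, lam, offset=1.0):
    """Box-Cox transformation with a fixed, user-chosen parameter lam.

    lam = 0 gives the log transformation. In practice lam is usually
    estimated from the data; it is fixed here to keep the sketch short.
    """
    shifted = x + offset
    if lam == 0:
        return np.log(shifted)
    return (shifted ** lam - 1.0) / lam


def rank_transform(x):
    """Replace each value by its within-gene rank, scaled to (0, 1).

    Ties are broken by sample order (an assumption); average ranks
    would be a common alternative.
    """
    n = x.shape[0]
    order = np.argsort(x, axis=0, kind="stable")
    ranks = np.argsort(order, axis=0, kind="stable") + 1
    return ranks / (n + 1)


if __name__ == "__main__":
    # Hypothetical toy data: 6 samples, 3 genes with very different mean counts.
    rng = np.random.default_rng(0)
    counts = rng.negative_binomial(n=2, p=[0.5, 0.05, 0.005], size=(6, 3)).astype(float)
    print("raw variances:       ", counts.var(axis=0, ddof=1))
    print("log variances:       ", log_transform(counts).var(axis=0, ddof=1))
    print("rank-based variances:", rank_transform(counts).var(axis=0, ddof=1))
```

After the rank-based transformation every gene has the same variance whenever there are no ties, so a penalized regression fit on such covariates cannot prefer a gene merely because its counts are on a larger scale.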