Choice of library size normalization and statistical methods for differential gene expression analysis in balanced two-group comparisons for RNA-seq studies

Abstract Background High-throughput RNA sequencing (RNA-seq) has evolved as an important analytical tool in molecular biology. Although the utility and importance of this technique have grown, uncertainties regarding the proper analysis of RNA-seq data remain. Of primary concern, there is no consens...


Bibliographic Details
Main Authors: Xiaohong Li, Nigel G. F. Cooper, Timothy E. O’Toole, Eric C. Rouchka
Format: Article
Language: English
Published: BMC 2020-01-01
Series: BMC Genomics
Subjects: RNA-seq; Sample sizes; Normalization; Statistical test; Differentially expressed genes
Online Access: https://doi.org/10.1186/s12864-020-6502-7
id doaj-5211d3632bae457abe93c3e8b6f179bd
record_format Article
spelling doaj-5211d3632bae457abe93c3e8b6f179bd 2021-01-31T16:11:53Z eng BMC BMC Genomics 1471-2164 2020-01-01 21 1 1 17 10.1186/s12864-020-6502-7 Choice of library size normalization and statistical methods for differential gene expression analysis in balanced two-group comparisons for RNA-seq studies Xiaohong Li (0); Nigel G. F. Cooper (1); Timothy E. O’Toole (2); Eric C. Rouchka (3) Department of Anatomical Sciences and Neurobiology, University of Louisville (0); Department of Anatomical Sciences and Neurobiology, University of Louisville (1); Envirome Institute, University of Louisville (2); Department of Computer Science and Engineering, University of Louisville (3) Abstract. Background: High-throughput RNA sequencing (RNA-seq) has evolved as an important analytical tool in molecular biology. Although the utility and importance of this technique have grown, uncertainties regarding the proper analysis of RNA-seq data remain. Of primary concern, there is no consensus regarding which normalization and statistical methods are the most appropriate for analyzing these data. The lack of standardized analytical methods leads to uncertainties in data interpretation and study reproducibility, especially with studies reporting high false discovery rates. In this study, we compared a recently developed normalization method, UQ-pgQ2, with three of the most frequently used alternatives, RLE (relative log expression), TMM (trimmed mean of M-values) and UQ (upper quartile normalization), in the analysis of RNA-seq data. We evaluated the performance of these methods for gene-level differential expression analysis by considering several factors: 1) normalization combined with the choice of a Wald test from DESeq2 and an exact test/QL (quasi-likelihood) F-test from edgeR; 2) sample sizes in two balanced two-group comparisons; and 3) sequencing read depths.
Results: Using the MAQC RNA-seq datasets with small sample replicates, we found that UQ-pgQ2 normalization combined with an exact test can achieve better performance in terms of power and specificity in differential gene expression analysis. However, using an intra-group analysis of false positives from real and simulated data, we found that a Wald test performs better than an exact test when the number of sample replicates is large and that a QL F-test performs the best given sample sizes of 5, 10 and 15 for any normalization. The RLE, TMM and UQ methods performed similarly given a desired sample size. Conclusion: We found the UQ-pgQ2 method combined with an exact test/QL F-test is the best choice in order to control false positives when the sample size is small. When the sample size is large, UQ-pgQ2 with a QL F-test is a better choice for type I error control in an intra-group analysis. We observed that read depth had minimal impact on differential gene expression analysis based on the simulated data. https://doi.org/10.1186/s12864-020-6502-7 RNA-seq; Sample sizes; Normalization; Statistical test; Differentially expressed genes
collection DOAJ
language English
format Article
sources DOAJ
author Xiaohong Li
Nigel G. F. Cooper
Timothy E. O’Toole
Eric C. Rouchka
author_sort Xiaohong Li
title Choice of library size normalization and statistical methods for differential gene expression analysis in balanced two-group comparisons for RNA-seq studies
title_sort choice of library size normalization and statistical methods for differential gene expression analysis in balanced two-group comparisons for rna-seq studies
publisher BMC
series BMC Genomics
issn 1471-2164
publishDate 2020-01-01
description Abstract. Background: High-throughput RNA sequencing (RNA-seq) has evolved as an important analytical tool in molecular biology. Although the utility and importance of this technique have grown, uncertainties regarding the proper analysis of RNA-seq data remain. Of primary concern, there is no consensus regarding which normalization and statistical methods are the most appropriate for analyzing these data. The lack of standardized analytical methods leads to uncertainties in data interpretation and study reproducibility, especially with studies reporting high false discovery rates. In this study, we compared a recently developed normalization method, UQ-pgQ2, with three of the most frequently used alternatives, RLE (relative log expression), TMM (trimmed mean of M-values) and UQ (upper quartile normalization), in the analysis of RNA-seq data. We evaluated the performance of these methods for gene-level differential expression analysis by considering several factors: 1) normalization combined with the choice of a Wald test from DESeq2 and an exact test/QL (quasi-likelihood) F-test from edgeR; 2) sample sizes in two balanced two-group comparisons; and 3) sequencing read depths. Results: Using the MAQC RNA-seq datasets with small sample replicates, we found that UQ-pgQ2 normalization combined with an exact test can achieve better performance in terms of power and specificity in differential gene expression analysis. However, using an intra-group analysis of false positives from real and simulated data, we found that a Wald test performs better than an exact test when the number of sample replicates is large and that a QL F-test performs the best given sample sizes of 5, 10 and 15 for any normalization. The RLE, TMM and UQ methods performed similarly given a desired sample size. Conclusion: We found the UQ-pgQ2 method combined with an exact test/QL F-test is the best choice in order to control false positives when the sample size is small. When the sample size is large, UQ-pgQ2 with a QL F-test is a better choice for type I error control in an intra-group analysis. We observed that read depth had minimal impact on differential gene expression analysis based on the simulated data.
topic RNA-seq
Sample sizes
Normalization
Statistical test
Differentially expressed genes
url https://doi.org/10.1186/s12864-020-6502-7
work_keys_str_mv AT xiaohongli choiceoflibrarysizenormalizationandstatisticalmethodsfordifferentialgeneexpressionanalysisinbalancedtwogroupcomparisonsforrnaseqstudies
AT nigelgfcooper choiceoflibrarysizenormalizationandstatisticalmethodsfordifferentialgeneexpressionanalysisinbalancedtwogroupcomparisonsforrnaseqstudies
AT timothyeotoole choiceoflibrarysizenormalizationandstatisticalmethodsfordifferentialgeneexpressionanalysisinbalancedtwogroupcomparisonsforrnaseqstudies
AT ericcrouchka choiceoflibrarysizenormalizationandstatisticalmethodsfordifferentialgeneexpressionanalysisinbalancedtwogroupcomparisonsforrnaseqstudies
_version_ 1724316726764503040
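
As a rough illustration of what the library size normalization named in the description field involves, the sketch below computes per-sample scaling factors in two simplified ways: an upper-quartile (UQ) style factor and a median-of-ratios factor in the spirit of RLE. It is an assumption-laden toy in Python with NumPy, not the edgeR or DESeq2 implementation and not the UQ-pgQ2 method from the article; the function names and the toy count matrix are hypothetical.

```python
import numpy as np


def upper_quartile_factors(counts):
    """Simplified UQ-style factors: per-sample 75th percentile of counts,
    computed over genes with at least one read in any sample, then rescaled
    so the factors have a geometric mean of 1.
    Assumes every sample's upper quartile is positive."""
    counts = np.asarray(counts, dtype=float)
    expressed = counts.sum(axis=1) > 0           # drop genes with no reads at all
    uq = np.percentile(counts[expressed], 75, axis=0)
    return uq / np.exp(np.mean(np.log(uq)))


def median_of_ratios_factors(counts):
    """Simplified RLE-style factors: per-sample median ratio of each gene's
    count to that gene's geometric mean across samples, using only genes
    with positive counts in every sample."""
    counts = np.asarray(counts, dtype=float)
    positive = np.all(counts > 0, axis=1)        # log is undefined for zero counts
    log_counts = np.log(counts[positive])
    log_reference = log_counts.mean(axis=1, keepdims=True)
    return np.exp(np.median(log_counts - log_reference, axis=0))


# Hypothetical toy matrix (genes x samples); sample 2 is sequenced twice as deep.
toy = np.array([
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 0],
    [30, 60],
])

uq = upper_quartile_factors(toy)
rle = median_of_ratios_factors(toy)
normalized = toy / rle                           # divide each column by its factor

print("UQ-style factors: ", uq)
print("RLE-style factors:", rle)
print(normalized)
```

Because the second toy sample is exactly twice the first, both approaches assign it a factor twice as large, and the normalized columns coincide. Real data rarely behave this cleanly, which is why the choice of method matters in the comparisons the abstract describes.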