Statistical Methods to Account for Gene-Level Covariates in Normalization of High-Dimensional Read-Count Data

The goal of genetic-based cancer research is often to identify which genes behave differently in cancerous and healthy tissue. This difference in behavior, referred to as differential expression, may lead researchers to more targeted preventative care and treatment. One way to measure the expression...


Bibliographic Details
Main Author:	Lenz, Lauren Holt
Format:	Others
Published:	DigitalCommons@USU 2018
Subjects:	RNA-Seq normalization gene-level covariates high-dimensional read-counts Statistics and Probability
Online Access:	https://digitalcommons.usu.edu/etd/7392 https://digitalcommons.usu.edu/cgi/viewcontent.cgi?article=8509&context=etd

id	ndltd-UTAHS-oai-digitalcommons.usu.edu-etd-8509
record_format	oai_dc
spelling	ndltd-UTAHS-oai-digitalcommons.usu.edu-etd-8509 2019-10-13T06:03:20Z Statistical Methods to Account for Gene-Level Covariates in Normalization of High-Dimensional Read-Count Data Lenz, Lauren Holt The goal of genetic-based cancer research is often to identify which genes behave differently in cancerous and healthy tissue. This difference in behavior, referred to as differential expression, may lead researchers to more targeted preventative care and treatment. One way to measure the expression of genes is through a process called RNA-Seq, which takes physical tissue samples and maps gene products and fragments in the sample back to the gene that created them, resulting in a large read-count matrix with genes in the rows and a column for each sample. The read-counts for tumor and normal samples are then compared in a process called differential expression analysis. However, normalization of these read-counts is a necessary pre-processing step, in order to account for differences in the read-count values due to non-expression-related variables. It is common in recent RNA-Seq normalization methods to also account for gene-level covariates, namely gene length in base pairs and GC-content, the proportion of bases in the gene that are guanine and cytosine. Here a colorectal cancer RNA-Seq read-count data set comprising 30,220 genes and 378 samples is examined. Two of the normalization methods that account for gene length and GC-content, CQN and EDASeq, are extended to account for protein coding status as a third gene-level covariate. The binary nature of protein coding status results in unique computational issues. The normalized read counts from CQN, EDASeq, and four new normalization methods are used for differential expression analysis via the nonparametric Wilcoxon Rank-Sum Test as well as the lme4 pipeline that produces per-gene models based on a negative binomial distribution. The resulting differential expression results are compared for two genes of interest in colorectal cancer, APC and CTNNB1, both of the WNT signaling pathway. 2018-12-01T08:00:00Z text application/pdf https://digitalcommons.usu.edu/etd/7392 https://digitalcommons.usu.edu/cgi/viewcontent.cgi?article=8509&context=etd Copyright for this work is held by the author. Transmission or reproduction of materials protected by copyright beyond that allowed by fair use requires the written permission of the copyright owners. Works not in the public domain cannot be commercially exploited without permission of the copyright owner. Responsibility for any use rests exclusively with the user. For more information contact digitalcommons@usu.edu. All Graduate Theses and Dissertations DigitalCommons@USU RNA-Seq normalization gene-level covariates high-dimensional read-counts Statistics and Probability
collection	NDLTD
format	Others
sources	NDLTD
topic	RNA-Seq normalization gene-level covariates high-dimensional read-counts Statistics and Probability
spellingShingle	RNA-Seq normalization gene-level covariates high-dimensional read-counts Statistics and Probability Lenz, Lauren Holt Statistical Methods to Account for Gene-Level Covariates in Normalization of High-Dimensional Read-Count Data
description	The goal of genetic-based cancer research is often to identify which genes behave differently in cancerous and healthy tissue. This difference in behavior, referred to as differential expression, may lead researchers to more targeted preventative care and treatment. One way to measure the expression of genes is through a process called RNA-Seq, which takes physical tissue samples and maps gene products and fragments in the sample back to the gene that created them, resulting in a large read-count matrix with genes in the rows and a column for each sample. The read-counts for tumor and normal samples are then compared in a process called differential expression analysis. However, normalization of these read-counts is a necessary pre-processing step, in order to account for differences in the read-count values due to non-expression-related variables. It is common in recent RNA-Seq normalization methods to also account for gene-level covariates, namely gene length in base pairs and GC-content, the proportion of bases in the gene that are guanine and cytosine. Here a colorectal cancer RNA-Seq read-count data set comprising 30,220 genes and 378 samples is examined. Two of the normalization methods that account for gene length and GC-content, CQN and EDASeq, are extended to account for protein coding status as a third gene-level covariate. The binary nature of protein coding status results in unique computational issues. The normalized read counts from CQN, EDASeq, and four new normalization methods are used for differential expression analysis via the nonparametric Wilcoxon Rank-Sum Test as well as the lme4 pipeline that produces per-gene models based on a negative binomial distribution. The resulting differential expression results are compared for two genes of interest in colorectal cancer, APC and CTNNB1, both of the WNT signaling pathway.
author	Lenz, Lauren Holt
title	Statistical Methods to Account for Gene-Level Covariates in Normalization of High-Dimensional Read-Count Data
title_sort	statistical methods to account for gene-level covariates in normalization of high-dimensional read-count data
publisher	DigitalCommons@USU
publishDate	2018
url	https://digitalcommons.usu.edu/etd/7392 https://digitalcommons.usu.edu/cgi/viewcontent.cgi?article=8509&context=etd
work_keys_str_mv	AT lenzlaurenholt statisticalmethodstoaccountforgenelevelcovariatesinnormalizationofhighdimensionalreadcountdata
_version_	1719267886850113536


Similar Items