Statistical Methods to Account for Gene-Level Covariates in Normalization of High-Dimensional Read-Count Data

Bibliographic Details
Main Author: Lenz, Lauren Holt
Format: Others
Published: DigitalCommons@USU 2018
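Subjects: RNA-Seq; normalization; gene-level covariates; high-dimensional read-counts; Statistics and Probability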
Online Access: https://digitalcommons.usu.edu/etd/7392
https://digitalcommons.usu.edu/cgi/viewcontent.cgi?article=8509&context=etd
id ndltd-UTAHS-oai-digitalcommons.usu.edu-etd-8509
record_format oai_dc
spelling ndltd-UTAHS-oai-digitalcommons.usu.edu-etd-8509
2019-10-13T06:03:20Z
Statistical Methods to Account for Gene-Level Covariates in Normalization of High-Dimensional Read-Count Data
Lenz, Lauren Holt
2018-12-01T08:00:00Z
text
application/pdf
https://digitalcommons.usu.edu/etd/7392
https://digitalcommons.usu.edu/cgi/viewcontent.cgi?article=8509&context=etd
Copyright for this work is held by the author. Transmission or reproduction of materials protected by copyright beyond that allowed by fair use requires the written permission of the copyright owners. Works not in the public domain cannot be commercially exploited without permission of the copyright owner. Responsibility for any use rests exclusively with the user. For more information contact digitalcommons@usu.edu.
All Graduate Theses and Dissertations
DigitalCommons@USU
RNA-Seq
normalization
gene-level covariates
high-dimensional read-counts
Statistics and Probability
collection NDLTD
format Others
sources NDLTD
topic RNA-Seq
normalization
gene-level covariates
high-dimensional read-counts
Statistics and Probability
description The goal of genetic-based cancer research is often to identify which genes behave differently in cancerous and healthy tissue. This difference in behavior, referred to as differential expression, may lead researchers to more targeted preventative care and treatment. One way to measure the expression of genes is through a process called RNA-Seq, which takes physical tissue samples and maps gene products and fragments in each sample back to the genes that created them, resulting in a large read-count matrix with a row for each gene and a column for each sample. The read counts for tumor and normal samples are then compared in a process called differential expression analysis. However, normalization of these read counts is a necessary pre-processing step to account for differences in read-count values due to non-expression-related variables. Recent RNA-Seq normalization methods commonly also account for gene-level covariates, namely gene length in base pairs and GC-content, the proportion of bases in the gene that are guanine or cytosine. Here, a colorectal cancer RNA-Seq read-count data set comprising 30,220 genes and 378 samples is examined. Two of the normalization methods that account for gene length and GC-content, CQN and EDASeq, are extended to account for protein coding status as a third gene-level covariate. The binary nature of protein coding status results in unique computational issues. The normalized read counts from CQN, EDASeq, and four new normalization methods are used for differential expression analysis via the nonparametric Wilcoxon Rank-Sum Test as well as the lme4 pipeline that produces per-gene models based on a negative binomial distribution. The resulting differential expression results are compared for two genes of interest in colorectal cancer, APC and CTNNB1, both in the WNT signaling pathway.
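Two quantities named in the description can be illustrated with a minimal sketch: GC-content as a proportion of bases, and a two-sided Wilcoxon rank-sum comparison of one gene's normalized counts between tumor and normal samples. This is not the thesis's code. The function names, the toy sequence, and the count values are hypothetical, and the p-value uses the large-sample normal approximation without tie or continuity corrections, which is an assumption made here for brevity.

```python
import math


def gc_content(sequence):
    """Proportion of bases in `sequence` that are G or C (case-insensitive)."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return (seq.count("G") + seq.count("C")) / len(seq)


def midranks(values):
    """1-based ranks of `values`, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    Returns (w, p): w is the sum of the pooled ranks belonging to x.
    Assumption: no tie correction and no continuity correction.
    """
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    w = sum(ranks[:n1])
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return w, p


if __name__ == "__main__":
    # Hypothetical toy inputs, not data from the thesis.
    print(gc_content("ATGCGCGTTA"))
    tumor = [5.1, 6.3, 5.8, 6.9, 6.1]
    normal = [4.2, 4.9, 4.4, 5.0, 4.7]
    print(rank_sum_test(tumor, normal))
```

In practice the test would be repeated once per gene across all rows of the normalized read-count matrix, with a multiple-testing adjustment applied to the resulting p-values.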
author Lenz, Lauren Holt
title Statistical Methods to Account for Gene-Level Covariates in Normalization of High-Dimensional Read-Count Data
publisher DigitalCommons@USU
publishDate 2018
url https://digitalcommons.usu.edu/etd/7392
https://digitalcommons.usu.edu/cgi/viewcontent.cgi?article=8509&context=etd