Statistical Methods to Account for Gene-Level Covariates in Normalization of High-Dimensional Read-Count Data
The goal of genetic-based cancer research is often to identify which genes behave differently in cancerous and healthy tissue. This difference in behavior, referred to as differential expression, may lead researchers to more targeted preventative care and treatment. One way to measure the expression...
Main Author: | Lenz, Lauren Holt |
---|---|
Format: | Others |
Published: | DigitalCommons@USU 2018 |
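Subjects: | RNA-Seq normalization gene-level covariates high-dimensional read-counts Statistics and Probability |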
Online Access: | https://digitalcommons.usu.edu/etd/7392 https://digitalcommons.usu.edu/cgi/viewcontent.cgi?article=8509&context=etd |
id |
ndltd-UTAHS-oai-digitalcommons.usu.edu-etd-8509 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-UTAHS-oai-digitalcommons.usu.edu-etd-8509 2019-10-13T06:03:20Z Statistical Methods to Account for Gene-Level Covariates in Normalization of High-Dimensional Read-Count Data Lenz, Lauren Holt 2018-12-01T08:00:00Z text application/pdf https://digitalcommons.usu.edu/etd/7392 https://digitalcommons.usu.edu/cgi/viewcontent.cgi?article=8509&context=etd Copyright for this work is held by the author. Transmission or reproduction of materials protected by copyright beyond that allowed by fair use requires the written permission of the copyright owners. Works not in the public domain cannot be commercially exploited without permission of the copyright owner. Responsibility for any use rests exclusively with the user. For more information contact digitalcommons@usu.edu. All Graduate Theses and Dissertations DigitalCommons@USU RNA-Seq normalization gene-level covariates high-dimensional read-counts Statistics and Probability |
collection |
NDLTD |
format |
Others |
sources |
NDLTD |
topic |
RNA-Seq normalization gene-level covariates high-dimensional read-counts Statistics and Probability |
description |
The goal of genetic-based cancer research is often to identify which genes behave differently in cancerous and healthy tissue. This difference in behavior, referred to as differential expression, may lead researchers to more targeted preventative care and treatment. One way to measure the expression of genes is through a process called RNA-Seq, which takes physical tissue samples and maps gene products and fragments in each sample back to the genes that produced them, resulting in a large read-count matrix with a row for each gene and a column for each sample. The read counts for tumor and normal samples are then compared in a process called differential expression analysis. However, normalization of these read counts is a necessary pre-processing step to account for differences in read-count values due to variables unrelated to expression. Recent RNA-Seq normalization methods commonly also account for gene-level covariates, namely gene length in base pairs and GC-content, the proportion of bases in the gene that are guanine or cytosine.
Here a colorectal cancer RNA-Seq read-count data set comprising 30,220 genes and 378 samples is examined. Two of the normalization methods that account for gene length and GC-content, CQN and EDASeq, are extended to account for protein coding status as a third gene-level covariate. The binary nature of protein coding status creates unique computational issues. The normalized read counts from CQN, EDASeq, and four new normalization methods are used for differential expression analysis via the nonparametric Wilcoxon rank-sum test as well as the lme4 pipeline, which produces per-gene models based on a negative binomial distribution. The resulting differential expression results are compared for two genes of interest in colorectal cancer, APC and CTNNB1, both of the WNT signaling pathway. |
author |
Lenz, Lauren Holt |
title |
Statistical Methods to Account for Gene-Level Covariates in Normalization of High-Dimensional Read-Count Data |
publisher |
DigitalCommons@USU |
publishDate |
2018 |
url |
https://digitalcommons.usu.edu/etd/7392 https://digitalcommons.usu.edu/cgi/viewcontent.cgi?article=8509&context=etd |
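Two quantities named in the abstract can be made concrete with a short example: GC-content (the proportion of bases that are guanine or cytosine) and the Wilcoxon rank-sum comparison of one gene's read counts between tumor and normal samples. The following is a minimal Python sketch for illustration only. It is not the thesis's pipeline, which relies on CQN, EDASeq, and lme4; the function names are hypothetical, the p-value uses a plain normal approximation without tie or continuity correction, and the numbers in the usage note are invented toy values, not data from the study.

```python
from math import erfc, sqrt


def gc_content(sequence):
    """Proportion of bases in `sequence` that are G or C (case-insensitive)."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must be non-empty")
    return sum(base in "GC" for base in seq) / len(seq)


def midranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def rank_sum_test(tumor, normal):
    """Wilcoxon rank-sum (Mann-Whitney U) test for one gene.

    Returns (U statistic for the tumor group, approximate two-sided p-value).
    Assumption: normal approximation with no tie or continuity correction.
    """
    n1, n2 = len(tumor), len(normal)
    ranks = midranks(list(tumor) + list(normal))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    return u, erfc(abs(z) / sqrt(2))
```

For example, `gc_content("ATGCGC")` counts four G or C bases out of six. Calling `rank_sum_test([12.1, 13.4, 11.8, 14.0], [9.7, 10.2, 10.9, 9.9])` on those invented normalized counts returns the largest possible U for two groups of four, because every tumor value exceeds every normal value. In a real analysis this test would be repeated for each row of the read-count matrix, followed by a multiple-testing adjustment.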