Detecting and Correcting Batch Effects in High-Throughput Genomic Experiments

Batch effects are due to probe-specific systematic variation between groups of samples (batches) resulting from experimental features that are not of biological interest. Principal components analysis (PCA) is commonly used as a visual tool to determine whether batch effects exist after applying a g...

Bibliographic Details
Main Author: Reese, Sarah
Format: Others
Published: VCU Scholars Compass 2013
Subjects:
Online Access: http://scholarscompass.vcu.edu/etd/3180
http://scholarscompass.vcu.edu/cgi/viewcontent.cgi?article=4179&context=etd
id ndltd-vcu.edu-oai-scholarscompass.vcu.edu-etd-4179
record_format oai_dc
spelling ndltd-vcu.edu-oai-scholarscompass.vcu.edu-etd-4179 2017-03-17T08:27:14Z
2013-04-19T07:00:00Z text application/pdf © The Author Theses and Dissertations
collection NDLTD
format Others
sources NDLTD
topic Biostatistics
Batch Effects
Principal Components Analysis
Bioinformatics
Physical Sciences and Mathematics
Statistics and Probability
description Batch effects are due to probe-specific systematic variation between groups of samples (batches) resulting from experimental features that are not of biological interest. Principal components analysis (PCA) is commonly used as a visual tool to determine whether batch effects exist after applying a global normalization method. However, PCA yields linear combinations of the variables that contribute maximum variance and thus will not necessarily detect batch effects if they are not the largest source of variability in the data. We present an extension of principal components analysis to quantify the existence of batch effects, called guided PCA (gPCA). We describe a test statistic that uses gPCA to test if a batch effect exists. We apply our proposed test statistic derived using gPCA to simulated data and to two copy number variation case studies: the first study consisted of 614 samples from a breast cancer family study using Illumina Human 660 bead-chip arrays whereas the second case study consisted of 703 samples from a family blood pressure study that used Affymetrix SNP Array 6.0. We demonstrate that our statistic has good statistical properties and is able to identify significant batch effects in two copy number variation case studies. We further compare existing batch effect correction methods and apply gPCA to test their effectiveness. We conclude that our novel statistic that utilizes guided principal components analysis to identify whether batch effects exist in high-throughput genomic data is effective. Although our examples pertain to copy number data, gPCA is general and can be used on other data types as well.
author Reese, Sarah
title Detecting and Correcting Batch Effects in High-Throughput Genomic Experiments
publisher VCU Scholars Compass
publishDate 2013