Reliability of genomic variants across different next-generation sequencing platforms and bioinformatic processing pipelines

Abstract Background Next Generation Sequencing (NGS) is the fundament of various studies, providing insights into questions from biology and medicine. Nevertheless, integrating data from different experimental backgrounds can introduce strong biases. In order to methodically investigate the magnitud...

Bibliographic Details
Main Authors: Stephan Weißbach, Stanislav Sys, Charlotte Hewel, Hristo Todorov, Susann Schweiger, Jennifer Winter, Markus Pfenninger, Ali Torkamani, Doug Evans, Joachim Burger, Karin Everschor-Sitte, Helen Louise May-Simera, Susanne Gerber
Format: Article
Language: English
Published: BMC 2021-01-01
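Series: BMC Genomics
Subjects: Next-generation sequencing (NGS) technologies; Platform-biases; Healthy aging; Illumina; Wellderly; Longevity
Online Access: https://doi.org/10.1186/s12864-020-07362-8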
id doaj-2b0d301ea0b54cc79c6d546e161fa8bc
record_format Article
spelling doaj-2b0d301ea0b54cc79c6d546e161fa8bc
2021-01-24T12:20:38Z
eng
BMC
BMC Genomics
1471-2164
2021-01-01
Volume 22, Issue 1, Pages 1-15
10.1186/s12864-020-07362-8
Reliability of genomic variants across different next-generation sequencing platforms and bioinformatic processing pipelines
Stephan Weißbach, Institute of Human Genetics, University Medical Center of the Johannes Gutenberg-University Mainz
Stanislav Sys, Institute of Human Genetics, University Medical Center of the Johannes Gutenberg-University Mainz
Charlotte Hewel, Institute of Human Genetics, University Medical Center of the Johannes Gutenberg-University Mainz
Hristo Todorov, Institute of Human Genetics, University Medical Center of the Johannes Gutenberg-University Mainz
Susann Schweiger, Institute of Human Genetics, University Medical Center of the Johannes Gutenberg-University Mainz
Jennifer Winter, Institute of Human Genetics, University Medical Center of the Johannes Gutenberg-University Mainz
Markus Pfenninger, Department of Molecular Ecology, Senckenberg Biodiversity and Climate Research Centre
Ali Torkamani, Department of Integrative Structural and Computational Biology, Scripps Research Translational Institute, California Campus
Doug Evans, Department of Integrative Structural and Computational Biology, Scripps Research Translational Institute, California Campus
Joachim Burger, Institute of Anthropology, Johannes Gutenberg-University Mainz
Karin Everschor-Sitte, Institute of Physics, Johannes Gutenberg-University Mainz
Helen Louise May-Simera, Institute of Molecular Physiology, Johannes Gutenberg-University Mainz
Susanne Gerber, Institute of Human Genetics, University Medical Center of the Johannes Gutenberg-University Mainz
Abstract
Background: Next Generation Sequencing (NGS) is the fundament of various studies, providing insights into questions from biology and medicine. Nevertheless, integrating data from different experimental backgrounds can introduce strong biases. In order to methodically investigate the magnitude of systematic errors in single nucleotide variant calls, we performed a cross-sectional observational study on a genomic cohort of 99 subjects each sequenced via (i) Illumina HiSeq X, (ii) Illumina HiSeq, and (iii) Complete Genomics and processed with the respective bioinformatic pipeline. We also repeated variant calling for the Illumina cohorts with GATK, which allowed us to investigate the effect of the bioinformatics analysis strategy separately from the sequencing platform’s impact.
Results: The number of detected variants/variant classes per individual was highly dependent on the experimental setup. We observed a statistically significant overrepresentation of variants uniquely called by a single setup, indicating potential systematic biases. Insertion/deletion polymorphisms (indels) were associated with decreased concordance compared to single nucleotide polymorphisms (SNPs). The discrepancies in indel absolute numbers were particularly prominent in introns, Alu elements, simple repeats, and regions with medium GC content. Notably, reprocessing sequencing data following the best practice recommendations of GATK considerably improved concordance between the respective setups.
Conclusion: We provide empirical evidence of systematic heterogeneity in variant calls between alternative experimental and data analysis setups. Furthermore, our results demonstrate the benefit of reprocessing genomic data with harmonized pipelines when integrating data from different studies.
https://doi.org/10.1186/s12864-020-07362-8
Next-generation sequencing (NGS) technologies
Platform-biases
Healthy aging
Illumina
Wellderly
Longevity
collection DOAJ
language English
format Article
sources DOAJ
author Stephan Weißbach
Stanislav Sys
Charlotte Hewel
Hristo Todorov
Susann Schweiger
Jennifer Winter
Markus Pfenninger
Ali Torkamani
Doug Evans
Joachim Burger
Karin Everschor-Sitte
Helen Louise May-Simera
Susanne Gerber
author_sort Stephan Weißbach
title Reliability of genomic variants across different next-generation sequencing platforms and bioinformatic processing pipelines
publisher BMC
series BMC Genomics
issn 1471-2164
publishDate 2021-01-01
description Abstract
Background: Next Generation Sequencing (NGS) is the fundament of various studies, providing insights into questions from biology and medicine. Nevertheless, integrating data from different experimental backgrounds can introduce strong biases. In order to methodically investigate the magnitude of systematic errors in single nucleotide variant calls, we performed a cross-sectional observational study on a genomic cohort of 99 subjects each sequenced via (i) Illumina HiSeq X, (ii) Illumina HiSeq, and (iii) Complete Genomics and processed with the respective bioinformatic pipeline. We also repeated variant calling for the Illumina cohorts with GATK, which allowed us to investigate the effect of the bioinformatics analysis strategy separately from the sequencing platform’s impact.
Results: The number of detected variants/variant classes per individual was highly dependent on the experimental setup. We observed a statistically significant overrepresentation of variants uniquely called by a single setup, indicating potential systematic biases. Insertion/deletion polymorphisms (indels) were associated with decreased concordance compared to single nucleotide polymorphisms (SNPs). The discrepancies in indel absolute numbers were particularly prominent in introns, Alu elements, simple repeats, and regions with medium GC content. Notably, reprocessing sequencing data following the best practice recommendations of GATK considerably improved concordance between the respective setups.
Conclusion: We provide empirical evidence of systematic heterogeneity in variant calls between alternative experimental and data analysis setups. Furthermore, our results demonstrate the benefit of reprocessing genomic data with harmonized pipelines when integrating data from different studies.
topic Next-generation sequencing (NGS) technologies
Platform-biases
Healthy aging
Illumina
Wellderly
Longevity
url https://doi.org/10.1186/s12864-020-07362-8