Comparison of sequencing data processing pipelines and application to underrepresented African human populations

Abstract Background Population genetic studies of humans make increasing use of high-throughput sequencing in order to capture diversity in an unbiased way. There is an abundance of sequencing technologies, bioinformatic tools and the available genomes are increasing in number. Studies have evaluate...

Bibliographic Details
Main Authors:	Gwenna Breton, Anna C. V. Johansson, Per Sjödin, Carina M. Schlebusch, Mattias Jakobsson
Format:	Article
Language:	English
Published:	BMC 2021-10-01
Series:	BMC Bioinformatics
Subjects:	Genome Analysis Toolkit (GATK); High-throughput sequencing (HTS); Next generation sequencing (NGS); High coverage genomes; Underrepresented ancestry; Comparison of pipelines
Online Access:	https://doi.org/10.1186/s12859-021-04407-x

id	doaj-4d612689b2e6408c86f15421b8027ec4
record_format	Article
spelling	doaj-4d612689b2e6408c86f15421b8027ec4 | 2021-10-10T11:14:35Z | eng | BMC | BMC Bioinformatics | 1471-2105 | 2021-10-01 | 221124 | 10.1186/s12859-021-04407-x | Comparison of sequencing data processing pipelines and application to underrepresented African human populations | Gwenna Breton [0]; Anna C. V. Johansson [1]; Per Sjödin [2]; Carina M. Schlebusch [3]; Mattias Jakobsson [4] | [0] Human Evolution, Department of Organismal Biology, Evolutionary Biology Centre, Uppsala University; [1] Department of Cell and Molecular Biology, National Bioinformatics Infrastructure Sweden, Science for Life Laboratory, Uppsala University; [2] Human Evolution, Department of Organismal Biology, Evolutionary Biology Centre, Uppsala University; [3] Human Evolution, Department of Organismal Biology, Evolutionary Biology Centre, Uppsala University; [4] Human Evolution, Department of Organismal Biology, Evolutionary Biology Centre, Uppsala University | https://doi.org/10.1186/s12859-021-04407-x | Genome Analysis Toolkit (GATK); High-throughput sequencing (HTS); Next generation sequencing (NGS); High coverage genomes; Underrepresented ancestry; Comparison of pipelines
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Gwenna Breton; Anna C. V. Johansson; Per Sjödin; Carina M. Schlebusch; Mattias Jakobsson
author_sort	Gwenna Breton
title	Comparison of sequencing data processing pipelines and application to underrepresented African human populations
publisher	BMC
series	BMC Bioinformatics
issn	1471-2105
publishDate	2021-10-01
description	Abstract. Background: Population genetic studies of humans make increasing use of high-throughput sequencing in order to capture diversity in an unbiased way. There is an abundance of sequencing technologies and bioinformatic tools, and the number of available genomes is increasing. Studies have evaluated and compared some of these technologies and tools, such as the Genome Analysis Toolkit (GATK) and its “Best Practices” bioinformatic pipelines. However, such studies often focus on a few genomes of Eurasian origin in order to detect technical issues. We instead surveyed the use of the GATK tools and established a pipeline for processing high coverage full genomes from a diverse set of populations, including Sub-Saharan African groups, in order to reveal challenges arising from human diversity and stratification. Results: We surveyed 29 studies using high-throughput sequencing data and compared their strategies for data pre-processing and variant calling. We found that data processing varies widely across studies and that the GATK “Best Practices” are seldom followed strictly. We then compared three versions of a GATK pipeline, differing in the inclusion of an indel realignment step and in a modification of the base quality score recalibration step. We applied the pipelines to a diverse set of 28 individuals and compared them in terms of the count of called variants and the overlap of the callsets. We found that the pipelines resulted in similar callsets, in particular after callset filtering. We also ran one of the pipelines on a larger dataset of 179 individuals. We noted that including more individuals at the joint genotyping step resulted in different counts of variants. At the individual level, we observed that the average genome coverage was correlated with the number of variants called. Conclusions: We conclude that applying the GATK “Best Practices” pipeline, including its recommended reference datasets, to underrepresented populations does not lead to a decrease in the number of called variants compared to alternative pipelines. We recommend aiming for coverage of > 30X if identifying most variants is important, and working with large sample sizes at the variant calling stage, including for underrepresented individuals and populations.
topic	Genome Analysis Toolkit (GATK); High-throughput sequencing (HTS); Next generation sequencing (NGS); High coverage genomes; Underrepresented ancestry; Comparison of pipelines
url	https://doi.org/10.1186/s12859-021-04407-x

Similar Items