SiNPle: Fast and Sensitive Variant Calling for Deep Sequencing Data

Current high-throughput sequencing technologies can generate sequence data and provide information on the genetic composition of samples at very high coverage. Deep sequencing approaches enable the detection of rare variants in heterogeneous samples, such as viral quasi-species, but also have the un...


Bibliographic Details
Main Authors: Luca Ferretti, Chandana Tennakoon, Adrian Silesian, Graham Freimanis, Paolo Ribeca
Format: Article
Language: English
Published: MDPI AG 2019-07-01
Series: Genes
Subjects: next generation sequencing; low-frequency variants; heterogeneous populations; Bayesian modelling
Online Access: https://www.mdpi.com/2073-4425/10/8/561
id doaj-399a9651bf1c41c29057abafc2452e53
spelling doaj-399a9651bf1c41c29057abafc2452e53 2020-11-25T01:27:00Z eng
MDPI AG, Genes, ISSN 2073-4425, 2019-07-01, volume 10, issue 8, article 561
doi 10.3390/genes10080561 (genes10080561)
SiNPle: Fast and Sensitive Variant Calling for Deep Sequencing Data
Luca Ferretti, Chandana Tennakoon, Adrian Silesian, Graham Freimanis, Paolo Ribeca
Integrative Biology and Bioinformatics, The Pirbright Institute, Woking GU24 0NF, UK (affiliation of all five authors)
https://www.mdpi.com/2073-4425/10/8/561
next generation sequencing; low-frequency variants; heterogeneous populations; Bayesian modelling
collection DOAJ
sources DOAJ
issn 2073-4425
description Current high-throughput sequencing technologies can generate sequence data and provide information on the genetic composition of samples at very high coverage. Deep sequencing approaches enable the detection of rare variants in heterogeneous samples, such as viral quasi-species, but also have the undesired effect of amplifying sequencing errors and artefacts. Distinguishing real variants from such noise is not straightforward. Variant callers that can handle pooled samples can struggle at extremely high read depths, while at lower depths sensitivity is often sacrificed for specificity. In this paper, we propose SiNPle (Simplified Inference of Novel Polymorphisms from Large coveragE), a fast and effective software tool for variant calling. SiNPle is based on a simplified Bayesian approach to compute the posterior probability that a variant is not generated by sequencing errors or PCR artefacts. The Bayesian model takes into consideration individual base qualities as well as their distribution, the baseline error rates during both the sequencing and the PCR stages, the prior distribution of variant frequencies, and their strandedness. Our approach leads to an approximate but extremely fast computation of posterior probabilities even for very high coverage data, since the expression for the posterior distribution is a simple analytical formula in terms of summary statistics for the variants appearing at each site in the genome. These statistics can be used to filter putative SNPs and indels according to the required level of sensitivity. We tested SiNPle on several simulated and real-life viral datasets to show that it is faster and more sensitive than existing methods. The source code for SiNPle is freely available to download and compile, or as a Conda/Bioconda package.
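The idea in the description, a posterior probability that an observed variant is real and not produced by errors, can be illustrated with a toy calculation. The sketch below is not SiNPle's analytical formula, which this record does not give. It assumes a single per-read error rate taken from a mean Phred quality, a binomial read model, a prior on variant frequency proportional to 1/f on a grid, and a fixed prior probability that a site is variant. All function names, parameters and default values are hypothetical, and PCR error rates and strandedness are left out.

```python
import math


def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10.0 ** (-q / 10.0)


def log_kernel(k, n, p):
    """log of p^k * (1-p)^(n-k); the binomial coefficient cancels in the ratio."""
    return k * math.log(p) + (n - k) * math.log1p(-p)


def variant_posterior(k, n, mean_qual=30.0, prior_variant=1e-3,
                      grid=200, f_min=1e-4):
    """Toy posterior that k alternate reads out of n reflect a real variant.

    Assumptions (illustrative only): one error rate for all reads, and a
    prior on the variant frequency f proportional to 1/f between f_min
    and 0.5, approximated by equal weights on a log-spaced grid.
    """
    eps = phred_to_error(mean_qual)

    # Hypothesis "error only": every alternate read is a sequencing error.
    log_err = log_kernel(k, n, eps)

    # Hypothesis "real variant": average the likelihood over the prior on f.
    freqs = [f_min * (0.5 / f_min) ** (i / (grid - 1)) for i in range(grid)]
    terms = [log_kernel(k, n, f + (1.0 - f) * eps) for f in freqs]
    top = max(terms)
    log_var = top + math.log(sum(math.exp(t - top) for t in terms) / grid)

    # Combine prior odds and likelihood ratio, then map log-odds to a
    # probability with a numerically stable logistic function.
    log_odds = (math.log(prior_variant) - math.log1p(-prior_variant)
                + log_var - log_err)
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)
```

Working in log space keeps the calculation stable at very high depth, where the raw likelihoods underflow. In this toy model a site with 500 alternate reads out of 10,000 gets a posterior close to 1, while a site with no alternate reads stays below the prior.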
topic next generation sequencing
low-frequency variants
heterogeneous populations
Bayesian modelling