A random forest classifier for detecting rare variants in NGS data from viral populations

We propose a random forest classifier for distinguishing rare variants from sequencing errors in Next Generation Sequencing (NGS) data from viral populations. The method uses counts of k-mers of varying lengths from the reads of a viral population to train a random forest classifier, called MultiRes,...


Bibliographic Details
Main Authors: Raunaq Malhotra, Manjari Jha, Mary Poss, Raj Acharya
Format: Article
Language: English
Published: Elsevier 2017-01-01
Series: Computational and Structural Biotechnology Journal
Online Access: http://www.sciencedirect.com/science/article/pii/S2001037017300399
Volume: 15
Pages: 388-395
Author Affiliations:
Raunaq Malhotra (corresponding author): The School of Electrical Engineering and Computer Science, The Pennsylvania State University, University Park, PA 16802, USA
Manjari Jha: The School of Electrical Engineering and Computer Science, The Pennsylvania State University, University Park, PA 16802, USA
Mary Poss: Department of Biology, The Pennsylvania State University, University Park, PA 16802, USA
Raj Acharya: School of Informatics and Computing, Indiana University, Bloomington, IN 47405, USA
ISSN: 2001-0370
Description: We propose a random forest classifier for distinguishing rare variants from sequencing errors in Next Generation Sequencing (NGS) data from viral populations. The method uses counts of k-mers of varying lengths from the reads of a viral population to train a random forest classifier, called MultiRes, that classifies k-mers as erroneous or as rare variants. Our algorithm is rooted in concepts from signal processing and uses a frame-based representation of k-mers. Frames are sets of non-orthogonal basis functions that were traditionally used in signal processing for noise removal. We define discrete spatial signals for genomes and sequenced reads, and show that k-mers of a given size constitute a frame.

We evaluate MultiRes on simulated and real viral population datasets, which consist of many low-frequency variants, and compare it to the error detection methods used in correction tools known in the literature. MultiRes produces 4 to 500 times fewer false-positive k-mer predictions than other methods, which is essential for accurate estimation of viral population diversity and for de novo assembly of viral populations. It has high recall of the true k-mers, comparable to other error correction methods. MultiRes also has greater than 95% recall for detecting single nucleotide polymorphisms (SNPs) and produces fewer false-positive SNPs, while detecting a higher number of rare variants than other variant calling methods for viral populations. The software is freely available on GitHub at https://github.com/raunaq-m/MultiRes.

Keywords: Sequencing error detection, Reference free methods, Next-generation sequencing, Viral populations, Multi-resolution frames, Random forest classifier
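The idea of describing a k-mer by counts taken at several k-mer lengths can be illustrated with a minimal sketch. It is not the MultiRes implementation: the function names, the toy reads, the sizes 5 and 3, and the choice of the minimum sub-k-mer count as the per-size feature are all assumptions made for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every substring of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def build_tables(reads, sizes):
    """Build one count table per k-mer size (hypothetical helper)."""
    return {k: count_kmers(reads, k) for k in sizes}


def multires_features(kmer, tables):
    """Feature vector for one k-mer of the largest size.

    For each size, from largest to smallest, take the minimum count over
    the sub-k-mers of that size contained in `kmer`. Using the minimum is
    an illustrative assumption, not the feature definition from the paper.
    """
    features = []
    for k in sorted(tables, reverse=True):
        subs = [kmer[i:i + k] for i in range(len(kmer) - k + 1)]
        features.append(min(tables[k][s] for s in subs))
    return features


# Toy reads: the third read differs from the first two at one position.
reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
tables = build_tables(reads, sizes=(5, 3))

print(multires_features("ACGTA", tables))  # [2, 2]
print(multires_features("ACGTT", tables))  # [1, 1]
```

Feature vectors of this kind, paired with labels from a simulated dataset where the true k-mers are known, could then be passed to any random forest implementation, for example scikit-learn's `RandomForestClassifier`.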