Short k-mer abundance profiles yield robust machine learning features and accurate classifiers for RNA viruses

High-throughput sequencing technologies have greatly enabled the study of genomics, transcriptomics and metagenomics. Automated annotation and classification of the vast amounts of generated sequence data have become paramount for facilitating biological sciences. Genomes of viruses can be radically...

Bibliographic Details
Main Authors: Md. Nafis Ul Alam, Umar Faruq Chowdhury, Ruslan Kalendar
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2020-01-01
Series: PLoS ONE
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7500682/?tool=EBI
id doaj-93c39c4a779843f69f86f11d97db183a
collection DOAJ
issn 1932-6203
description High-throughput sequencing technologies have greatly enabled the study of genomics, transcriptomics and metagenomics. Automated annotation and classification of the vast amounts of generated sequence data have become paramount for facilitating biological sciences. Genomes of viruses can be radically different from all life, both in terms of molecular structure and primary sequence. Alignment-based and profile-based searches are commonly employed for characterization of assembled viral contigs from high-throughput sequencing data. Recent attempts have highlighted the use of machine learning models for the task, but these models rely entirely on DNA genomes and, owing to the intrinsic genomic complexity of viruses, RNA viruses have been completely overlooked. Here, we present a novel short k-mer-based sequence scoring method that generates robust sequence information for training machine learning classifiers. We trained 18 classifiers for the task of distinguishing viral RNA from human transcripts. We challenged our models with very stringent testing protocols across different species and evaluated performance against BLASTn, BLASTx and HMMER3 searches. For clean sequence data retrieved from curated databases, our models display near-perfect accuracy, outperforming all similar attempts previously reported. On de novo assemblies of raw RNA-Seq data from cells subjected to Ebola virus, the area under the ROC curve varied from 0.6 to 0.86 depending on the software used for assembly. Our classifier was able to properly classify the majority of the false hits generated by BLAST and HMMER3 searches on the same data. The outstanding performance metrics of our model lay the groundwork for robust machine learning methods for the automated annotation of sequence data.