VGEA: an RNA viral assembly toolkit

Next-generation sequencing (NGS)-based studies have vastly increased our understanding of viral diversity. Viral sequence data obtained from NGS experiments are a rich source of information; they can be used to study viral epidemiology, evolution, and transmission patterns, and can also inform dru...

Bibliographic Details
Main Authors: Paul E. Oluniyi, Fehintola Ajogbasile, Judith Oguzie, Jessica Uwanibe, Adeyemi Kayode, Anise Happi, Alphonsus Ugwu, Testimony Olumade, Olusola Ogunsanya, Philomena Ehiaghe Eromon, Onikepe Folarin, Simon D.W. Frost, Jonathan Heeney, Christian T. Happi
Format: Article
Language: English
Published: PeerJ Inc. 2021-09-01
Series: PeerJ
Subjects: NGS
Online Access: https://peerj.com/articles/12129.pdf
id doaj-f7f11a683951402aabc05068d0fc871c
record_format Article
collection DOAJ
language English
format Article
sources DOAJ
author Paul E. Oluniyi
Fehintola Ajogbasile
Judith Oguzie
Jessica Uwanibe
Adeyemi Kayode
Anise Happi
Alphonsus Ugwu
Testimony Olumade
Olusola Ogunsanya
Philomena Ehiaghe Eromon
Onikepe Folarin
Simon D.W. Frost
Jonathan Heeney
Christian T. Happi
title VGEA: an RNA viral assembly toolkit
publisher PeerJ Inc.
series PeerJ
issn 2167-8359
publishDate 2021-09-01
description Next-generation sequencing (NGS)-based studies have vastly increased our understanding of viral diversity. Viral sequence data obtained from NGS experiments are a rich source of information; they can be used to study viral epidemiology, evolution, and transmission patterns, and can also inform drug and vaccine design. Viral genomes, however, pose a considerable challenge to bioinformatics because of their high mutation rates and their tendency to form quasispecies within a single infected host. This creates a need for advanced bioinformatics tools that assemble consensus genomes representative of the viral population circulating in individual patients. Many tools have been developed to preprocess sequencing reads, carry out de novo or reference-assisted assembly of viral genomes, and assess the quality of the genomes obtained. Most of these tools, however, exist as standalone workflows and usually require substantial computational resources. Here we present VGEA (Viral Genomes Easily Analyzed), a Snakemake workflow for analyzing RNA viral genomes. VGEA enables users to map sequencing reads to the human genome to remove human contaminants, split BAM files into forward and reverse reads, carry out de novo assembly of forward and reverse reads to generate contigs, pre-process reads for quality and contamination, map reads to a reference tailored to the sample using corrected contigs supplemented by the user’s choice of reference sequences, and evaluate and compare genome assemblies. We designed the project with the aim of creating a flexible, easy-to-use, all-in-one pipeline from existing standalone bioinformatics tools for viral genome analysis that can be deployed on a personal computer.
VGEA is built on the Snakemake workflow management system and uses existing tools for each step: fastp (Chen et al., 2018) for read trimming and read-level quality control; BWA (Li & Durbin, 2009) for mapping sequencing reads to the human reference genome; SAMtools (Li et al., 2009) for extracting unmapped reads and for splitting BAM files into FASTQ files; IVA (Hunt et al., 2015) for de novo assembly to generate contigs; shiver (Wymant et al., 2018) to pre-process reads for quality and contamination and then map them to a reference tailored to the sample using corrected contigs supplemented with the user’s choice of existing reference sequences; SeqKit (Shen et al., 2016) for cleaning the shiver assembly for QUAST; QUAST (Gurevich et al., 2013) to assess the quality of genome assemblies; and MultiQC (Ewels et al., 2016) for aggregating the results from fastp, BWA and QUAST. Our pipeline was successfully tested and validated with SARS-CoV-2 (n = 20), HIV-1 (n = 20) and Lassa virus (n = 20) datasets, all of which have been made publicly available. VGEA is freely available on GitHub at https://github.com/pauloluniyi/VGEA under the GNU General Public License.
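The tool chain named in the description can be pictured as an ordered list of steps. The following Python sketch is illustrative only and is not taken from the VGEA repository: the step order simply follows the tool list in the description (the real Snakefile may differ), and the `Step` class, the `host_removal_commands` helper, and all file names are hypothetical. The command lines shown for the host-removal stage use standard BWA and SAMtools options, but the exact invocations VGEA uses are an assumption here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    """One stage of the workflow: the tool and what it is used for."""
    tool: str
    purpose: str


# Order follows the tool list in the record's description (assumed, not
# verified against the actual VGEA Snakefile).
PIPELINE: List[Step] = [
    Step("fastp", "read trimming and read-level quality control"),
    Step("BWA", "map reads to the human reference genome"),
    Step("SAMtools", "extract unmapped reads and split BAM into FASTQ"),
    Step("IVA", "de novo assembly of reads into contigs"),
    Step("shiver", "pre-process reads and map to a sample-tailored reference"),
    Step("SeqKit", "clean the shiver assembly for QUAST"),
    Step("QUAST", "assess the quality of genome assemblies"),
    Step("MultiQC", "aggregate fastp, BWA and QUAST results"),
]


def host_removal_commands(sample: str, human_ref: str) -> List[List[str]]:
    """Build argv lists for a typical host-read removal stage.

    All file names are hypothetical placeholders derived from `sample`.
    """
    r1, r2 = f"{sample}_R1.fastq", f"{sample}_R2.fastq"
    sam = f"{sample}.sam"  # bwa mem writes SAM to stdout; redirect it here
    unmapped = f"{sample}.unmapped.bam"
    return [
        # Align paired reads against the human reference.
        ["bwa", "mem", human_ref, r1, r2],
        # Keep pairs where both mates are unmapped (-f 12), drop secondary
        # alignments (-F 256), and write BAM (-b).
        ["samtools", "view", "-b", "-f", "12", "-F", "256", "-o", unmapped, sam],
        # Split the unmapped BAM back into forward and reverse FASTQ files.
        ["samtools", "fastq", "-1", f"{sample}_clean_R1.fastq",
         "-2", f"{sample}_clean_R2.fastq", unmapped],
    ]


if __name__ == "__main__":
    for number, step in enumerate(PIPELINE, start=1):
        print(f"{number}. {step.tool}: {step.purpose}")
    for argv in host_removal_commands("sample01", "human_ref.fasta"):
        print(" ".join(argv))
```

Returning argument lists rather than shell strings keeps the sketch free of quoting issues; in a Snakemake workflow each of these commands would normally live in its own rule with declared inputs and outputs.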
topic VGEA
NGS
Genome
Assembly
url https://peerj.com/articles/12129.pdf
spelling doaj-f7f11a683951402aabc05068d0fc871c
2021-09-08T15:05:17Z
eng
PeerJ Inc.
PeerJ
2167-8359
2021-09-01
9
e12129
10.7717/peerj.12129
VGEA: an RNA viral assembly toolkit
Paul E. Oluniyi (0): Department of Biological Sciences, Faculty of Natural Sciences, Redeemer’s University, Ede, Osun, Nigeria
Fehintola Ajogbasile (1): Department of Biological Sciences, Faculty of Natural Sciences, Redeemer’s University, Ede, Osun, Nigeria
Judith Oguzie (2): Department of Biological Sciences, Faculty of Natural Sciences, Redeemer’s University, Ede, Osun, Nigeria
Jessica Uwanibe (3): Department of Biological Sciences, Faculty of Natural Sciences, Redeemer’s University, Ede, Osun, Nigeria
Adeyemi Kayode (4): Department of Biological Sciences, Faculty of Natural Sciences, Redeemer’s University, Ede, Osun, Nigeria
Anise Happi (5): African Centre of Excellence for Genomics of Infectious Diseases (ACEGID), Redeemer’s University, Ede, Osun, Nigeria
Alphonsus Ugwu (6): Department of Biological Sciences, Faculty of Natural Sciences, Redeemer’s University, Ede, Osun, Nigeria
Testimony Olumade (7): Department of Biological Sciences, Faculty of Natural Sciences, Redeemer’s University, Ede, Osun, Nigeria
Olusola Ogunsanya (8): Department of Veterinary Pathology, Faculty of Veterinary Medicine, University of Ibadan, Ibadan, Oyo, Nigeria
Philomena Ehiaghe Eromon (9): African Centre of Excellence for Genomics of Infectious Diseases (ACEGID), Redeemer’s University, Ede, Osun, Nigeria
Onikepe Folarin (10): Department of Biological Sciences, Faculty of Natural Sciences, Redeemer’s University, Ede, Osun, Nigeria
Simon D.W. Frost (11): Microsoft Research, Redmond, WA, United States of America
Jonathan Heeney (12): Department of Veterinary Medicine, University of Cambridge, Cambridge, United Kingdom
Christian T. Happi (13): Department of Biological Sciences, Faculty of Natural Sciences, Redeemer’s University, Ede, Osun, Nigeria