Removing contaminants from databases of draft genomes.

Metagenomic sequencing of patient samples is a very promising method for the diagnosis of human infections. Sequencing has the ability to capture all the DNA or RNA from pathogenic organisms in a human sample. However, complete and accurate characterization of the sequence, including identification...


Bibliographic Details
Main Authors: Jennifer Lu, Steven L Salzberg
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2018-06-01
Series: PLoS Computational Biology
Online Access: https://doi.org/10.1371/journal.pcbi.1006277
id doaj-97ce81a3d20246afb15fe906a474706a
record_format Article
spelling doaj-97ce81a3d20246afb15fe906a474706a | 2021-04-21T15:06:32Z | eng | Public Library of Science (PLoS) | PLoS Computational Biology | 1553-734X | 1553-7358 | 2018-06-01 | 14 | 6 | e1006277 | 10.1371/journal.pcbi.1006277 | Removing contaminants from databases of draft genomes. | Jennifer Lu | Steven L Salzberg | https://doi.org/10.1371/journal.pcbi.1006277
collection DOAJ
language English
format Article
sources DOAJ
author Jennifer Lu
Steven L Salzberg
title Removing contaminants from databases of draft genomes.
publisher Public Library of Science (PLoS)
series PLoS Computational Biology
issn 1553-734X
1553-7358
publishDate 2018-06-01
description Metagenomic sequencing of patient samples is a very promising method for the diagnosis of human infections. Sequencing has the ability to capture all the DNA or RNA from pathogenic organisms in a human sample. However, complete and accurate characterization of the sequence, including identification of any pathogens, depends on the availability and quality of genomes for comparison. Thousands of genomes are now available, and as these numbers grow, the power of metagenomic sequencing for diagnosis should increase. However, recent studies have exposed the presence of contamination in published genomes, which when used for diagnosis increases the risk of falsely identifying the wrong pathogen. To address this problem, we have developed a bioinformatics system for eliminating contamination as well as low-complexity genomic sequences in the draft genomes of eukaryotic pathogens. We applied this software to identify and remove human, bacterial, archaeal, and viral sequences present in a comprehensive database of all sequenced eukaryotic pathogen genomes. We also removed low-complexity genomic sequences, another source of false positives. Using this pipeline, we have produced a database of "clean" eukaryotic pathogen genomes for use with bioinformatics classification and analysis tools. We demonstrate that when attempting to find eukaryotic pathogens in metagenomic samples, the new database provides better sensitivity than one using the original genomes while offering a dramatic reduction in false positives.
url https://doi.org/10.1371/journal.pcbi.1006277