CONSULT: Accurate contamination removal using locality-sensitive hashing

Bibliographic Details
Main Authors: Bafna, V. (Author), Mirarab, S. (Author), Rachtman, E. (Author)
Format: Article
Language: English
Published: Oxford University Press 2021
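Subjects: article; calculation; contamination; human; library; locality sensitive hashing; memory; mitochondrion; positivity rate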
Online Access: View Fulltext in Publisher (https://doi.org/10.1093/nargab/lqab071)
LEADER 02385nam a2200265Ia 4500
001 10.1093-nargab-lqab071
008 220427s2021 CNT 000 0 und d
020 |a 26319268 (ISSN) 
245 1 0 |a CONSULT: Accurate contamination removal using locality-sensitive hashing 
260 0 |b Oxford University Press  |c 2021 
856 |z View Fulltext in Publisher  |u https://doi.org/10.1093/nargab/lqab071 
520 3 |a A fundamental question appears in many bioinformatics applications: Does a sequencing read belong to a large dataset of genomes from some broad taxonomic group, even when the closest match in the set is evolutionarily divergent from the query? For example, low-coverage genome sequencing (skimming) projects either assemble the organelle genome or compute genomic distances directly from unassembled reads. Using unassembled reads requires contamination detection because samples often include reads from unintended groups of species. Similarly, assembling the organelle genome requires distinguishing organelle reads from nuclear reads. While k-mer-based methods have shown promise in read matching, prior studies have shown that existing methods are insufficiently sensitive for contamination detection. Here, we introduce a new read-matching tool called CONSULT that uses locality-sensitive hashing to test whether k-mers from a query fall within a user-specified distance of the reference dataset. Taking advantage of the large-memory machines now available, CONSULT libraries accommodate tens of thousands of microbial species. Our results show that CONSULT has higher true-positive and lower false-positive rates of contamination detection than leading methods such as Kraken-II and improves distance calculation from genome skims. We also demonstrate that CONSULT can distinguish organelle reads from nuclear reads, leading to dramatic improvements in skim-based mitochondrial assemblies. © 2021 The Author(s). Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics. 
650 0 4 |a article 
650 0 4 |a calculation 
650 0 4 |a contamination 
650 0 4 |a human 
650 0 4 |a library 
650 0 4 |a locality sensitive hashing 
650 0 4 |a memory 
650 0 4 |a mitochondrion 
650 0 4 |a positivity rate 
700 1 |a Bafna, V.  |e author 
700 1 |a Mirarab, S.  |e author 
700 1 |a Rachtman, E.  |e author 
773 |t NAR Genomics and Bioinformatics
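
Illustrative sketch (not part of the catalog record): the abstract describes testing whether k-mers from a query read fall within a user-specified distance of a reference set using locality-sensitive hashing. The Python below is a minimal sketch of that general idea, assuming Hamming distance between k-mers and a bit-sampling LSH scheme (each table keys a k-mer by the characters at a few random positions). The class name BitSamplingIndex, the function names, and every parameter (k, num_tables, positions_per_table, max_dist, min_hits) are hypothetical choices made for illustration; this is not CONSULT's actual implementation, data layout, or defaults.

```python
import random


def kmers(seq, k):
    """Return all overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


class BitSamplingIndex:
    """Toy LSH index: each table keys a k-mer by a random subset of positions."""

    def __init__(self, k, num_tables, positions_per_table, seed=0):
        rng = random.Random(seed)
        self.k = k
        # One random position subset ("mask") per hash table (assumed scheme).
        self.masks = [
            sorted(rng.sample(range(k), positions_per_table))
            for _ in range(num_tables)
        ]
        self.tables = [dict() for _ in range(num_tables)]

    @staticmethod
    def _key(kmer, mask):
        return "".join(kmer[i] for i in mask)

    def add(self, kmer):
        """Store a reference k-mer in every table under its sampled key."""
        for mask, table in zip(self.masks, self.tables):
            table.setdefault(self._key(kmer, mask), set()).add(kmer)

    def within(self, kmer, max_dist):
        """True if a bucket candidate lies within max_dist mismatches.

        Candidates are verified exactly, so this never reports a k-mer that
        is farther than max_dist; it may miss a near k-mer whose sampled
        positions differ in every table.
        """
        for mask, table in zip(self.masks, self.tables):
            for cand in table.get(self._key(kmer, mask), ()):
                if hamming(kmer, cand) <= max_dist:
                    return True
        return False


def read_matches(read, index, max_dist, min_hits=1):
    """Call a read a match once min_hits of its k-mers are near the reference."""
    hits = 0
    for km in kmers(read, index.k):
        if index.within(km, max_dist):
            hits += 1
            if hits >= min_hits:
                return True
    return False


if __name__ == "__main__":
    index = BitSamplingIndex(k=8, num_tables=4, positions_per_table=3, seed=1)
    for km in kmers("ACGTACGTGGCATTAC", 8):
        index.add(km)
    print(read_matches("ACGTACGTGG", index, max_dist=1))
    print(read_matches("TTTTTTTTTTTT", index, max_dist=1))
```

In this sketch an exact k-mer always lands in the same bucket as its stored copy, so exact matches are always found. A k-mer with a few mismatches is found only if at least one table's sampled positions avoid all mismatches, which is why more tables or fewer sampled positions trade memory and candidate checks for sensitivity.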