|
|
|
|
LEADER |
02385nam a2200265Ia 4500 |
001 |
10.1093-nargab-lqab071 |
008 |
220427s2021 CNT 000 0 und d |
020 |
|
|
|a 26319268 (ISSN)
|
245 |
1 |
0 |
|a CONSULT: Accurate contamination removal using locality-sensitive hashing
|
260 |
|
0 |
|b Oxford University Press
|c 2021
|
856 |
|
|
|z View Fulltext in Publisher
|u https://doi.org/10.1093/nargab/lqab071
|
520 |
3 |
|
|a A fundamental question appears in many bioinformatics applications: Does a sequencing read belong to a large dataset of genomes from some broad taxonomic group, even when the closest match in the set is evolutionarily divergent from the query? For example, low-coverage genome sequencing (skimming) projects either assemble the organelle genome or compute genomic distances directly from unassembled reads. Using unassembled reads requires contamination detection because samples often include reads from unintended groups of species. Similarly, assembling the organelle genome requires distinguishing organelle reads from nuclear reads. While k-mer-based methods have shown promise in read-matching, prior studies have shown that existing methods are insufficiently sensitive for contamination detection. Here, we introduce a new read-matching tool called CONSULT that uses locality-sensitive hashing to test whether k-mers from a query fall within a user-specified distance of the reference dataset. Taking advantage of the large-memory machines now available, CONSULT libraries accommodate tens of thousands of microbial species. Our results show that CONSULT has higher true-positive and lower false-positive rates of contamination detection than leading methods such as Kraken-II and improves distance calculation from genome skims. We also demonstrate that CONSULT can distinguish organelle reads from nuclear reads, leading to dramatic improvements in skim-based mitochondrial assemblies. © 2021 The Author(s). Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.
|
650 |
0 |
4 |
|a article
|
650 |
0 |
4 |
|a calculation
|
650 |
0 |
4 |
|a contamination
|
650 |
0 |
4 |
|a human
|
650 |
0 |
4 |
|a library
|
650 |
0 |
4 |
|a locality sensitive hashing
|
650 |
0 |
4 |
|a memory
|
650 |
0 |
4 |
|a mitochondrion
|
650 |
0 |
4 |
|a positivity rate
|
700 |
1 |
|
|a Bafna, V.
|e author
|
700 |
1 |
|
|a Mirarab, S.
|e author
|
700 |
1 |
|
|a Rachtman, E.
|e author
|
773 |
|
|
|t NAR Genomics and Bioinformatics
|