Context-aware seeds for read mapping

Abstract. Motivation: Most modern seed-and-extend NGS read mappers employ a seeding scheme that requires extracting t non-overlapping seeds in each read in order to find all valid mappings under an edit distance threshold of t. As t grows, this seeding scheme forces mappers to use more and shorter see...


Bibliographic Details
Main Authors: Hongyi Xin, Mingfu Shao, Carl Kingsford
Format: Article
Language: English
Published: BMC 2020-05-01
Series: Algorithms for Molecular Biology
Subjects: Read mapping; Seeds; Error tolerance; Seed and extend
Online Access: http://link.springer.com/article/10.1186/s13015-020-00172-3
id doaj-78746fdbfe3845f38d771556712751e1
record_format Article
spelling doaj-78746fdbfe3845f38d771556712751e1
2020-11-25T03:05:34Z
eng
BMC
Algorithms for Molecular Biology
1748-7188
2020-05-01
151112
10.1186/s13015-020-00172-3
Context-aware seeds for read mapping
Hongyi Xin (Computer Science Department, Carnegie Mellon University)
Mingfu Shao (Department of Computer Science and Engineering, Pennsylvania State University)
Carl Kingsford (Computational Biology Department, Carnegie Mellon University)
collection DOAJ
language English
format Article
sources DOAJ
author Hongyi Xin
Mingfu Shao
Carl Kingsford
title Context-aware seeds for read mapping
publisher BMC
series Algorithms for Molecular Biology
issn 1748-7188
publishDate 2020-05-01
description Abstract. Motivation: Most modern seed-and-extend NGS read mappers employ a seeding scheme that requires extracting t non-overlapping seeds in each read in order to find all valid mappings under an edit distance threshold of t. As t grows, this seeding scheme forces mappers to use more and shorter seeds, which increases the number of seed hits (seed frequencies) and therefore reduces the efficiency of mappers. Results: We propose a novel seeding framework, context-aware seeds (CAS). CAS guarantees finding all valid mappings but uses fewer (and longer) seeds, which reduces seed frequencies and increases the efficiency of mappers. CAS achieves this improvement by attaching a confidence radius to each seed in the reference. We prove that all valid mappings can be found if the sum of the confidence radii of the seeds is greater than t. CAS generalizes the existing pigeonhole-principle-based seeding scheme, in which this confidence radius is implicitly always 1. Moreover, we design an efficient algorithm that constructs the confidence radius database in linear time. We evaluate CAS on the E. coli genome and show that CAS significantly reduces seed frequencies when compared with the state-of-the-art pigeonhole-principle-based seeding algorithm, the Optimal Seed Solver. Availability: https://github.com/Kingsford-Group/CAS_code
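The sufficiency condition stated in the description (all valid mappings are found when the seeds' confidence radii sum to more than t) can be illustrated with a minimal Python sketch. The function names and the example radii below are hypothetical and are not taken from the CAS code. How a confidence radius is actually computed is not described in this record, so the radii are simply supplied as inputs.

```python
from typing import List, Sequence


def radii_guarantee_all_mappings(radii: Sequence[int], t: int) -> bool:
    """Check the condition quoted in the description: the sum of the
    seeds' confidence radii must be strictly greater than t.

    `radii` holds one (hypothetical) confidence radius per seed.
    """
    return sum(radii) > t


def pigeonhole_radii(num_seeds: int) -> List[int]:
    """Radii for the pigeonhole-style special case, in which every
    seed implicitly has confidence radius 1."""
    return [1] * num_seeds


if __name__ == "__main__":
    t = 4  # illustrative edit distance threshold

    # Five radius-1 seeds: 1 + 1 + 1 + 1 + 1 = 5 > 4
    print(radii_guarantee_all_mappings(pigeonhole_radii(5), t))

    # Two seeds with made-up larger radii: 3 + 2 = 5 > 4
    print(radii_guarantee_all_mappings([3, 2], t))

    # Two seeds whose radii are too small: 2 + 1 = 3, not > 4
    print(radii_guarantee_all_mappings([2, 1], t))
```

In this sketch the second call meets the condition with two seeds where the radius-1 case needs five, which matches the description's point that larger radii allow fewer, longer seeds.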
topic Read mapping
Seeds
Error tolerance
Seed and extend
url http://link.springer.com/article/10.1186/s13015-020-00172-3