Using the longest run subsequence problem within homology-based scaffolding

Abstract Genome assembly is one of the most important problems in computational genomics. Here, we suggest addressing an issue that arises in homology-based scaffolding, that is, when linking and ordering contigs to obtain larger pseudo-chromosomes by means of a second incomplete assembly of a relat...

Bibliographic Details
Main Authors: Sven Schrinner, Manish Goel, Michael Wulfert, Philipp Spohr, Korbinian Schneeberger, Gunnar W. Klau
Format: Article
Language: English
Published: BMC 2021-06-01
Series: Algorithms for Molecular Biology
Subjects: Alignment; Assembly; String algorithm; Longest subsequence
Online Access: https://doi.org/10.1186/s13015-021-00191-8
id doaj-4603bc53f7e9405499ebea0f664c4cd7
record_format Article
spelling doaj-4603bc53f7e9405499ebea0f664c4cd7
2021-07-04T11:03:17Z
eng
BMC
Algorithms for Molecular Biology
1748-7188
2021-06-01
161111
10.1186/s13015-021-00191-8
Using the longest run subsequence problem within homology-based scaffolding
Sven Schrinner (0)
Manish Goel (1)
Michael Wulfert (2)
Philipp Spohr (3)
Korbinian Schneeberger (4)
Gunnar W. Klau (5)
Algorithmic Bioinformatics, Heinrich Heine University Düsseldorf
Max Planck Institute for Plant Breeding Research
Heinrich Heine University Düsseldorf
Algorithmic Bioinformatics, Heinrich Heine University Düsseldorf
Cluster of Excellence on Plant Sciences (CEPLAS), Heinrich Heine University Düsseldorf
Algorithmic Bioinformatics, Heinrich Heine University Düsseldorf
https://doi.org/10.1186/s13015-021-00191-8
Alignment
Assembly
String algorithm
Longest subsequence
collection DOAJ
sources DOAJ
issn 1748-7188
description Abstract Genome assembly is one of the most important problems in computational genomics. Here, we suggest addressing an issue that arises in homology-based scaffolding, that is, when linking and ordering contigs to obtain larger pseudo-chromosomes by means of a second incomplete assembly of a related species. The idea is to use alignments of binned regions in one contig to find the most homologous contig in the other assembly. We show that ordering the contigs of the other assembly can be expressed by a new string problem, the longest run subsequence problem (LRS). We show that LRS is NP-hard and present reduction rules and two algorithmic approaches that, together, are able to solve large instances of LRS to provable optimality. All data used in the experiments as well as our source code are freely available. We demonstrate its usefulness within an existing larger scaffolding approach by solving realistic instances resulting from partial Arabidopsis thaliana assemblies in short computation time.
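The string problem named in the description can be illustrated with a small sketch. It assumes the usual definition of LRS, which this record does not spell out: given a string, find a longest subsequence in which every symbol occurs in at most one run of consecutive positions. In the scaffolding setting, each position would presumably stand for one binned region and each symbol for the contig it aligns to best. The function name and the subset-based dynamic program below are illustrative choices, not the authors' reduction rules or algorithms, and the running time is exponential in the number of distinct symbols.

```python
def longest_run_subsequence_length(s):
    """Length of a longest subsequence of s in which each symbol forms one run.

    Illustrative exact solver (hypothetical helper, not the published method).
    State: (set of symbols already used, symbol of the current run).
    """
    # best[(used, last)] = longest valid subsequence of the prefix read so far
    # that uses exactly the symbols in `used` and currently ends with `last`.
    best = {(frozenset(), None): 0}
    for c in s:
        updated = dict(best)  # skipping c keeps every existing state
        for (used, last), length in best.items():
            if c == last:
                key = (used, last)        # extend the current run
            elif c not in used:
                key = (used | {c}, c)     # open a new run for an unused symbol
            else:
                continue                  # c already had its run; cannot reuse
            if updated.get(key, -1) < length + 1:
                updated[key] = length + 1
        best = updated
    return max(best.values())


if __name__ == "__main__":
    # In "abab", one symbol must give up an occurrence, e.g. "aab" or "abb".
    print(longest_run_subsequence_length("abab"))
```

Keeping the set of used symbols in the state is what prevents a symbol from reappearing after its run has ended; it is also why this sketch only scales to small alphabets.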
topic Alignment
Assembly
String algorithm
Longest subsequence