Using the longest run subsequence problem within homology-based scaffolding
Abstract Genome assembly is one of the most important problems in computational genomics. Here, we suggest addressing an issue that arises in homology-based scaffolding, that is, when linking and ordering contigs to obtain larger pseudo-chromosomes by means of a second incomplete assembly of a relat...
Main Authors: | Sven Schrinner, Manish Goel, Michael Wulfert, Philipp Spohr, Korbinian Schneeberger, Gunnar W. Klau |
---|---|
Format: | Article |
Language: | English |
Published: | BMC, 2021-06-01 |
Series: | Algorithms for Molecular Biology |
Subjects: | Alignment, Assembly, String algorithm, Longest subsequence |
Online Access: | https://doi.org/10.1186/s13015-021-00191-8 |
id |
doaj-4603bc53f7e9405499ebea0f664c4cd7 |
---|---|
record_format |
Article |
spelling |
doaj-4603bc53f7e9405499ebea0f664c4cd7; 2021-07-04T11:03:17Z; eng; BMC; Algorithms for Molecular Biology; 1748-7188; 2021-06-01; 16; 1; 1; 11; 10.1186/s13015-021-00191-8; Using the longest run subsequence problem within homology-based scaffolding; Sven Schrinner [0], Manish Goel [1], Michael Wulfert [2], Philipp Spohr [3], Korbinian Schneeberger [4], Gunnar W. Klau [5]; Algorithmic Bioinformatics, Heinrich Heine University Düsseldorf; Max Planck Institute for Plant Breeding Research; Heinrich Heine University Düsseldorf; Algorithmic Bioinformatics, Heinrich Heine University Düsseldorf; Cluster of Excellence on Plant Sciences (CEPLAS), Heinrich Heine University Düsseldorf; Algorithmic Bioinformatics, Heinrich Heine University Düsseldorf; Abstract: Genome assembly is one of the most important problems in computational genomics. Here, we suggest addressing an issue that arises in homology-based scaffolding, that is, when linking and ordering contigs to obtain larger pseudo-chromosomes by means of a second incomplete assembly of a related species. The idea is to use alignments of binned regions in one contig to find the most homologous contig in the other assembly. We show that ordering the contigs of the other assembly can be expressed by a new string problem, the longest run subsequence problem (LRS). We show that LRS is NP-hard and present reduction rules and two algorithmic approaches that, together, are able to solve large instances of LRS to provable optimality. All data used in the experiments as well as our source code are freely available. We demonstrate its usefulness within an existing larger scaffolding approach by solving realistic instances resulting from partial Arabidopsis thaliana assemblies in short computation time.; https://doi.org/10.1186/s13015-021-00191-8; Alignment; Assembly; String algorithm; Longest subsequence |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Sven Schrinner; Manish Goel; Michael Wulfert; Philipp Spohr; Korbinian Schneeberger; Gunnar W. Klau |
author_facet |
Sven Schrinner; Manish Goel; Michael Wulfert; Philipp Spohr; Korbinian Schneeberger; Gunnar W. Klau |
author_sort |
Sven Schrinner |
title |
Using the longest run subsequence problem within homology-based scaffolding |
publisher |
BMC |
series |
Algorithms for Molecular Biology |
issn |
1748-7188 |
publishDate |
2021-06-01 |
description |
Genome assembly is one of the most important problems in computational genomics. Here, we suggest addressing an issue that arises in homology-based scaffolding, that is, when linking and ordering contigs to obtain larger pseudo-chromosomes by means of a second incomplete assembly of a related species. The idea is to use alignments of binned regions in one contig to find the most homologous contig in the other assembly. We show that ordering the contigs of the other assembly can be expressed by a new string problem, the longest run subsequence problem (LRS). We show that LRS is NP-hard and present reduction rules and two algorithmic approaches that, together, are able to solve large instances of LRS to provable optimality. All data used in the experiments as well as our source code are freely available. We demonstrate its usefulness within an existing larger scaffolding approach by solving realistic instances resulting from partial Arabidopsis thaliana assemblies in short computation time. |
topic |
Alignment; Assembly; String algorithm; Longest subsequence |
url |
https://doi.org/10.1186/s13015-021-00191-8 |
_version_ |
1721320724394672128 |