Scalable long read self-correction and assembly polishing with multiple sequence alignment

Abstract Third-generation sequencing technologies make it possible to sequence long reads of tens of kbp, which are expected to solve various problems. However, they display high error rates, currently capped around 10%. Self-correction is thus regularly used in long-read analysis projects. We introduce CONSEN...

Bibliographic Details
Main Authors:	Pierre Morisse, Camille Marchet, Antoine Limasset, Thierry Lecroq, Arnaud Lefebvre
Format:	Article
Language:	English
Published:	Nature Publishing Group 2021-01-01
Series:	Scientific Reports
Online Access:	https://doi.org/10.1038/s41598-020-80757-5

id	doaj-fcd66fffea6c4c26a56a20d13c28f227
record_format	Article
spelling	doaj-fcd66fffea6c4c26a56a20d13c28f227; 2021-01-17T12:39:21Z; eng; Nature Publishing Group; Scientific Reports; 2045-2322; 2021-01-01; 11; 1; 1; 13; 10.1038/s41598-020-80757-5; Scalable long read self-correction and assembly polishing with multiple sequence alignment; Pierre Morisse (Univ Rennes, Inria, CNRS, IRISA); Camille Marchet (Univ. Lille, CNRS, UMR 9189 - CRIStAL); Antoine Limasset (Univ. Lille, CNRS, UMR 9189 - CRIStAL); Thierry Lecroq (Normandie Univ, UNIROUEN, LITIS); Arnaud Lefebvre (Normandie Univ, UNIROUEN, LITIS); Abstract Third-generation sequencing technologies allow to sequence long reads of tens of kbp, that are expected to solve various problems. However, they display high error rates, currently capped around 10%. Self-correction is thus regularly used in long reads analysis projects. We introduce CONSENT, a new self-correction method that relies both on multiple sequence alignment and local de Bruijn graphs. To ensure scalability, multiple sequence alignment computation benefits from a new and efficient segmentation strategy, allowing a massive speedup. CONSENT compares well to the state-of-the-art, and performs better on real Oxford Nanopore data. Specifically, CONSENT is the only method that efficiently scales to ultra-long reads, and allows to process a full human dataset, containing reads reaching up to 1.5 Mbp, in 10 days. Moreover, our experiments show that error correction with CONSENT improves the quality of Flye assemblies. Additionally, CONSENT implements a polishing feature, allowing to correct raw assemblies. Our experiments show that CONSENT is 2-38x times faster than other polishing tools, while providing comparable results. Furthermore, we show that, on a human dataset, assembling the raw data and polishing the assembly is less resource consuming than correcting and then assembling the reads, while providing better results. CONSENT is available at https://github.com/morispi/CONSENT.; https://doi.org/10.1038/s41598-020-80757-5
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Pierre Morisse Camille Marchet Antoine Limasset Thierry Lecroq Arnaud Lefebvre
title	Scalable long read self-correction and assembly polishing with multiple sequence alignment
publisher	Nature Publishing Group
series	Scientific Reports
issn	2045-2322
publishDate	2021-01-01
description	Abstract Third-generation sequencing technologies make it possible to sequence long reads of tens of kbp, which are expected to solve various problems. However, they display high error rates, currently capped around 10%. Self-correction is thus regularly used in long-read analysis projects. We introduce CONSENT, a new self-correction method that relies on both multiple sequence alignment and local de Bruijn graphs. To ensure scalability, multiple sequence alignment computation benefits from a new and efficient segmentation strategy, allowing a massive speedup. CONSENT compares well to the state of the art, and performs better on real Oxford Nanopore data. Specifically, CONSENT is the only method that efficiently scales to ultra-long reads, and can process a full human dataset, containing reads reaching up to 1.5 Mbp, in 10 days. Moreover, our experiments show that error correction with CONSENT improves the quality of Flye assemblies. Additionally, CONSENT implements a polishing feature, allowing raw assemblies to be corrected. Our experiments show that CONSENT is 2 to 38 times faster than other polishing tools, while providing comparable results. Furthermore, we show that, on a human dataset, assembling the raw data and polishing the assembly is less resource-consuming than correcting and then assembling the reads, while providing better results. CONSENT is available at https://github.com/morispi/CONSENT.
url	https://doi.org/10.1038/s41598-020-80757-5
_version_	1724334569293873152

Similar Items