Fast and exact quantification of motif occurrences in biological sequences

Background: Identification of motifs and quantification of their occurrences are important for the study of genetic diseases, gene evolution, transcription sites, and other biological mechanisms. Exact formulae for estimating count distributions of motifs under Markovian assumptions have high comput...

Bibliographic Details
Main Authors:	Boucher, C. (Author), Marini, S. (Author), Prosperi, M. (Author)
Format:	Article
Language:	English
Published:	BioMed Central Ltd 2021
Subjects:	algorithm; Algorithms; Antimicrobial resistances; Arbitrary precision; Bacteria; Bioinformatics; Biological mechanisms; Biological sequences; C++ (programming language); Data handling; Fast implementation; Iterative methods; Markov model; Motif characterization; Motifs; Open source software; Open systems; Open-source solutions; Orders of magnitude; Probability distribution; software; Software; Transcription
Online Access:	View Fulltext in Publisher


LEADER	03512nam a2200433Ia 4500
001	10.1186-s12859-021-04355-6
008	220427s2021 CNT 000 0 und d
020			\|a 14712105 (ISSN)
245	1	0	\|a Fast and exact quantification of motif occurrences in biological sequences
260		0	\|b BioMed Central Ltd \|c 2021
856			\|z View Fulltext in Publisher \|u https://doi.org/10.1186/s12859-021-04355-6
520	3		\|a Background: Identification of motifs and quantification of their occurrences are important for the study of genetic diseases, gene evolution, transcription sites, and other biological mechanisms. Exact formulae for estimating count distributions of motifs under Markovian assumptions have high computational complexity and are impractical for large motif sets. Approximate formulae, e.g. those based on the compound Poisson distribution, are faster, but reliable p-value calculation remains challenging. Here, we introduce ‘motif_prob’, a fast implementation of an exact formula for motif count distribution through progressive approximation with arbitrary precision. Our implementation speeds up the exact calculation, which is usually impractical, making it feasible and poised to replace currently employed heuristics. Results: We implement motif_prob in both Perl and C++, using an efficient error-bound iterative process for the exact formula, and compare it with state-of-the-art tools (e.g. MoSDi) in terms of precision and run time, along with a real-world use case on bacterial motif characterization. Our software is able to process a million motifs (13–31 bases) over genome lengths of 5 million bases within a minute on a regular laptop, and the run times for both the Perl and C++ code are several orders of magnitude smaller (50–1000× faster) than MoSDi, even when MoSDi uses its fast compound Poisson approximation (60–120× faster). In the real-world use cases, we first show the consistency of motif_prob with MoSDi, and then show how p-value quantification is crucial for enrichment quantification when bacteria have different GC content, using motifs found in antimicrobial resistance genes. The software and source code are available under the MIT license at https://github.com/DataIntellSystLab/motif_prob. Conclusions: The motif_prob software is a multi-platform and efficient open source solution for calculating exact frequency distributions of motifs. It can be integrated with motif discovery/characterization tools for quantifying enrichment and deviation from expected frequency ranges with exact p-values, without loss in data processing efficiency. © 2021, The Author(s).
650	0	4	\|a algorithm
650	0	4	\|a Algorithms
650	0	4	\|a Antimicrobial resistances
650	0	4	\|a Arbitrary precision
650	0	4	\|a Bacteria
650	0	4	\|a Bioinformatics
650	0	4	\|a Biological mechanisms
650	0	4	\|a Biological sequences
650	0	4	\|a C++ (programming language)
650	0	4	\|a Data handling
650	0	4	\|a Fast implementation
650	0	4	\|a Iterative methods
650	0	4	\|a Markov model
650	0	4	\|a Motif characterization
650	0	4	\|a Motifs
650	0	4	\|a Open source software
650	0	4	\|a Open systems
650	0	4	\|a Open-source solutions
650	0	4	\|a Orders of magnitude
650	0	4	\|a Probability distribution
650	0	4	\|a software
650	0	4	\|a Software
650	0	4	\|a Transcription
700	1		\|a Boucher, C. \|e author
700	1		\|a Marini, S. \|e author
700	1		\|a Prosperi, M. \|e author
773			\|t BMC Bioinformatics
