Fast and exact quantification of motif occurrences in biological sequences

Background: Identification of motifs and quantification of their occurrences are important for the study of genetic diseases, gene evolution, transcription sites, and other biological mechanisms. Exact formulae for estimating count distributions of motifs under Markovian assumptions have high comput...

Bibliographic Details
Main Authors: Boucher, C. (Author), Marini, S. (Author), Prosperi, M. (Author)
Format: Article
Language: English
Published: BioMed Central Ltd 2021
Online Access: https://doi.org/10.1186/s12859-021-04355-6 (View Fulltext in Publisher)
LEADER 03512nam a2200433Ia 4500
001 10.1186-s12859-021-04355-6
008 220427s2021 CNT 000 0 und d
020 |a 14712105 (ISSN) 
245 1 0 |a Fast and exact quantification of motif occurrences in biological sequences 
260 0 |b BioMed Central Ltd  |c 2021 
856 |z View Fulltext in Publisher  |u https://doi.org/10.1186/s12859-021-04355-6 
520 3 |a Background: Identification of motifs and quantification of their occurrences are important for the study of genetic diseases, gene evolution, transcription sites, and other biological mechanisms. Exact formulae for estimating count distributions of motifs under Markovian assumptions have high computational complexity and are impractical for large motif sets. Approximate formulae, e.g. based on the compound Poisson distribution, are faster, but reliable p-value calculation remains challenging. Here, we introduce ‘motif_prob’, a fast implementation of an exact formula for motif count distribution through progressive approximation with arbitrary precision. Our implementation speeds up the exact calculation, which is usually impractical, making it feasible and a candidate to replace currently employed heuristics. Results: We implement motif_prob in both Perl and C++, using an efficient error-bounded iterative process for the exact formula, and compare it with state-of-the-art tools (e.g. MoSDi) in terms of precision and run time, along with a real-world use case on bacterial motif characterization. Our software can process a million motifs (13–31 bases) over genome lengths of 5 million bases within a minute on a regular laptop, and the run times for both the Perl and C++ code are up to three orders of magnitude smaller (50–1000× faster) than MoSDi, even when using its fast compound Poisson approximation (60–120× faster). In the real-world use case, we first show that motif_prob is consistent with MoSDi, and then show that p-value quantification is crucial for assessing enrichment when bacteria differ in GC content, using motifs found in antimicrobial resistance genes. The software and source code are available under the MIT license at https://github.com/DataIntellSystLab/motif_prob. Conclusions: The motif_prob software is a multi-platform, efficient, open source solution for calculating exact frequency distributions of motifs. It can be integrated with motif discovery/characterization tools for quantifying enrichment and deviation from expected frequency ranges with exact p-values, without loss in data processing efficiency. © 2021, The Author(s).
650 0 4 |a algorithm 
650 0 4 |a Algorithms 
650 0 4 |a Antimicrobial resistances 
650 0 4 |a Arbitrary precision 
650 0 4 |a Bacteria 
650 0 4 |a Bioinformatics 
650 0 4 |a Biological mechanisms 
650 0 4 |a Biological sequences 
650 0 4 |a C++ (programming language) 
650 0 4 |a Data handling 
650 0 4 |a Fast implementation 
650 0 4 |a Iterative methods 
650 0 4 |a Markov model 
650 0 4 |a Motif characterization 
650 0 4 |a Motifs 
650 0 4 |a Open source software 
650 0 4 |a Open systems 
650 0 4 |a Open-source solutions 
650 0 4 |a Orders of magnitude 
650 0 4 |a Probability distribution 
650 0 4 |a software 
650 0 4 |a Software 
650 0 4 |a Transcription 
700 1 |a Boucher, C.  |e author 
700 1 |a Marini, S.  |e author 
700 1 |a Prosperi, M.  |e author 
773 |t BMC Bioinformatics