Fast motif recognition via application of statistical thresholds

Abstract. Background: Improving the accuracy and efficiency of motif recognition is an important computational challenge that has application to detecting transcription factor binding sites in genomic data. Closely related to motif recognition is the C...

Bibliographic Details
Main Authors: King James, Boucher Christina
Format: Article
Language: English
Published: BMC 2010-01-01
Series: BMC Bioinformatics
id doaj-bcf5b2fee1114f39a6aa3dc5197299f5
record_format Article
spelling doaj-bcf5b2fee1114f39a6aa3dc5197299f5; 2020-11-25T01:37:18Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2010-01-01; 11; Suppl 1; S11; 10.1186/1471-2105-11-S1-S11; Fast motif recognition via application of statistical thresholds; King James; Boucher Christina
Abstract
Background: Improving the accuracy and efficiency of motif recognition is an important computational challenge that has application to detecting transcription factor binding sites in genomic data. Closely related to motif recognition is the CONSENSUS STRING decision problem, which asks, given a parameter d and a set S = {s_1, ..., s_n} of strings of length ℓ, whether there exists a consensus string that has Hamming distance at most d from every string in S. A set of strings S is pairwise bounded if the Hamming distance between any pair of strings in S is at most 2d. Checking whether a set is pairwise bounded is straightforward, and a set cannot have a consensus string unless it is pairwise bounded. We use CONSENSUS STRING to determine whether a pairwise bounded set has a consensus. Unfortunately, CONSENSUS STRING is NP-complete. The lack of an efficient method for solving CONSENSUS STRING has made it a computational bottleneck in MCL-WMR, a motif recognition program capable of solving difficult motif recognition problem instances.
Results: We focus on the development of a method for solving CONSENSUS STRING quickly with a small probability of error. We apply this heuristic to develop a new motif recognition program, sMCL-WMR, which has impressive accuracy and efficiency. We demonstrate the performance of sMCL-WMR in detecting weak motifs in large data sets and in real genomic data sets, and compare its performance to that of other leading motif recognition programs. In our preliminary discussion of our CONSENSUS STRING algorithm we give insight into the issue of sampling pairwise bounded sets, and discuss its relevance to motif recognition.
Conclusion: Our novel heuristic yields a state-of-the-art program, sMCL-WMR, that is capable of detecting weak motifs in data sets with a large number of strings. sMCL-WMR is orders of magnitude faster than its predecessor MCL-WMR and is capable of solving previously unsolved synthetic motif recognition problems. Lastly, sMCL-WMR shows impressive accuracy in detecting transcription factor binding sites in the genomic data used in the assessment of Tompa et al.
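The two definitions in the abstract, pairwise boundedness and the CONSENSUS STRING decision problem, can be illustrated with a minimal Python sketch. This is not the statistical heuristic of the paper and not code from MCL-WMR or sMCL-WMR: the function names, the DNA alphabet "ACGT", and the exhaustive search are all assumptions made for illustration. The brute-force search is exponential in the string length, which is consistent with the problem being NP-complete and is usable only on tiny inputs.

```python
from itertools import combinations, product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def is_pairwise_bounded(strings, d):
    """True if every pair of strings is within Hamming distance 2*d."""
    return all(hamming(a, b) <= 2 * d for a, b in combinations(strings, 2))


def find_consensus_brute_force(strings, d, alphabet="ACGT"):
    """Return a string within distance d of every input, or None.

    Illustrative only: tries all len(alphabet)**length candidates.
    Assumes `strings` is non-empty and all strings share one length.
    """
    length = len(strings[0])
    for candidate in product(alphabet, repeat=length):
        c = "".join(candidate)
        if all(hamming(c, s) <= d for s in strings):
            return c
    return None


# Pairwise bounded for d = 1, and a consensus exists.
with_consensus = ["AAA", "CCA", "ACC"]
# Also pairwise bounded for d = 1, but no string is within
# distance 1 of all three, so the necessary condition is not sufficient.
without_consensus = ["AAA", "CCA", "GGA"]

for s in (with_consensus, without_consensus):
    print(s, is_pairwise_bounded(s, 1), find_consensus_brute_force(s, 1))
```

The second set shows why the pairwise check alone cannot replace a CONSENSUS STRING test: every pair is within distance 2d, yet no consensus string exists.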
collection DOAJ
language English
format Article
sources DOAJ
author King James
Boucher Christina
spellingShingle King James
Boucher Christina
Fast motif recognition via application of statistical thresholds
BMC Bioinformatics
author_facet King James
Boucher Christina
author_sort King James
title Fast motif recognition via application of statistical thresholds
title_short Fast motif recognition via application of statistical thresholds
title_full Fast motif recognition via application of statistical thresholds
title_fullStr Fast motif recognition via application of statistical thresholds
title_full_unstemmed Fast motif recognition via application of statistical thresholds
title_sort fast motif recognition via application of statistical thresholds
publisher BMC
series BMC Bioinformatics
issn 1471-2105
publishDate 2010-01-01
work_keys_str_mv AT kingjames fastmotifrecognitionviaapplicationofstatisticalthresholds
AT boucherchristina fastmotifrecognitionviaapplicationofstatisticalthresholds
_version_ 1725058413757464576