Improved Algorithms for Discovery of Transcription Factor Binding Sites in DNA Sequences

Understanding the mechanisms that regulate gene expression is a major challenge in biology. One of the most important tasks in this challenge is to identify transcription factor binding sites (TFBSs) in DNA sequences. The common representation of these binding sites is called a “motif” and the dis...


Bibliographic Details
Main Author: Zhao, Xiaoyan
Other Authors: Sze, Sing-Hoi
Format: Others
Language: en_US
Published: 2012
Subjects: Computational Biology; Motif finding; Transcription
Online Access: http://hdl.handle.net/1969.1/ETD-TAMU-2010-12-8834
id ndltd-tamu.edu-oai-repository.tamu.edu-1969.1-ETD-TAMU-2010-12-8834
record_format oai_dc
spelling ndltd-tamu.edu-oai-repository.tamu.edu-1969.1-ETD-TAMU-2010-12-8834
2013-01-08T10:42:55Z
2012-02-14T22:18:37Z
2012-02-16T16:14:40Z
2010-12
2012-02-14
December 2010
thesis
text
application/pdf
collection NDLTD
language en_US
format Others
sources NDLTD
topic Computational Biology
Motif finding
Transcription
description Understanding the mechanisms that regulate gene expression is a major challenge in biology. One of the most important tasks in this challenge is to identify transcription factor binding sites (TFBSs) in DNA sequences. The common representation of these binding sites is called a “motif”, and the TFBS discovery problem is also referred to as the motif finding problem in computer science. Despite extensive efforts in the past decade, none of the existing algorithms performs very well. This dissertation focuses on this difficult problem and proposes three new methods: MotifEnumerator, PosMotif, and Enrich. MotifEnumerator is an improved pattern-driven algorithm that detects the optimal motif with lower time complexity than traditional exact pattern-driven approaches. The strategy is further extended to allow arbitrary don’t-care positions within a motif without much decrease in the motif lengths that remain solvable. Its performance is comparable to that of the best existing motif finding algorithms on a large benchmark set of samples. PosMotif, which adds post-processing, uses a string representation that allows arbitrary ignored positions within the non-conserved portion of single motifs, and uses Markov chains to model the background distributions of motifs of a given length while skipping these positions within each Markov chain. Two post-processing steps that take redundancy information into account are applied in this algorithm. PosMotif demonstrates improved performance compared to the five best existing motif finding algorithms on several large benchmark sets of samples. The third method, Enrich, is proposed to improve the performance of general motif finding algorithms by adding more sequences to the samples in the existing benchmark datasets. Five well-known motif finding algorithms were run on both the original and the enriched datasets, and the performance comparisons show a marked overall improvement on the enriched datasets.
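For readers unfamiliar with the term, the "traditional exact pattern-driven approach" that the description uses as a baseline can be sketched roughly as follows. This is a minimal illustration of the textbook exhaustive method, not the dissertation's MotifEnumerator: the function names, the use of `N` as a don't-care symbol, and the mismatch threshold `d` are all assumptions made for this example.

```python
from itertools import product


def mismatches(pattern, window):
    """Count mismatching positions; 'N' in the pattern is an assumed don't-care symbol."""
    return sum(1 for p, w in zip(pattern, window) if p != "N" and p != w)


def best_distance(pattern, seq):
    """Smallest mismatch count between the pattern and any window of seq."""
    k = len(pattern)
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    # If seq is shorter than the pattern there are no windows; treat as k mismatches.
    return min((mismatches(pattern, w) for w in windows), default=k)


def pattern_driven_search(seqs, k, d, alphabet="ACGT"):
    """Enumerate every length-k pattern over the alphabet and keep those
    that occur with at most d mismatches in every input sequence."""
    hits = []
    for letters in product(alphabet, repeat=k):
        pattern = "".join(letters)
        if all(best_distance(pattern, s) <= d for s in seqs):
            hits.append(pattern)
    return hits
```

For example, `pattern_driven_search(["TTACGTAA", "GGACGTCC", "ACGTTTTT"], 4, 0)` returns `["ACGT"]`, the only 4-mer present in all three sequences. The search visits all 4^k candidate patterns, which is why exhaustive enumeration is practical only for short motifs and why reducing this cost is a natural target for improvement.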
author2 Sze, Sing-Hoi
author_facet Sze, Sing-Hoi
Zhao, Xiaoyan
author Zhao, Xiaoyan
author_sort Zhao, Xiaoyan
title Improved Algorithms for Discovery of Transcription Factor Binding Sites in DNA Sequences
publishDate 2012
url http://hdl.handle.net/1969.1/ETD-TAMU-2010-12-8834