A tree based algorithm for predicting protein-DNA binding cores.


Bibliographic Details
Other Authors: Wong, Po Yuen.
Format: Others
Language: English, Chinese
Published: 2012
Subjects:
Online Access: http://library.cuhk.edu.hk/record=b5549042
http://repository.lib.cuhk.edu.hk/en/item/cuhk-328756
Description
Summary: The study of protein-DNA binding between transcription factors (TFs) and transcription factor binding sites (TFBSs) is an important topic in bioinformatics. Currently, high-resolution (length < 10) TF-TFBS binding cores are discovered by expensive and time-consuming 3D structure experiments. We are therefore motivated to develop a cheap and efficient sequence-based computational method that provides testable novel binding cores with high confidence to accelerate those experiments. Although sequence-based motif discovery algorithms are abundant, few directly address associating TF and TFBS core motifs, both of which are verifiable on 3D structures. Recent association rule mining approaches on low-resolution binding sequences (TF length > 490) have shown promise in identifying accurate binding cores without using any 3D structures; however, the approach has several drawbacks. In this thesis, the problem of predicting protein-DNA binding cores using association rule mining is formally defined, and a novel tree-based algorithm is developed to overcome the disadvantages of the previous approach.
While the previous association rule mining method for this problem addresses exact sequences only, the most recent ad hoc method for approximation does not establish any formal model and is limited by experimentally known patterns.
As biological mutations are common, it is desirable to formally extend the exact model into an approximate one. Thus, we further formalize the problem of mining approximate protein-DNA association rules from sequence data and extend the proposed algorithm to predict approximate protein-DNA binding cores. Experimental results on real data show the performance and applicability of the proposed algorithm in predicting novel TF-TFBS binding cores. Finally, several methods for reducing noise, and thus improving the quality of the mined rules, are proposed, tested, and discussed. In particular, statistical tests are effective at removing noise when the minimum support threshold is small.

Detailed summary in vernacular field only.
Wong, Po Yuen.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2012.
Includes bibliographical references (leaves 126-136).
Abstracts also in Chinese.

Abstract --- p.i
Acknowledgement --- p.vi
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Predicting Protein-DNA Binding Cores --- p.1
Chapter 1.2 --- Contributions --- p.3
Chapter 1.3 --- Thesis Outline --- p.4
Chapter 2 --- Background --- p.6
Chapter 2.1 --- Biological Background --- p.7
Chapter 2.1.1 --- The Central Dogma of Molecular Biology --- p.7
Chapter 2.1.2 --- Transcriptional Regulation --- p.10
Chapter 2.1.3 --- Experiments on studying TF-TFBS bindings --- p.12
Chapter 2.2 --- Computational Background --- p.13
Chapter 2.2.1 --- Motif Discovery --- p.13
Chapter 2.2.2 --- Association Rule Mining --- p.14
Chapter 2.2.3 --- Frequent Pattern Mining --- p.16
Chapter 2.3 --- TF-TFBS Binding Rule Mining in Bioinformatics --- p.17
Chapter 3 --- Mining TF-TFBS Rules --- p.23
Chapter 3.1 --- Introduction --- p.24
Chapter 3.2 --- Problem Definition --- p.25
Chapter 3.3 --- Frequent Sequence Tree (FS-Tree) --- p.31
Chapter 3.3.1 --- Semantic of FS-Tree --- p.31
Chapter 3.3.2 --- Construction of FS-Tree --- p.34
Chapter 3.4 --- The algorithm --- p.40
Chapter 3.4.1 --- Correctness --- p.42
Chapter 3.5 --- Results --- p.44
Chapter 3.5.1 --- Performance --- p.45
Chapter 3.5.2 --- Verification using 3D-Structures --- p.53
Chapter 3.6 --- Discussion and Conclusion --- p.58
Chapter 3.6.1 --- Parameters Setting --- p.59
Chapter 3.6.2 --- Deduplication --- p.60
Chapter 4 --- Extension to Approximate TF-TFBS Rules --- p.63
Chapter 4.1 --- Introduction --- p.65
Chapter 4.2 --- Problem Definition --- p.66
Chapter 4.3 --- Frequent Sequence Class Tree --- p.74
Chapter 4.4 --- The extended algorithm --- p.82
Chapter 4.4.1 --- Correctness --- p.87
Chapter 4.5 --- Results --- p.89
Chapter 4.5.1 --- Performance --- p.89
Chapter 4.5.2 --- Verification using PDB --- p.94
Chapter 4.6 --- Discussion and Conclusion --- p.100
Chapter 5 --- Noise Reducing Methods --- p.102
Chapter 5.1 --- Introduction --- p.103
Chapter 5.2 --- Reducing Noise within a TFBS Group --- p.104
Chapter 5.2.1 --- Using Exact Count Threshold --- p.106
Chapter 5.2.2 --- Using Minimum Support --- p.108
Chapter 5.2.3 --- Using Minimum Approximate Support --- p.110
Chapter 5.3 --- Reducing Noise using Statistical Test --- p.112
Chapter 5.3.1 --- A Simple Model --- p.114
Chapter 5.3.2 --- Statistical Model with Transactions --- p.116
Chapter 5.4 --- Discussion and Conclusion --- p.120
Chapter 6 --- Conclusion --- p.121
Chapter 6.1 --- Conclusion --- p.121
Chapter 6.2 --- Future Work --- p.123
Bibliography --- p.126
Chapter A --- Publications --- p.137
Chapter A.1 --- Publications --- p.137
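
Illustrative note: the summary describes mining association rules between short TF and TFBS subsequences under a minimum support threshold. The Python sketch below shows that general idea only; it is not the thesis's FS-Tree algorithm. It is a brute-force enumeration in which "support" is assumed to mean the number of records containing both k-mers. The function names, the record layout, and the toy sequences are all hypothetical.

```python
from collections import Counter
from itertools import product


def kmers(seq, k):
    """Return the set of distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def mine_pairs(records, k_tf, k_tfbs, min_support):
    """Count, for each (TF k-mer, TFBS k-mer) pair, the records containing both.

    records: list of (tf_sequence, tfbs_sequence) tuples (hypothetical layout).
    Returns {(tf_kmer, tfbs_kmer): support} for pairs with support >= min_support.
    """
    support = Counter()
    for tf_seq, tfbs_seq in records:
        # Sets ensure each record contributes at most once per pair.
        for pair in product(kmers(tf_seq, k_tf), kmers(tfbs_seq, k_tfbs)):
            support[pair] += 1
    return {pair: n for pair, n in support.items() if n >= min_support}


# Made-up toy records, not real TF or TFBS sequences.
toy_records = [
    ("MKRAC", "TTGAC"),
    ("GKRAW", "ATGAC"),
    ("MKLAC", "CCGTA"),
]
rules = mine_pairs(toy_records, k_tf=3, k_tfbs=3, min_support=2)
```

Exhaustive pairing like this grows quickly with sequence length and k, which is the kind of cost a tree-based index over frequent subsequences would aim to reduce.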