Summary: | Master's === National Taiwan University === Graduate Institute of Computer Science and Information Engineering === 99 === Proteins that bind specific DNA sequences play important roles in regulating gene expression. Identifying the target sequences of a DNA-binding protein helps to explain how genes are regulated in cells and how genetic variations disrupt normal gene expression. Position frequency matrices (PFMs) are among the most widely used models for representing such target sequences. To date, however, only a small fraction of the transcription factors (TFs) in most species have experimentally determined PFMs. Because biological experiments are usually time-consuming and costly, computational methods with satisfactory accuracy are strongly desired to speed up progress. Here, a new method is proposed to predict the quantitative specificities of protein-DNA interactions; it is based on existing protein-DNA complex structures and a knowledge base of contact preferences between amino acids and nucleotides. Given a query protein sequence, a protein-DNA complex structure of a homologous protein is selected as a template, and the PFM is predicted from the selected template in combination with the knowledge base.
The proposed method is evaluated on two datasets and compared with existing computational methods. It predicts as well as the structure-based methods it is compared with. When compared with a sequence-based method trained on collected experimentally determined PFMs, however, the proposed method performs slightly worse. Even so, the proposed method remains valuable, since different predictors usually have their own advantages and limitations. In summary, a DNA-binding protein’s binding preference can be predicted from its primary structure using the complexes of its homologues. This facilitates future studies, because the target sequences of proteins without a solved structure can now be predicted.
|
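As a brief illustration of the position frequency matrices (PFMs) mentioned in the summary, the following minimal Python sketch counts nucleotides per position across a set of aligned binding sites and normalizes each column. It is not the prediction method of the thesis, which derives PFMs from template structures and a knowledge base. The function names `build_pfm` and `to_frequencies`, the optional pseudocount, and the example sites are all hypothetical and chosen only for illustration.

```python
NUCLEOTIDES = "ACGT"


def build_pfm(sites):
    """Count each nucleotide at each position of equal-length aligned sites."""
    if not sites:
        raise ValueError("at least one binding site is required")
    length = len(sites[0])
    pfm = {base: [0] * length for base in NUCLEOTIDES}
    for site in sites:
        site = site.upper()
        if len(site) != length:
            raise ValueError("all binding sites must have the same length")
        for pos, base in enumerate(site):
            if base not in pfm:
                raise ValueError(f"unexpected symbol {base!r} in site {site!r}")
            pfm[base][pos] += 1
    return pfm


def to_frequencies(pfm, pseudocount=0.0):
    """Turn column counts into relative frequencies (assumed normalization)."""
    length = len(pfm["A"])
    freqs = {base: [] for base in NUCLEOTIDES}
    for pos in range(length):
        # Each column total includes one pseudocount per nucleotide.
        total = sum(pfm[b][pos] for b in NUCLEOTIDES) + 4 * pseudocount
        for b in NUCLEOTIDES:
            freqs[b].append((pfm[b][pos] + pseudocount) / total)
    return freqs


if __name__ == "__main__":
    # Hypothetical aligned target sequences, not data from the thesis.
    example_sites = ["ACGT", "ACGA", "TCGT"]
    counts = build_pfm(example_sites)
    for base in NUCLEOTIDES:
        print(base, counts[base])
```

Each column of the count matrix sums to the number of aligned sites, so a predicted and an experimentally determined PFM can be compared position by position once both are normalized.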