Summary: | Master's === National Chiao Tung University === Institute of Biomedical Engineering === 101 === Protein tyrosine sulfation is a common post-translational modification. Identifying tyrosine sulfation sites is important for biologists who want to predict biochemical interactions. However, the features that determine tyrosine sulfation sites are unknown. Moreover, few sulfotyrosine sites have been identified experimentally, and non-sulfotyrosine sites outnumber sulfotyrosine sites by a factor of 26. This thesis presents a prediction method based on a support vector machine (SVM), with amino acid sequences encoded by a pairwise position weighted matrix (PPWM), to predict tyrosine sulfation sites. Because sulfotyrosine sites are far fewer than non-sulfotyrosine sites, we resample the training data to build multiple SVM models, and the final prediction is made by a voting mechanism among those models. A single SVM model achieves an average accuracy of 99.2% under five-fold cross-validation. With voting, the proposed method achieves an accuracy of 98.3% when tested on all known tyrosine sites. In addition, by analyzing the PPWM, we discovered that certain patterns occur more frequently within sulfotyrosine subsequences, such as an acidic amino acid on each side of the tyrosine residue and tryptophan (W) paired with an acidic amino acid. These results may help biologists discover tyrosine sulfation sites.
|
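The resampling-and-voting scheme described in the summary can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the function names (`train_ensemble`, `vote`, `centroid_trainer`), the choice of undersampling the negatives to match the number of positives, the number of models, and the tie-breaking rule are all hypothetical. The base learner here is a nearest-centroid stand-in rather than an SVM, and the inputs are plain numeric vectors rather than PPWM encodings.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence

Sample = Sequence[float]
Model = Callable[[Sample], int]


def train_ensemble(
    positives: List[Sample],
    negatives: List[Sample],
    train_fn: Callable[[List[Sample], List[int]], Model],
    n_models: int = 5,
    seed: int = 0,
) -> List[Model]:
    """Train n_models classifiers, each on all positives plus an equally
    sized random sample of negatives (assumed resampling strategy)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sampled_neg = rng.sample(negatives, len(positives))
        X = list(positives) + sampled_neg
        y = [1] * len(positives) + [0] * len(sampled_neg)
        models.append(train_fn(X, y))
    return models


def vote(models: List[Model], x: Sample) -> int:
    """Return the label predicted by most models (ties go to label 0)."""
    counts = Counter(m(x) for m in models)
    return 1 if counts[1] > counts[0] else 0


def centroid_trainer(X: List[Sample], y: List[int]) -> Model:
    """Stand-in base learner (NOT an SVM): nearest class centroid."""
    def mean(rows: List[Sample]) -> List[float]:
        return [sum(col) / len(rows) for col in zip(*rows)]

    def dist(a: Sample, b: Sample) -> float:
        return sum((p - q) ** 2 for p, q in zip(a, b))

    c1 = mean([x for x, label in zip(X, y) if label == 1])
    c0 = mean([x for x, label in zip(X, y) if label == 0])
    return lambda x: 1 if dist(x, c1) < dist(x, c0) else 0


if __name__ == "__main__":
    # Toy 1-D data with a 1:26 class ratio, mirroring the imbalance above.
    pos = [[2.0], [2.5], [3.0]]
    neg = [[-2.0 - 0.01 * i] for i in range(78)]
    models = train_ensemble(pos, neg, centroid_trainer, n_models=5)
    print(vote(models, [2.0]), vote(models, [-2.0]))
```

To get closer to the described method, `centroid_trainer` could be swapped for any function that fits an SVM on `(X, y)` and returns a predictor; the ensemble and voting code would not need to change.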