Summary: | Master's === National Taiwan Ocean University === Department of Computer Science and Engineering === 97 === The prediction of protein subcellular localization (PSL) has become a popular field in recent years because it supports protein function prediction and genome annotation, and thus aids drug design. However, experimental methods for determining PSL are often expensive and time-consuming. Therefore, computational prediction of PSL, using information in databases, has become an active field of study. Nevertheless, extracting suitable features from proteins for accurate PSL prediction remains difficult because of the complex structures of proteins. Consequently, to improve prediction performance, several modern PSL prediction systems apply multi-feature protein descriptors and adopt complex hybrid prediction systems to classify and predict PSL. Although these systems achieve outstanding prediction performance, few of them provide protein characteristics or the bases of classification for further analysis. Therefore, this thesis proposes a PSL prediction system, PSL-PR-CPR (Protein Subcellular Localization PredictoR and Characteristic ProvideR), which aims to provide more protein characteristics for analysis.
In the PSL-PR-CPR system, proteins are encoded into feature vectors by a protein descriptor, AAwindow, which uses the Amino Acid Index (AAI) to describe proteins in a simple, easily understood way. To derive a model with high prediction performance, PSL-PR-CPR employs MG-PSO-DS, an evolutionary computation algorithm, to select feature sets suited to the C4.5 classifier for classifying and predicting PSL. MG-PSO-DS is also applied to optimize C4.5 prediction performance by tuning the C4.5 parameters. After constructing the prediction model, PSL-PR-CPR displays the C4.5 decision rules and provides the protein features that assist protein analysis. In addition, PSL-PR-CPR exploits the easily understood nature of AAwindow to show the characteristics of important features within the amino acid sequence, providing further reference information for analysis. To validate prediction performance, two datasets were used at the end of this thesis to compare PSL-PR-CPR with Mycobacterial PSL predictor, Gpos-PLoc, CELLO and LocateP. The two datasets are 852 mycobacterial proteins from the study of Mycobacterial PSL predictor and 452 Gram-positive bacterial proteins from the study of Gpos-PLoc. Five-fold and ten-fold cross-validation are used to validate PSL-PR-CPR performance on the 852 mycobacterial proteins and the 452 Gram-positive bacterial proteins, respectively. PSL-PR-CPR also provides samples of C4.5 decision rules, important features, and characteristics within the amino acid sequence.
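The cross-validation setup described above can be illustrated with a minimal sketch. It assumes a plain shuffled split into near-equal folds; the abstract does not state how proteins were assigned to folds or whether the folds were stratified by localization class, so the splitting scheme, the function names, and the seed below are all hypothetical.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Split range(n_samples) into k shuffled test folds of near-equal size.

    Assumption: a simple unstratified split; the thesis may have used
    a different assignment of proteins to folds.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i,
    # so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


def train_test_splits(n_samples, k, seed=0):
    """Yield (train, test) index lists, using each fold once as the test set."""
    folds = k_fold_indices(n_samples, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Dataset sizes taken from the abstract.
myco_folds = k_fold_indices(852, 5)    # 5-fold on mycobacterial proteins
gpos_folds = k_fold_indices(452, 10)   # 10-fold on Gram-positive proteins
```

Under this scheme every protein appears in exactly one test fold, so each prediction used for scoring comes from a model that did not see that protein during training.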
|