Predicting Protein Stability Free Energy Change upon Mutations Using Machine Learning Methods


Bibliographic Details
Main Authors: Gan-Lin Chen, 陳甘霖
Other Authors: Eric Y. T. Juan
Format: Others
Language: zh-TW
Published: 2010
Online Access: http://ndltd.ncl.edu.tw/handle/62324447373165856494
Description
Summary: Master's thesis === National Taiwan Ocean University (國立臺灣海洋大學) === Department of Computer Science and Engineering === 98 === A mutation may change the stability of a protein structure, which makes mutation effects an important issue in the study of protein structure. An accurate prediction of the protein stability free energy change (ΔΔG) helps the protein design process and provides a more reliable reference for the study of protein structure. This work uses machine learning methods to predict ΔΔG from the protein sequence, using experimental thermodynamic data sets of mutations. Four methods are used to convert a protein sequence into a feature vector, and several machine learning algorithms are applied, including Decision Trees, Support Vector Machines, Nearest Neighbors, and Random Forests. Five datasets are drawn from the ProTherm database: four (SEQDB, NewDB982, NewDB667 and NewDB1313) for single point mutations and one (DM180) for double point mutations. The methods used in this work are competitive with state-of-the-art systems in prediction accuracy. For single point mutations, ΔΔG is classified into three classes: destabilizing, neutral, and stabilizing. Using 20-fold cross-validation on the SEQDB dataset, an M-AAwindow-based Random Forests classifier achieves an overall accuracy of 73% and a Matthews correlation coefficient (MCC) of 0.53. An M-AAwindow-based Random Forests classifier tested on the NewDB982, NewDB667 and NewDB1313 datasets achieves overall accuracies of 59%, 64% and 64%, respectively. For double point mutations, ΔΔG is classified into two classes: destabilizing and stabilizing. ΔΔG is classified into 12 classes by two models based on C4.5 decision trees, one for the first point mutation and one for the second.
Furthermore, a K-Nearest Neighbors classifier combines the outcomes of the individual models to discriminate between destabilizing and stabilizing mutations, achieving an overall accuracy of 83.3%. The experimental results for both single and double point mutations show that the classifiers based on M-AAwindow perform better in predicting ΔΔG.
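
The summary reports overall accuracy and MCC for a three-class labeling of ΔΔG. The following is a minimal sketch of how such labels and metrics could be computed, not the thesis's own code. It takes MCC to be the multi-class Matthews correlation coefficient. The cutoff `NEUTRAL_BAND`, the sign convention (negative ΔΔG taken as destabilizing), and all function names are assumptions made for illustration; the abstract does not state them.

```python
from collections import Counter
from math import sqrt

# Hypothetical cutoff in kcal/mol; the abstract does not give the thresholds used.
NEUTRAL_BAND = 0.5


def ddg_to_class(ddg, band=NEUTRAL_BAND):
    """Map a ddG value to one of three labels.

    Assumed sign convention: negative ddG means the mutation destabilizes.
    """
    if ddg < -band:
        return "destabilizing"
    if ddg > band:
        return "stabilizing"
    return "neutral"


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def multiclass_mcc(y_true, y_pred):
    """Multi-class Matthews correlation coefficient.

    Uses the count-based form: (c*s - sum_k t_k*p_k) divided by
    sqrt((s^2 - sum_k t_k^2) * (s^2 - sum_k p_k^2)), where s is the number
    of samples, c the number of correct predictions, and t_k, p_k the number
    of times class k occurs in the true and predicted labels.
    """
    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    t_counts = Counter(y_true)
    p_counts = Counter(y_pred)
    labels = set(t_counts) | set(p_counts)
    numerator = c * s - sum(t_counts[k] * p_counts[k] for k in labels)
    true_term = s * s - sum(v * v for v in t_counts.values())
    pred_term = s * s - sum(v * v for v in p_counts.values())
    denominator = sqrt(true_term * pred_term)
    # Return 0.0 when the score is undefined (e.g. only one class predicted).
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Made-up ddG values and predictions, for illustration only.
    measured = [-2.1, -0.2, 1.3, 0.1, -1.0, 0.8]
    y_true = [ddg_to_class(v) for v in measured]
    y_pred = ["destabilizing", "neutral", "neutral",
              "neutral", "destabilizing", "stabilizing"]
    print("accuracy:", accuracy(y_true, y_pred))
    print("MCC:", multiclass_mcc(y_true, y_pred))
```

Reporting MCC next to accuracy is useful when the classes are imbalanced, since a classifier that always predicts the majority class can score a high accuracy but gets an MCC of 0.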