Identification of useful features for predicting linear B-cell epitopes by machine-learning approaches


Bibliographic Details
Main Authors: Chun-Hung Su, 蘇俊泓
Other Authors: I-Fang Chung
Format: Others
Language: en_US
Published: 2012
Online Access: http://ndltd.ncl.edu.tw/handle/13713044477546031025
Description
Summary: Doctoral dissertation === National Yang-Ming University === Institute of Biomedical Informatics === 100 === B-cell epitopes are antigenic determinants that are recognized and bound by B-cell receptors or antibodies. Synthetic peptides of linear B-cell epitopes can aid the development of peptide vaccines or be used to induce the production of the corresponding antibodies. Over the past three decades, many studies have used amino acid propensities to investigate how they correlate with the location of linear B-cell epitopes. However, these studies achieved neither satisfactory prediction performance nor large-scale analyses to support their results. Although many machine-learning approaches have been applied to linear B-cell epitope prediction since the emergence of linear B-cell epitope databases, no existing method treats a group of related information as a single entity, selects the propensities that are useful for linear B-cell epitopes, and uses them to predict epitopes. To address these problems, we first applied a novel algorithm, Group Feature Selecting Multilayered Perceptron (GFSMLP), with eight widely used amino acid propensities to four data sets. We used GFSMLP to rank the propensities by the frequency with which they were selected. We then used k-means clustering to cluster the selected optimal amino acid propensity and used the result to form amino acid triplets. We calculated the difference in occurrence frequency of each triplet between the positive and negative data sets and then combined these values with the amino acid pair values from a modified version of Chen's AAP approach to encode the epitope sequences. We used both Support Vector Machine (SVM) and Random Forests classifiers for classification, and used a two-level 5-fold cross-validation to find the optimal classifier parameters and obtain an unbiased performance estimate.
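The triplet step described above can be illustrated with a minimal sketch. It is not the thesis implementation. The three-cluster grouping of amino acids, the function names, and the count-times-score feature vector are all assumptions made for illustration. The thesis derives its clusters by k-means on a selected propensity, and the abstract does not specify the exact vector layout.

```python
from collections import Counter
from itertools import product

# Hypothetical clusters (assumption): the thesis obtains its groups by
# k-means on a selected propensity scale, which is not reproduced here.
CLUSTERS = {"a": "AVLIMFWC", "b": "GSTPYNQH", "c": "DEKR"}
RESIDUE_TO_CLUSTER = {res: label for label, members in CLUSTERS.items() for res in members}
ALL_TRIPLETS = ["".join(t) for t in product(sorted(CLUSTERS), repeat=3)]  # 27 triplets


def reduce_sequence(peptide):
    """Replace every residue with the label of its cluster."""
    return "".join(RESIDUE_TO_CLUSTER[res] for res in peptide)


def triplets_of(peptide):
    """Overlapping cluster-label triplets of a peptide."""
    reduced = reduce_sequence(peptide)
    return [reduced[i:i + 3] for i in range(len(reduced) - 2)]


def triplet_frequencies(peptides):
    """Relative occurrence frequency of each triplet in a peptide set."""
    counts = Counter(t for p in peptides for t in triplets_of(p))
    total = sum(counts.values())
    return {t: counts[t] / total if total else 0.0 for t in ALL_TRIPLETS}


def triplet_scores(positives, negatives):
    """Frequency in the positive set minus frequency in the negative set."""
    pos, neg = triplet_frequencies(positives), triplet_frequencies(negatives)
    return {t: pos[t] - neg[t] for t in ALL_TRIPLETS}


def encode(peptide, scores):
    """Fixed-length vector: triplet count in the peptide times its score."""
    counts = Counter(triplets_of(peptide))
    return [counts[t] * scores[t] for t in ALL_TRIPLETS]


# Toy data, invented for the example only.
scores = triplet_scores(["DEKRDE", "KRDEKK"], ["AVLIMF", "GSTPYN"])
vector = encode("DEKAVL", scores)
```

In a pipeline like the one the summary describes, such a vector would be concatenated with the amino acid pair features before being passed to a classifier.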
Based on the GFSMLP results, the selected propensities are indeed good features: they perform stably across the different data sets and enhance the discriminating power for predicting linear B-cell epitopes. Compared with published studies, our modified encoding approaches achieve the best prediction performance to date. The accuracy (77.01%) is an improvement of about 6% in the prediction of linear B-cell epitopes. A graphical-user-interface version of GFSMLP is available at: http://bio.classcloud.org/GFSMLP/.
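The two-level 5-fold cross-validation mentioned in the summary can be sketched as follows. The inner loop selects a parameter using only the outer training data, and the outer loop then measures performance on data that played no part in that selection. The classifier is a deliberately simple stand-in (a cut-off between the two class means, with a tunable shift), not the SVM or Random Forests used in the thesis. All names and the toy data are assumptions.

```python
import random


def k_folds(indices, k):
    """Split a list of indices into k nearly equal folds."""
    return [indices[i::k] for i in range(k)]


def fit_midpoint(xs, ys, idx):
    """'Train' the stand-in classifier: midpoint between the class means."""
    pos = [xs[i] for i in idx if ys[i] == 1]
    neg = [xs[i] for i in idx if ys[i] == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def accuracy(cut, xs, ys, idx):
    """Accuracy of the rule 'predict 1 when x >= cut' on the given indices."""
    return sum((xs[i] >= cut) == (ys[i] == 1) for i in idx) / len(idx)


def nested_cv(xs, ys, shifts, k=5, seed=0):
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    outer_scores = []
    for test_idx in k_folds(indices, k):
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        inner_folds = k_folds(train_idx, k)

        def inner_score(shift):
            # Mean validation accuracy over the inner folds for one candidate.
            total = 0.0
            for val_idx in inner_folds:
                val = set(val_idx)
                fit_idx = [i for i in train_idx if i not in val]
                cut = fit_midpoint(xs, ys, fit_idx) + shift
                total += accuracy(cut, xs, ys, val_idx)
            return total / k

        best_shift = max(shifts, key=inner_score)
        # Refit on the full outer training set, then score the held-out fold.
        cut = fit_midpoint(xs, ys, train_idx) + best_shift
        outer_scores.append(accuracy(cut, xs, ys, test_idx))
    return sum(outer_scores) / len(outer_scores)


# Toy one-dimensional data with partly overlapping classes (invented).
xs = [i / 10 for i in range(20)] + [1.5 + i / 10 for i in range(20)]
ys = [0] * 20 + [1] * 20
estimate = nested_cv(xs, ys, shifts=[-0.5, 0.0, 0.5])
```

Because the held-out fold is never used to choose the parameter, the averaged outer score is not inflated by the tuning step.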