Summary: | Protein phosphorylation is an important post-translational modification that regulates a wide range of cellular activities in the human body. Accurate identification of phosphorylation sites can provide new insights into the specific functions of proteins. However, experimental techniques for investigating phosphorylation sites are time-consuming and inefficient, which makes computational approaches an attractive alternative in the big-data era. In this work, we therefore designed a new computational method to identify phosphorylation sites. First, phosphorylation data were collected from human proteins to construct an objective and strict benchmark dataset. Through a series of feature analyses, we found that the distributions of conservation scores and of nine physicochemical properties surrounding phosphorylation sites in positive samples differ significantly from those surrounding non-phosphorylation sites in negative samples. Based on these features, we propose a novel sequence-based method for predicting phosphorylation sites in human proteins, which combines the conservation scores with position-associated attributes that reflect the correlation of physicochemical characteristics among amino acid residues. Furthermore, analysis of variance (ANOVA) was used to select the optimal feature subset, that is, the subset that produced the highest accuracy. Comparison with a published predictor demonstrated the superiority of our predictor. Finally, a user-friendly online tool called iPhoPred was established and is freely available at http://lin-group.cn/server/iPhoPred/. We hope the tool will provide useful guidance for the study of protein phosphorylation.
|