Performance improvement for a 2D convolutional neural network by using SSC encoding on protein–protein interaction tasks


Bibliographic Details
Main Authors: Yang Wang, Zhanchao Li, Yanfei Zhang, Yingjun Ma, Qixing Huang, Xingyu Chen, Zong Dai, Xiaoyong Zou
Format: Article
Language: English
Published: BMC 2021-04-01
Series: BMC Bioinformatics
Online Access: https://doi.org/10.1186/s12859-021-04111-w
id doaj-549601f15610456d8d12452c37ec1dad
record_format Article
spelling doaj-549601f15610456d8d12452c37ec1dad; 2021-04-18T11:51:37Z; eng; BMC; BMC Bioinformatics; ISSN 1471-2105; 2021-04-01; volume 22, issue 1, pages 1-16; 10.1186/s12859-021-04111-w
Yang Wang (School of Chemistry, Sun Yat-Sen University)
Zhanchao Li (School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University)
Yanfei Zhang (School of Chemistry, Sun Yat-Sen University)
Yingjun Ma (School of Chemistry, Sun Yat-Sen University)
Qixing Huang (School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University)
Xingyu Chen (School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University)
Zong Dai (School of Chemistry, Sun Yat-Sen University)
Xiaoyong Zou (School of Chemistry, Sun Yat-Sen University)
collection DOAJ
language English
format Article
sources DOAJ
author Yang Wang
Zhanchao Li
Yanfei Zhang
Yingjun Ma
Qixing Huang
Xingyu Chen
Zong Dai
Xiaoyong Zou
title Performance improvement for a 2D convolutional neural network by using SSC encoding on protein–protein interaction tasks
publisher BMC
series BMC Bioinformatics
issn 1471-2105
publishDate 2021-04-01
description Abstract
Background: The interactions of proteins are determined by their sequences and affect the regulation of the cell cycle, signal transduction and metabolism, which is of great significance to modern proteomics research. Despite advances in experimental technology, determining protein–protein interactions (PPIs) is still expensive, laborious, and time-consuming, and there is strong demand for effective bioinformatics approaches to identify potential PPIs. Given the large amount of PPI data, high-performance processors can be used to enhance the capability of deep learning methods and make predictions directly from protein sequences.
Results: We propose the Sequence-Statistics-Content (SSC) protein sequence encoding format, based on information extracted from the original sequence, to further improve the performance of the convolutional neural network. The original protein sequences are encoded in a three-channel format by introducing statistical information (the second channel) and bigram encoding information (the third channel), which increases the number of unique sequence features and enhances the performance of the deep learning model. On protein–protein interaction prediction tasks, the 2D convolutional neural network (2D CNN) with SSC encoding gives better results than the 1D CNN with one-hot encoding. Independent validation on new interactions from the HIPPIE database (version 2.1, published on July 18, 2017) and validation of directly predicted results with a molecular docking tool indicate the effectiveness of the proposed protein encoding improvement in the CNN model.
Conclusion: The proposed protein sequence encoding method is effective at improving the capability of the CNN model on protein sequence-related tasks and may also be effective at enhancing the capability of other machine learning or deep learning methods. Prediction accuracy and molecular docking validation showed considerable improvement compared to the existing one-hot encoding method, indicating that the SSC encoding method may be useful for analyzing protein sequence-related tasks. The source code of the proposed methods is freely available for academic research at https://github.com/wangy496/SSC-format/.
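The abstract names the three channels but does not define them, so the following is only an illustrative sketch of a three-channel sequence encoding, not the authors' implementation (which is in the repository linked above). It assumes, hypothetically, that the first channel holds a scaled residue index, the second the relative frequency of each residue within the sequence, and the third a scaled index of the bigram starting at each position. The function name `encode_three_channel`, the scaling, and the zero padding are all assumptions made for this example.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
# Index 0 is reserved for padding and unknown symbols (an assumption of this sketch).
AA_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}
BIGRAM_INDEX = {
    a + b: i + 1 for i, (a, b) in enumerate(product(AMINO_ACIDS, repeat=2))
}


def encode_three_channel(seq, length):
    """Return three channels, each a list of `length` floats.

    Hypothetical layout: [sequence, statistics, bigram].
    Sequences longer than `length` are truncated; shorter ones are zero-padded.
    """
    seq = seq.upper()[:length]
    n = len(seq)
    counts = Counter(seq)

    # Channel 1 (assumed): residue identity as a scaled integer index.
    ch_seq = [AA_INDEX.get(aa, 0) / len(AMINO_ACIDS) for aa in seq]
    # Channel 2 (assumed): relative frequency of each residue in this sequence.
    ch_stat = [counts[aa] / n for aa in seq]
    # Channel 3 (assumed): scaled index of the bigram starting at each position;
    # the last position has no complete bigram and is encoded as 0.
    ch_bigram = [
        BIGRAM_INDEX.get(seq[i:i + 2], 0) / len(BIGRAM_INDEX) for i in range(n)
    ]

    pad = [0.0] * (length - n)
    return [ch + pad for ch in (ch_seq, ch_stat, ch_bigram)]


channels = encode_three_channel("ACCA", 6)
```

Stacking the three lists gives a 3 x `length` array for one protein; how the channels are reshaped for a 2D CNN and how the two proteins of a pair are combined is not stated in the abstract and is left open here.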
url https://doi.org/10.1186/s12859-021-04111-w