Performance improvement for a 2D convolutional neural network by using SSC encoding on protein–protein interaction tasks
Abstract. Background: The interactions of proteins are determined by their sequences and affect the regulation of the cell cycle, signal transduction and metabolism, which is of extraordinary significance to modern proteomics research. Despite advances in experimental technology, it is still expensive...
Main Authors: | Yang Wang, Zhanchao Li, Yanfei Zhang, Yingjun Ma, Qixing Huang, Xingyu Chen, Zong Dai, Xiaoyong Zou |
---|---|
Format: | Article |
Language: | English |
Published: | BMC, 2021-04-01 |
Series: | BMC Bioinformatics |
Online Access: | https://doi.org/10.1186/s12859-021-04111-w |
id |
doaj-549601f15610456d8d12452c37ec1dad |
---|---|
record_format |
Article |
spelling |
doaj-549601f15610456d8d12452c37ec1dad, 2021-04-18T11:51:37Z, eng. BMC. BMC Bioinformatics, ISSN 1471-2105, 2021-04-01, vol. 22, no. 1, pp. 1-16. https://doi.org/10.1186/s12859-021-04111-w
Authors and affiliations: Yang Wang (School of Chemistry, Sun Yat-Sen University); Zhanchao Li (School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University); Yanfei Zhang (School of Chemistry, Sun Yat-Sen University); Yingjun Ma (School of Chemistry, Sun Yat-Sen University); Qixing Huang (School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University); Xingyu Chen (School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University); Zong Dai (School of Chemistry, Sun Yat-Sen University); Xiaoyong Zou (School of Chemistry, Sun Yat-Sen University) |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Yang Wang Zhanchao Li Yanfei Zhang Yingjun Ma Qixing Huang Xingyu Chen Zong Dai Xiaoyong Zou |
title |
Performance improvement for a 2D convolutional neural network by using SSC encoding on protein–protein interaction tasks |
publisher |
BMC |
series |
BMC Bioinformatics |
issn |
1471-2105 |
publishDate |
2021-04-01 |
description |
Abstract. Background: Protein interactions are determined by protein sequences and affect the regulation of the cell cycle, signal transduction, and metabolism, which makes them highly significant to modern proteomics research. Despite advances in experimental technology, determining protein–protein interactions (PPIs) is still expensive, laborious, and time-consuming, and there is strong demand for effective bioinformatics approaches to identify potential PPIs. Given the large amount of PPI data, high-performance processors can be used to enhance deep learning methods that make predictions directly from protein sequences.
Results: We propose the Sequence-Statistics-Content (SSC) protein sequence encoding format, based on information extraction from the original sequence, to further improve the performance of the convolutional neural network. The original protein sequences are encoded in a three-channel format by introducing statistical information (the second channel) and bigram encoding information (the third channel), which increases the unique sequence features available to the deep learning model. On PPI prediction tasks, the 2D convolutional neural network (2D CNN) with SSC encoding gives better results than the 1D CNN with one-hot encoding. Independent validation on new interactions from the HIPPIE database (version 2.1, published on July 18, 2017) and validation of directly predicted results with a molecular docking tool indicate the effectiveness of the proposed encoding in the CNN model.
Conclusion: The proposed protein sequence encoding method is effective at improving the capability of the CNN model on protein sequence-related tasks and may also enhance other machine learning or deep learning methods. Prediction accuracy and molecular docking validation showed considerable improvement compared to the existing one-hot encoding method, indicating that SSC encoding may be useful for protein sequence-related tasks. The source code of the proposed methods is freely available for academic research at https://github.com/wangy496/SSC-format/ . |
url |
https://doi.org/10.1186/s12859-021-04111-w |
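The abstract describes a three-channel encoding (sequence, statistics, bigram) without giving its details, so the following is only a rough sketch of what such an encoding could look like, not the authors' implementation (which is in the linked repository). Everything here is an assumption made for illustration: the function name `encode_three_channel`, the integer residue codes, the use of per-sequence relative frequency as the "statistics" channel, and the `(a - 1) * 20 + b` numbering of bigrams.

```python
from collections import Counter

# Assumption: the 20 standard amino acids, coded 1..20; 0 is padding/unknown.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode_three_channel(seq, length):
    """Hypothetical three-channel encoding of a protein sequence.

    Returns [sequence_channel, statistics_channel, bigram_channel],
    each a list of exactly `length` numbers (truncated or zero-padded).
    """
    seq = seq.upper()[:length]
    n = len(seq)
    counts = Counter(seq)

    # Channel 1 (sequence): integer code of each residue.
    ch_seq = [AA_INDEX.get(aa, 0) for aa in seq]

    # Channel 2 (statistics, assumed): relative frequency of the residue
    # at each position within this sequence.
    ch_stat = [counts[aa] / n for aa in seq]

    # Channel 3 (bigram, assumed): code 1..400 for (residue, next residue);
    # 0 when there is no successor or either residue is unknown.
    ch_bigram = []
    for i in range(n):
        a = AA_INDEX.get(seq[i], 0)
        b = AA_INDEX.get(seq[i + 1], 0) if i + 1 < n else 0
        ch_bigram.append((a - 1) * 20 + b if a and b else 0)

    pad = [0] * (length - n)
    return [ch + pad for ch in (ch_seq, ch_stat, ch_bigram)]


# Usage: encode a short toy sequence into three channels of length 6.
channels = encode_three_channel("ACCA", 6)
```

Stacking the three equal-length channels gives an image-like input, which is the kind of shape a 2D CNN consumes; how the paper actually arranges the channels into two dimensions is not stated in this record.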