An improved deep learning method for predicting DNA-binding proteins based on contextual features in amino acid sequences.
As the number of known proteins has grown, accurately identifying DNA-binding proteins has become a significant biological challenge. Various computational methods have been proposed to recognize DNA-binding proteins from amino acid sequences alone, such as SVM, DNABP and CNN-RNN....
Main Authors: | Siquan Hu, Ruixiong Ma, Haiou Wang |
---|---|
Format: | Article |
Language: | English |
Published: | Public Library of Science (PLoS), 2019-01-01 |
Series: | PLoS ONE |
Online Access: | https://doi.org/10.1371/journal.pone.0225317 |
id | doaj-c7019a5e6934437c9c97c0f06823c869 |
---|---|
record_format | Article |
spelling | doaj-c7019a5e6934437c9c97c0f06823c869; 2021-03-03T21:17:14Z; eng; Public Library of Science (PLoS); PLoS ONE; 1932-6203; 2019-01-01; 14; 11; e0225317; 10.1371/journal.pone.0225317; An improved deep learning method for predicting DNA-binding proteins based on contextual features in amino acid sequences.; Siquan Hu; Ruixiong Ma; Haiou Wang; https://doi.org/10.1371/journal.pone.0225317 |
collection | DOAJ |
language | English |
format | Article |
sources | DOAJ |
author | Siquan Hu, Ruixiong Ma, Haiou Wang |
title | An improved deep learning method for predicting DNA-binding proteins based on contextual features in amino acid sequences. |
publisher | Public Library of Science (PLoS) |
series | PLoS ONE |
issn | 1932-6203 |
publishDate | 2019-01-01 |
description | As the number of known proteins has grown, accurately identifying DNA-binding proteins has become a significant biological challenge. Various computational methods have been proposed to recognize DNA-binding proteins from amino acid sequences alone, such as SVM, DNABP and CNN-RNN. However, these methods do not consider the context within amino acid sequences, which makes it difficult for them to capture sequence features adequately. In this study, a new method called CNN-BiLSTM, which combines a bidirectional long short-term memory (BiLSTM) recurrent neural network with a convolutional neural network (CNN), is proposed to identify DNA-binding proteins. The CNN-BiLSTM model can explore potential contextual relationships in amino acid sequences and extract more features than traditional models can. The experimental results show that CNN-BiLSTM achieves a validation-set prediction accuracy of 96.5%, which is 7.8% higher than that of SVM, 9.6% higher than that of DNABP and 3.7% higher than that of CNN-RNN. On 20,000 independent samples provided by UniProt that were not involved in model training, the accuracy of CNN-BiLSTM reached 94.5%, which is 12% higher than that of SVM, 4.9% higher than that of DNABP and 4% higher than that of CNN-RNN. We visualized and compared the training process of CNN-BiLSTM with that of CNN-RNN and found that the former generalizes better from the training dataset, indicating that CNN-BiLSTM adapts to a wider range of protein sequences. On the test set, CNN-BiLSTM is also more reliable: its predicted scores are closer to the sample labels than those of CNN-RNN. Therefore, the proposed CNN-BiLSTM is a more powerful method for identifying DNA-binding proteins. |
url | https://doi.org/10.1371/journal.pone.0225317 |
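
The abstract reports CNN-BiLSTM's accuracy together with its margin over each baseline, but not the baseline accuracies themselves. The short Python sketch below derives them from the reported figures. It assumes that "X% higher" means an absolute difference in percentage points, which the record does not state explicitly. The function names `implied_baselines` and `approx_correct` are illustrative only, and the derived values are not figures quoted from the article.

```python
# Figures quoted in the abstract, in percent.
REPORTED = {
    "validation set": {
        "cnn_bilstm": 96.5,
        "gaps": {"SVM": 7.8, "DNABP": 9.6, "CNN-RNN": 3.7},
    },
    "independent test set": {
        "cnn_bilstm": 94.5,
        "gaps": {"SVM": 12.0, "DNABP": 4.9, "CNN-RNN": 4.0},
    },
}

# Size of the independent UniProt test set mentioned in the abstract.
TEST_SET_SIZE = 20_000


def implied_baselines(cnn_bilstm_acc, gaps):
    """Return the baseline accuracies implied by the reported margins.

    ASSUMPTION: each margin is an absolute difference in percentage
    points, not a relative improvement.
    """
    return {name: round(cnn_bilstm_acc - gap, 1) for name, gap in gaps.items()}


def approx_correct(accuracy_percent, n_samples):
    """Approximate number of correct predictions for a given accuracy.

    Only approximate, because the reported accuracy is itself rounded.
    """
    return round(accuracy_percent / 100 * n_samples)


if __name__ == "__main__":
    for split, row in REPORTED.items():
        print(split, implied_baselines(row["cnn_bilstm"], row["gaps"]))
    print("approx. correct on test set:",
          approx_correct(REPORTED["independent test set"]["cnn_bilstm"],
                         TEST_SET_SIZE))
```

Under that assumption, the validation-set margins correspond to roughly 88.7% (SVM), 86.9% (DNABP) and 92.8% (CNN-RNN), and the test-set margins to roughly 82.5%, 89.6% and 90.5%. If the margins were relative improvements instead, the implied baselines would differ.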