An improved deep learning method for predicting DNA-binding proteins based on contextual features in amino acid sequences.

As the number of known proteins has grown, accurately identifying DNA-binding proteins has become a significant biological challenge. Various computational methods have been proposed to recognize DNA-binding proteins from amino acid sequences alone, such as SVM, DNABP and CNN-RNN....

Bibliographic Details
Main Authors: Siquan Hu, Ruixiong Ma, Haiou Wang
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2019-01-01
Series: PLoS ONE
Online Access: https://doi.org/10.1371/journal.pone.0225317
id doaj-c7019a5e6934437c9c97c0f06823c869
record_format Article
spelling doaj-c7019a5e6934437c9c97c0f06823c869 | 2021-03-03T21:17:14Z | eng | Public Library of Science (PLoS) | PLoS ONE | 1932-6203 | 2019-01-01 | 14(11): e0225317 | 10.1371/journal.pone.0225317 | An improved deep learning method for predicting DNA-binding proteins based on contextual features in amino acid sequences. | Siquan Hu; Ruixiong Ma; Haiou Wang | https://doi.org/10.1371/journal.pone.0225317
collection DOAJ
language English
format Article
sources DOAJ
author Siquan Hu
Ruixiong Ma
Haiou Wang
title An improved deep learning method for predicting DNA-binding proteins based on contextual features in amino acid sequences.
publisher Public Library of Science (PLoS)
series PLoS ONE
issn 1932-6203
publishDate 2019-01-01
description As the number of known proteins has grown, accurately identifying DNA-binding proteins has become a significant biological challenge. Various computational methods have been proposed to recognize DNA-binding proteins from amino acid sequences alone, such as SVM, DNABP and CNN-RNN. However, these methods do not consider the context within amino acid sequences, which makes it difficult for them to adequately capture sequence features. In this study, a new method that combines a bidirectional long short-term memory (BiLSTM) recurrent neural network with a convolutional neural network (CNN), called CNN-BiLSTM, is proposed to identify DNA-binding proteins. The CNN-BiLSTM model can explore the potential contextual relationships in amino acid sequences and obtain more features than traditional models can. The experimental results show that CNN-BiLSTM achieves a validation-set prediction accuracy of 96.5%, which is 7.8% higher than that of SVM, 9.6% higher than that of DNABP and 3.7% higher than that of CNN-RNN. When tested on 20,000 independent samples from UniProt that were not involved in model training, CNN-BiLSTM reached an accuracy of 94.5%, which is 12% higher than that of SVM, 4.9% higher than that of DNABP and 4% higher than that of CNN-RNN. We visualized and compared the training process of CNN-BiLSTM with that of CNN-RNN and found that the former generalizes better from the training dataset, showing that CNN-BiLSTM adapts to a wider range of protein sequences. On the test set, CNN-BiLSTM is also more credible because its predicted scores are closer to the sample labels than those of CNN-RNN. Therefore, the proposed CNN-BiLSTM is a more powerful method for identifying DNA-binding proteins.
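The accuracy comparisons in the description can be made concrete with a little arithmetic. The sketch below derives the baseline accuracies implied by the reported margins. It assumes the "X% higher" figures are percentage-point differences, which the abstract does not state explicitly. The names `REPORTED`, `MARGINS` and `implied_baselines` are illustrative only and do not come from the article.

```python
# CNN-BiLSTM accuracies reported in the abstract (percent).
REPORTED = {"validation": 96.5, "independent_test": 94.5}

# Reported margins over each baseline method.
# ASSUMPTION: these are read as percentage points, not relative gains.
MARGINS = {
    "validation": {"SVM": 7.8, "DNABP": 9.6, "CNN-RNN": 3.7},
    "independent_test": {"SVM": 12.0, "DNABP": 4.9, "CNN-RNN": 4.0},
}


def implied_baselines(reported, margins):
    """Return, per data split, CNN-BiLSTM accuracy minus each margin."""
    return {
        split: {
            method: round(reported[split] - gap, 1)
            for method, gap in gaps.items()
        }
        for split, gaps in margins.items()
    }


BASELINES = implied_baselines(REPORTED, MARGINS)

if __name__ == "__main__":
    for split, values in BASELINES.items():
        print(split, values)
```

Under the percentage-point reading, the implied SVM accuracy is 88.7% on the validation set and 82.5% on the independent UniProt samples. If the margins were relative improvements instead, the baselines would differ.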
url https://doi.org/10.1371/journal.pone.0225317