Natural language processing in text mining for structural modeling of protein complexes

Abstract. Background: Structural modeling of protein-protein interactions produces a large number of putative configurations of the protein complexes. Identification of the near-native models among them is a serious challenge. Publicly available results of biomedical research may provide constraints o...

Bibliographic Details
Main Authors:	Varsha D. Badal, Petras J. Kundrotas, Ilya A. Vakser
Format:	Article
Language:	English
Published:	BMC 2018-03-01
Series:	BMC Bioinformatics
Subjects:	Protein interactions, Binding site prediction, Protein docking, Dependency parser, Rule-based system, Supervised learning
Online Access:	http://link.springer.com/article/10.1186/s12859-018-2079-4

id	doaj-fe32a6b9a16d4c9b8f5f91062e1e42a1
record_format	Article
spelling	doaj-fe32a6b9a16d4c9b8f5f91062e1e42a1 | 2020-11-25T00:26:21Z | eng | BMC | BMC Bioinformatics | 1471-2105 | 2018-03-01 | 191110 | 10.1186/s12859-018-2079-4 | Natural language processing in text mining for structural modeling of protein complexes | Varsha D. Badal, Petras J. Kundrotas, Ilya A. Vakser | Center for Computational Biology and Department of Molecular Biosciences, The University of Kansas (all three authors) | http://link.springer.com/article/10.1186/s12859-018-2079-4 | Protein interactions, Binding site prediction, Protein docking, Dependency parser, Rule-based system, Supervised learning
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Varsha D. Badal, Petras J. Kundrotas, Ilya A. Vakser
spellingShingle	Varsha D. Badal, Petras J. Kundrotas, Ilya A. Vakser | Natural language processing in text mining for structural modeling of protein complexes | BMC Bioinformatics | Protein interactions, Binding site prediction, Protein docking, Dependency parser, Rule-based system, Supervised learning
author_facet	Varsha D. Badal, Petras J. Kundrotas, Ilya A. Vakser
author_sort	Varsha D. Badal
title	Natural language processing in text mining for structural modeling of protein complexes
title_sort	natural language processing in text mining for structural modeling of protein complexes
publisher	BMC
series	BMC Bioinformatics
issn	1471-2105
publishDate	2018-03-01
description	Abstract. Background: Structural modeling of protein-protein interactions produces a large number of putative configurations of the protein complexes. Identification of the near-native models among them is a serious challenge. Publicly available results of biomedical research may provide constraints on the binding mode, which can be essential for the docking. Our text-mining (TM) tool, which extracts binding site residues from the PubMed abstracts, was successfully applied to protein docking (Badal et al., PLoS Comput Biol, 2015; 11: e1004630). Still, many extracted residues were not relevant to the docking. Results: We present an extension of the TM tool, which utilizes natural language processing (NLP) for analyzing the context of the residue occurrence. The procedure was tested using generic and specialized dictionaries. The results showed that the keyword dictionaries designed for identification of protein interactions are not adequate for the TM prediction of the binding mode. However, our dictionary designed to distinguish keywords relevant to the protein binding sites led to considerable improvement in the TM performance. We investigated the utility of several methods of context analysis, based on dissection of the sentence parse trees. The machine learning-based NLP filtered the pool of the mined residues significantly more efficiently than the rule-based NLP. Constraints generated by NLP were tested in docking of unbound proteins from the DOCKGROUND X-ray benchmark set 4. The output of the global low-resolution docking scan was post-processed, separately, by constraints from the basic TM, constraints re-ranked by NLP, and the reference constraints. The quality of a match was assessed by the interface root-mean-square deviation. The results showed significant improvement of the docking output when using the constraints generated by the advanced TM with NLP. Conclusions: The basic TM procedure for extracting protein-protein binding site residues from the PubMed abstracts was significantly advanced by the deep parsing (NLP techniques for contextual analysis) in purging of the initial pool of the extracted residues. Benchmarking showed a substantial increase of the docking success rate based on the constraints generated by the advanced TM with NLP.
topic	Protein interactions, Binding site prediction, Protein docking, Dependency parser, Rule-based system, Supervised learning
url	http://link.springer.com/article/10.1186/s12859-018-2079-4
work_keys_str_mv	AT varshadbadal naturallanguageprocessingintextminingforstructuralmodelingofproteincomplexes AT petrasjkundrotas naturallanguageprocessingintextminingforstructuralmodelingofproteincomplexes AT ilyaavakser naturallanguageprocessingintextminingforstructuralmodelingofproteincomplexes
_version_	1725344612280696832

Similar Items