Automatic extraction of protein point mutations using a graph bigram association.

Protein point mutations are an essential component of the evolutionary and experimental analysis of protein structure and function. While many manually curated databases attempt to index point mutations, most experimentally generated point mutations and the biological impacts of the changes are desc...

Bibliographic Details
Main Authors: Lawrence C Lee, Florence Horn, Fred E Cohen
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2007-02-01
Series: PLoS Computational Biology
Online Access: http://europepmc.org/articles/PMC1794323?pdf=render
id doaj-73e84d7ff9be4287848c5567d2d704e2
record_format Article
spelling doaj-73e84d7ff9be4287848c5567d2d704e2; 2020-11-24T21:55:55Z; eng; Public Library of Science (PLoS); PLoS Computational Biology; ISSN 1553-734X, 1553-7358; 2007-02-01; vol. 3, no. 2, e16; doi:10.1371/journal.pcbi.0030016; Automatic extraction of protein point mutations using a graph bigram association.; Lawrence C Lee, Florence Horn, Fred E Cohen; http://europepmc.org/articles/PMC1794323?pdf=render
collection DOAJ
language English
format Article
sources DOAJ
author Lawrence C Lee
Florence Horn
Fred E Cohen
title Automatic extraction of protein point mutations using a graph bigram association.
publisher Public Library of Science (PLoS)
series PLoS Computational Biology
issn 1553-734X
1553-7358
publishDate 2007-02-01
description Protein point mutations are an essential component of the evolutionary and experimental analysis of protein structure and function. While many manually curated databases attempt to index point mutations, most experimentally generated point mutations and the biological impacts of the changes are described in the peer-reviewed published literature. We describe an application, Mutation GraB (Graph Bigram), that identifies, extracts, and verifies point mutations from biomedical literature. The principal problem of point mutation extraction is to link the point mutation with its associated protein and organism of origin. Our algorithm uses a graph-based bigram traversal to identify these relevant associations and exploits the Swiss-Prot protein database to verify this information. The graph bigram method differs from other models for point mutation extraction in that it incorporates frequency and positional data of all terms in an article to drive the point mutation-protein association. Our method was tested on 589 articles describing point mutations from the G protein-coupled receptor (GPCR), tyrosine kinase, and ion channel protein families. We evaluated our graph bigram metric against a word-proximity metric for term association on datasets of full-text literature in these three protein families. Our testing shows that the graph bigram metric achieves a higher F-measure for the GPCRs (0.79 versus 0.76), protein tyrosine kinases (0.72 versus 0.69), and ion channel transporters (0.76 versus 0.74). Importantly, in situations where more than one protein can be assigned to a point mutation and disambiguation is required, the graph bigram metric achieves a precision of 0.84 compared with the word distance metric precision of 0.73. We believe the graph bigram search metric to be a significant improvement over previous search metrics for point mutation extraction and to be applicable to text-mining applications that require the association of words.
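The word-proximity baseline and the F-measure mentioned in the description can be illustrated with a minimal sketch. This is not the Mutation GraB implementation and not the graph bigram method: the regular expression, the function names, the single-token protein names, and the example sentence are all assumptions made for illustration. It only shows the general idea of pairing each point mutation mention with the nearest protein mention by token distance, and of combining precision and recall into an F-measure.

```python
import re

# Assumption: mutations are written as wild-type residue, position, mutant
# residue in single-letter amino acid codes (e.g. "D113N"). Real articles
# use many more notations; this pattern is illustrative only.
AMINO = "ACDEFGHIKLMNPQRSTVWY"
MUTATION_RE = re.compile(rf"([{AMINO}])(\d+)([{AMINO}])")


def tokenize(text):
    """Split text into alphanumeric tokens (hyphens kept inside tokens)."""
    return re.findall(r"[A-Za-z0-9\-]+", text)


def associate_by_proximity(text, protein_names):
    """Map each mutation token to the protein name nearest in token distance.

    `protein_names` is a hypothetical set of single-token protein names that
    the caller supplies; no database lookup is performed here.
    """
    tokens = tokenize(text)
    protein_positions = [(i, t) for i, t in enumerate(tokens) if t in protein_names]
    pairs = {}
    for i, tok in enumerate(tokens):
        if MUTATION_RE.fullmatch(tok) and protein_positions:
            nearest = min(protein_positions, key=lambda p: abs(p[0] - i))
            pairs[tok] = nearest[1]
    return pairs


def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up example sentence, not taken from the evaluated articles.
    sentence = ("The D113N mutation in ADRB2 reduced agonist binding, "
                "whereas EGFR carrying L858R remained active.")
    print(associate_by_proximity(sentence, {"ADRB2", "EGFR"}))
```

A pure distance rule like this has no way to use how often or where terms occur across the whole article, which is the information the description says the graph bigram metric adds.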
url http://europepmc.org/articles/PMC1794323?pdf=render