Improving protein function prediction methods with integrated literature data


Bibliographic Details
Main Authors: Gabow Aaron P, Leach Sonia M, Baumgartner William A, Hunter Lawrence E, Goldberg Debra S
Format: Article
Language: English
Published: BMC 2008-04-01
Series: BMC Bioinformatics
Online Access: http://www.biomedcentral.com/1471-2105/9/198
id doaj-042c8550b6e9457594c42e2891768f12
record_format Article
spelling doaj-042c8550b6e9457594c42e2891768f12
2020-11-25T02:45:26Z
eng
BMC
BMC Bioinformatics
1471-2105
2008-04-01
9
1
198
10.1186/1471-2105-9-198
Improving protein function prediction methods with integrated literature data
Gabow Aaron P
Leach Sonia M
Baumgartner William A
Hunter Lawrence E
Goldberg Debra S
http://www.biomedcentral.com/1471-2105/9/198
collection DOAJ
language English
format Article
sources DOAJ
author Gabow Aaron P
Leach Sonia M
Baumgartner William A
Hunter Lawrence E
Goldberg Debra S
title Improving protein function prediction methods with integrated literature data
publisher BMC
series BMC Bioinformatics
issn 1471-2105
publishDate 2008-04-01
description Abstract
Background: Determining the function of uncharacterized proteins is a major challenge in the post-genomic era due to the problem's complexity and scale. Identifying a protein's function contributes to an understanding of its role in the involved pathways, its suitability as a drug target, and its potential for protein modifications. Several graph-theoretic approaches predict unidentified functions of proteins by using the functional annotations of better-characterized proteins in protein-protein interaction networks. We systematically consider the use of literature co-occurrence data, introduce a new method for quantifying the reliability of co-occurrence, and test how performance differs across species. We also quantify changes in performance as the prediction algorithms annotate with increased specificity.
Results: We find that including information on the co-occurrence of proteins within an abstract greatly boosts performance in the Functional Flow graph-theoretic function prediction algorithm in yeast, fly and worm. This increase in performance is not simply due to the presence of additional edges, since supplementing protein-protein interactions with co-occurrence data outperforms supplementing with a comparably sized genetic interaction dataset. Through the combination of protein-protein interactions and co-occurrence data, the neighborhood around unknown proteins is quickly connected to well-characterized nodes that global prediction algorithms can exploit. Our method for quantifying co-occurrence reliability shows superior performance to the other methods, particularly at threshold values around 10%, which yield the best trade-off between coverage and accuracy. In contrast, the traditional way of asserting co-occurrence when at least one abstract mentions both proteins proves to be the worst method for generating co-occurrence data, introducing too many false positives. Annotating the functions with greater specificity is harder, but co-occurrence data still proves beneficial.
Conclusion: Co-occurrence data is a valuable supplemental source for graph-theoretic function prediction algorithms. A rapidly growing literature corpus ensures that co-occurrence data is a readily available resource for nearly every studied organism, particularly those with small protein interaction databases. Though arguably biased toward known genes, co-occurrence data provides critical additional links to well-studied regions in the interaction network that graph-theoretic function prediction algorithms can exploit.
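The abstract contrasts two ways of deriving co-occurrence edges: asserting an edge whenever at least one abstract mentions both proteins, versus keeping only pairs whose co-occurrence passes a reliability threshold. The record does not define the authors' reliability measure, so the sketch below substitutes a simple Jaccard ratio (abstracts mentioning both proteins divided by abstracts mentioning either) purely for illustration. The function name `cooccurrence_edges`, the protein names, and the toy corpus are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(abstracts, threshold=0.10):
    """Return {(protein_a, protein_b): score} for pairs meeting the threshold.

    `abstracts` is an iterable of sets of protein names, one set per abstract.
    The score is a Jaccard ratio, an illustrative stand-in and NOT the
    reliability measure introduced in the paper.
    """
    mentions = Counter()  # number of abstracts mentioning each protein
    joint = Counter()     # number of abstracts mentioning each pair
    for proteins in abstracts:
        unique = sorted(set(proteins))
        mentions.update(unique)
        joint.update(combinations(unique, 2))

    edges = {}
    for (a, b), both in joint.items():
        either = mentions[a] + mentions[b] - both
        score = both / either
        if score >= threshold:
            edges[(a, b)] = score
    return edges


# Toy corpus (hypothetical): P1 is heavily studied, so a single shared
# abstract with P3 is weak evidence of a functional link.
corpus = [
    {"P1", "P2"},
    {"P1", "P2"},
    {"P1", "P3"},
    {"P1"}, {"P1"}, {"P1"}, {"P1"}, {"P1"}, {"P1"}, {"P1"}, {"P1"}, {"P1"},
    {"P3"}, {"P3"},
]

# "At least one abstract" rule: every co-mentioned pair becomes an edge.
any_mention = cooccurrence_edges(corpus, threshold=0.0)

# Thresholded rule: only pairs with a sufficiently high score survive.
filtered = cooccurrence_edges(corpus, threshold=0.10)
```

With this toy corpus the permissive rule keeps both pairs, while the 10% cutoff drops the P1/P3 pair, which illustrates how thresholding can trade coverage for fewer spurious edges. The resulting edges could then be merged with a protein-protein interaction edge list before running a graph-based predictor.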
url http://www.biomedcentral.com/1471-2105/9/198