Improving protein function prediction methods with integrated literature data
Abstract. Background: Determining the function of uncharacterized proteins is a major challenge in the post-genomic era due to the problem's complexity and scale. Identifying a protein's function contributes to an understanding of its role in th...
Main Authors: | Gabow Aaron P, Leach Sonia M, Baumgartner William A, Hunter Lawrence E, Goldberg Debra S |
---|---|
Format: | Article |
Language: | English |
Published: | BMC, 2008-04-01 |
Series: | BMC Bioinformatics |
Online Access: | http://www.biomedcentral.com/1471-2105/9/198 |
id |
doaj-042c8550b6e9457594c42e2891768f12 |
---|---|
record_format |
Article |
spelling |
doaj-042c8550b6e9457594c42e2891768f12 2020-11-25T02:45:26Z eng BMC BMC Bioinformatics 1471-2105 2008-04-01 9 1 198 10.1186/1471-2105-9-198 Improving protein function prediction methods with integrated literature data Gabow Aaron P Leach Sonia M Baumgartner William A Hunter Lawrence E Goldberg Debra S Abstract. Background: Determining the function of uncharacterized proteins is a major challenge in the post-genomic era due to the problem's complexity and scale. Identifying a protein's function contributes to an understanding of its role in the involved pathways, its suitability as a drug target, and its potential for protein modifications. Several graph-theoretic approaches predict unidentified functions of proteins by using the functional annotations of better-characterized proteins in protein-protein interaction networks. We systematically consider the use of literature co-occurrence data, introduce a new method for quantifying the reliability of co-occurrence, and test how performance differs across species. We also quantify changes in performance as the prediction algorithms annotate with increased specificity. Results: We find that including information on the co-occurrence of proteins within an abstract greatly boosts performance in the Functional Flow graph-theoretic function prediction algorithm in yeast, fly and worm. This increase in performance is not simply due to the presence of additional edges, since supplementing protein-protein interactions with co-occurrence data outperforms supplementing with a comparably sized genetic interaction dataset. Through the combination of protein-protein interactions and co-occurrence data, the neighborhood around unknown proteins is quickly connected to well-characterized nodes that global prediction algorithms can exploit. Our method for quantifying co-occurrence reliability shows superior performance to the other methods, particularly at threshold values around 10%, which yield the best trade-off between coverage and accuracy. In contrast, the traditional way of asserting co-occurrence when at least one abstract mentions both proteins proves to be the worst method for generating co-occurrence data, introducing too many false positives. Annotating the functions with greater specificity is harder, but co-occurrence data still proves beneficial. Conclusion: Co-occurrence data is a valuable supplemental source for graph-theoretic function prediction algorithms. A rapidly growing literature corpus ensures that co-occurrence data is a readily available resource for nearly every studied organism, particularly those with small protein interaction databases. Though arguably biased toward known genes, co-occurrence data provides critical additional links to well-studied regions in the interaction network that graph-theoretic function prediction algorithms can exploit. http://www.biomedcentral.com/1471-2105/9/198 |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Gabow Aaron P Leach Sonia M Baumgartner William A Hunter Lawrence E Goldberg Debra S |
spellingShingle |
Gabow Aaron P Leach Sonia M Baumgartner William A Hunter Lawrence E Goldberg Debra S Improving protein function prediction methods with integrated literature data BMC Bioinformatics |
author_facet |
Gabow Aaron P Leach Sonia M Baumgartner William A Hunter Lawrence E Goldberg Debra S |
author_sort |
Gabow Aaron P |
title |
Improving protein function prediction methods with integrated literature data |
title_short |
Improving protein function prediction methods with integrated literature data |
title_full |
Improving protein function prediction methods with integrated literature data |
title_fullStr |
Improving protein function prediction methods with integrated literature data |
title_full_unstemmed |
Improving protein function prediction methods with integrated literature data |
title_sort |
improving protein function prediction methods with integrated literature data |
publisher |
BMC |
series |
BMC Bioinformatics |
issn |
1471-2105 |
publishDate |
2008-04-01 |
description |
Abstract. Background: Determining the function of uncharacterized proteins is a major challenge in the post-genomic era due to the problem's complexity and scale. Identifying a protein's function contributes to an understanding of its role in the involved pathways, its suitability as a drug target, and its potential for protein modifications. Several graph-theoretic approaches predict unidentified functions of proteins by using the functional annotations of better-characterized proteins in protein-protein interaction networks. We systematically consider the use of literature co-occurrence data, introduce a new method for quantifying the reliability of co-occurrence, and test how performance differs across species. We also quantify changes in performance as the prediction algorithms annotate with increased specificity. Results: We find that including information on the co-occurrence of proteins within an abstract greatly boosts performance in the Functional Flow graph-theoretic function prediction algorithm in yeast, fly and worm. This increase in performance is not simply due to the presence of additional edges, since supplementing protein-protein interactions with co-occurrence data outperforms supplementing with a comparably sized genetic interaction dataset. Through the combination of protein-protein interactions and co-occurrence data, the neighborhood around unknown proteins is quickly connected to well-characterized nodes that global prediction algorithms can exploit. Our method for quantifying co-occurrence reliability shows superior performance to the other methods, particularly at threshold values around 10%, which yield the best trade-off between coverage and accuracy. In contrast, the traditional way of asserting co-occurrence when at least one abstract mentions both proteins proves to be the worst method for generating co-occurrence data, introducing too many false positives. Annotating the functions with greater specificity is harder, but co-occurrence data still proves beneficial. Conclusion: Co-occurrence data is a valuable supplemental source for graph-theoretic function prediction algorithms. A rapidly growing literature corpus ensures that co-occurrence data is a readily available resource for nearly every studied organism, particularly those with small protein interaction databases. Though arguably biased toward known genes, co-occurrence data provides critical additional links to well-studied regions in the interaction network that graph-theoretic function prediction algorithms can exploit. |
url |
http://www.biomedcentral.com/1471-2105/9/198 |
work_keys_str_mv |
AT gabowaaronp improvingproteinfunctionpredictionmethodswithintegratedliteraturedata AT leachsoniam improvingproteinfunctionpredictionmethodswithintegratedliteraturedata AT baumgartnerwilliama improvingproteinfunctionpredictionmethodswithintegratedliteraturedata AT hunterlawrencee improvingproteinfunctionpredictionmethodswithintegratedliteraturedata AT goldbergdebras improvingproteinfunctionpredictionmethodswithintegratedliteraturedata |
_version_ |
1724762890217455616 |