PREDICTION OF PROTEIN FUNCTION USING TEXT FEATURES EXTRACTED FROM THE BIOMEDICAL LITERATURE

Proteins perform many important functions in the cell and are essential to the health of the cell and the organism. As such, there is much effort to understand the function of proteins. Due to the advances in sequencing technology, there are many sequences of proteins whose function is yet unknown....

Bibliographic Details
Main Author: Wong, Andrew
Other Authors: Queen's University (Kingston, Ont.). Theses (Queen's University (Kingston, Ont.))
Language: en
Published: 2013
Subjects: computer science; protein function prediction
Online Access: http://hdl.handle.net/1974/7923
id ndltd-LACETR-oai-collectionscanada.gc.ca-OKQ.1974-7923
record_format oai_dc
spelling ndltd-LACETR-oai-collectionscanada.gc.ca-OKQ.1974-7923
2013-12-20T03:40:55Z
PREDICTION OF PROTEIN FUNCTION USING TEXT FEATURES EXTRACTED FROM THE BIOMEDICAL LITERATURE
Wong, ANDREW
computer science
protein function prediction
Proteins perform many important functions in the cell and are essential to the health of the cell and the organism. As such, there is much effort to understand the function of proteins. Due to the advances in sequencing technology, there are many sequences of proteins whose function is yet unknown. Therefore, computational systems are being developed and used to help predict protein function. Most computational systems represent proteins using features that are derived from protein sequence or protein structure to predict function. In contrast, there are very few systems that use the biomedical literature as a source of features. Earlier work demonstrated the utility of biomedical literature as a source of text features for predicting protein subcellular location. In this thesis we build on that earlier work, and examine the effectiveness of using text features to predict protein function. Using the molecular function and biological process terms from the Gene Ontology (GO) as our function classes, we trained two classifiers (k-Nearest Neighbour and Support Vector Machines) to predict protein function. The proteins were represented using text features that were extracted from biomedical abstracts based on statistical properties. For evaluation, the performance of our two classifiers was compared to that of two baseline classifiers: one that assigns function based solely on the prior distribution of protein function, and one that assigns function based on sequence similarity. The systems were trained and tested using 5-fold cross-validation over a dataset of more than 36,000 proteins. Overall, we show that text features extracted from biomedical literature can be used to predict protein function for any organism. Our results also show that our text-based classifier typically has comparable performance to the sequence-similarity baseline classifier. Based on our results and what previous work had shown, we believe that text features can be integrated with other types of features to provide more accurate predictions for protein function.
Thesis (Master, Computing) -- Queen's University, 2013-04-24 21:07:13.983
Queen's University (Kingston, Ont.). Theses (Queen's University (Kingston, Ont.))
2013-04-24 21:07:13.983
2013-04-25T14:27:07Z
2013-04-25T14:27:07Z
2013-04-25
Thesis
http://hdl.handle.net/1974/7923
en
en
Canadian theses
This publication is made available by the authority of the copyright owner solely for the purpose of private study and research and may not be copied or reproduced except as permitted by the copyright laws without written authority from the copyright owner.
collection NDLTD
language en
en
sources NDLTD
topic computer science
protein function prediction
spellingShingle computer science
protein function prediction
Wong, ANDREW
PREDICTION OF PROTEIN FUNCTION USING TEXT FEATURES EXTRACTED FROM THE BIOMEDICAL LITERATURE
description Proteins perform many important functions in the cell and are essential to the health of the cell and the organism. As such, there is much effort to understand the function of proteins. Due to the advances in sequencing technology, there are many sequences of proteins whose function is yet unknown. Therefore, computational systems are being developed and used to help predict protein function. Most computational systems represent proteins using features that are derived from protein sequence or protein structure to predict function. In contrast, there are very few systems that use the biomedical literature as a source of features. Earlier work demonstrated the utility of biomedical literature as a source of text features for predicting protein subcellular location. In this thesis we build on that earlier work, and examine the effectiveness of using text features to predict protein function. Using the molecular function and biological process terms from the Gene Ontology (GO) as our function classes, we trained two classifiers (k-Nearest Neighbour and Support Vector Machines) to predict protein function. The proteins were represented using text features that were extracted from biomedical abstracts based on statistical properties. For evaluation, the performance of our two classifiers was compared to that of two baseline classifiers: one that assigns function based solely on the prior distribution of protein function, and one that assigns function based on sequence similarity. The systems were trained and tested using 5-fold cross-validation over a dataset of more than 36,000 proteins. Overall, we show that text features extracted from biomedical literature can be used to predict protein function for any organism. Our results also show that our text-based classifier typically has comparable performance to the sequence-similarity baseline classifier. Based on our results and what previous work had shown, we believe that text features can be integrated with other types of features to provide more accurate predictions for protein function.
Thesis (Master, Computing) -- Queen's University, 2013-04-24 21:07:13.983
author2 Queen's University (Kingston, Ont.). Theses (Queen's University (Kingston, Ont.))
author_facet Queen's University (Kingston, Ont.). Theses (Queen's University (Kingston, Ont.))
Wong, ANDREW
author Wong, ANDREW
author_sort Wong, ANDREW
title PREDICTION OF PROTEIN FUNCTION USING TEXT FEATURES EXTRACTED FROM THE BIOMEDICAL LITERATURE
title_short PREDICTION OF PROTEIN FUNCTION USING TEXT FEATURES EXTRACTED FROM THE BIOMEDICAL LITERATURE
title_full PREDICTION OF PROTEIN FUNCTION USING TEXT FEATURES EXTRACTED FROM THE BIOMEDICAL LITERATURE
title_fullStr PREDICTION OF PROTEIN FUNCTION USING TEXT FEATURES EXTRACTED FROM THE BIOMEDICAL LITERATURE
title_full_unstemmed PREDICTION OF PROTEIN FUNCTION USING TEXT FEATURES EXTRACTED FROM THE BIOMEDICAL LITERATURE
title_sort prediction of protein function using text features extracted from the biomedical literature
publishDate 2013
url http://hdl.handle.net/1974/7923
work_keys_str_mv AT wongandrew predictionofproteinfunctionusingtextfeaturesextractedfromthebiomedicalliterature
_version_ 1716621655734747136