Identification of protein functions using a machine-learning approach based on sequence-derived properties

Abstract. Background: Predicting the function of an unknown protein is an essential goal in bioinformatics. Sequence similarity-based approaches are widely used for function prediction; however, they are often inadequate in the absence of similar sequence...

Bibliographic Details
Main Authors: Oh Hae, Oh Young, Shin Moon, Lee Bum, Ryu Keun
Format: Article
Language: English
Published: BMC 2009-08-01
Series: Proteome Science
Online Access: http://www.proteomesci.com/content/7/1/27
id doaj-75ce8ea859ac479d8c235c0560e9f272
record_format Article
spelling doaj-75ce8ea859ac479d8c235c0560e9f272
2020-11-24T21:04:44Z
eng
BMC
Proteome Science
1477-5956
2009-08-01
7
1
27
10.1186/1477-5956-7-27
Identification of protein functions using a machine-learning approach based on sequence-derived properties
Oh Hae
Oh Young
Shin Moon
Lee Bum
Ryu Keun
Abstract
Background: Predicting the function of an unknown protein is an essential goal in bioinformatics. Sequence similarity-based approaches are widely used for function prediction; however, they are often inadequate in the absence of similar sequences or when the sequence similarity among known protein sequences is statistically weak. This study aimed to develop an accurate prediction method for identifying protein function, irrespective of sequence and structural similarities.
Results: A highly accurate prediction method capable of identifying protein function, based solely on protein sequence properties, is described. This method analyses and identifies specific features of the protein sequence that are highly correlated with certain protein functions and determines the combination of protein sequence features that best characterises protein function. Thirty-three features that represent subtle differences in local regions and full regions of the protein sequences were introduced. On the basis of 484 features extracted solely from the protein sequence, models were built to predict the functions of 11 different proteins from a broad range of cellular components, molecular functions, and biological processes. The accuracy of protein function prediction using random forests with feature selection ranged from 94.23% to 100%. The local sequence information was found to have a broad range of applicability in predicting protein function.
Conclusion: We present an accurate prediction method using a machine-learning approach based solely on protein sequence properties. The primary contribution of this paper is to propose new PNPRD features representing global and/or local differences in sequences, based on positively and/or negatively charged residues, to assist in predicting protein function. In addition, we identified a compact and useful feature subset for predicting the function of various proteins. Our results indicate that sequence-based classifiers can provide good results among a broad range of proteins, that the proposed features are useful in predicting several functions, and that the combination of our and traditional features may support the creation of a discriminative feature set for specific protein functions.
http://www.proteomesci.com/content/7/1/27
collection DOAJ
language English
format Article
sources DOAJ
author Oh Hae
Oh Young
Shin Moon
Lee Bum
Ryu Keun
publisher BMC
series Proteome Science
issn 1477-5956
publishDate 2009-08-01
url http://www.proteomesci.com/content/7/1/27
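The abstract mentions features built from positively and negatively charged residues over both full and local regions of a sequence, but this record does not define the PNPRD features themselves. The sketch below is therefore only a rough illustration of that general idea, not the authors' method. The residue groupings (K and R as positive, D and E as negative), the equal-length segmentation, and every function and feature name are assumptions made for this example.

```python
# Illustrative only: these are NOT the paper's PNPRD definitions, which the
# abstract does not give. All names and choices below are hypothetical.
POSITIVE = set("KR")  # assumed positively charged residues: lysine, arginine
NEGATIVE = set("DE")  # assumed negatively charged residues: aspartate, glutamate


def charge_fractions(seq):
    """Return (positive fraction, negative fraction) of residues in seq."""
    if not seq:
        return 0.0, 0.0
    pos = sum(1 for aa in seq if aa in POSITIVE)
    neg = sum(1 for aa in seq if aa in NEGATIVE)
    return pos / len(seq), neg / len(seq)


def charge_features(seq, n_segments=3):
    """Charge fractions for the full sequence and for each local segment."""
    seq = seq.upper()
    features = {}
    features["global_pos"], features["global_neg"] = charge_fractions(seq)
    # Split the sequence into n_segments contiguous, nearly equal regions.
    bounds = [round(i * len(seq) / n_segments) for i in range(n_segments + 1)]
    for i in range(n_segments):
        segment = seq[bounds[i]:bounds[i + 1]]
        pos, neg = charge_fractions(segment)
        features[f"seg{i}_pos"] = pos
        features[f"seg{i}_neg"] = neg
        # Difference between positive and negative content in this region.
        features[f"seg{i}_diff"] = pos - neg
    return features


if __name__ == "__main__":
    for name, value in charge_features("MKKRDDEEAALLKR").items():
        print(f"{name}: {value:.3f}")
```

Feature dictionaries of this kind could be converted to fixed-length vectors and passed to a random forest classifier (for example scikit-learn's `RandomForestClassifier`), which is the classifier family the abstract names; the feature selection step it mentions is not shown here.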