Prediction and analysis of protein solubility using a novel scoring card method with dipeptide composition

Bibliographic Details
Main Authors: Huang Hui-Ling, Charoenkwan Phasit, Kao Te-Fen, Lee Hua-Chin, Chang Fang-Lin, Huang Wen-Lin, Ho Shinn-Jang, Shu Li-Sun, Chen Wen-Liang, Ho Shinn-Ying
Format: Article
Language: English
Published: BMC 2012-12-01
Series: BMC Bioinformatics
id doaj-e907ed8343be4f298cb390a42c9f8fe4
spelling BMC Bioinformatics, ISSN 1471-2105, 2012-12-01, vol. 13, Suppl 17, S3, doi:10.1186/1471-2105-13-S17-S3
collection DOAJ
issn 1471-2105
description Abstract

Background: Existing methods for predicting protein solubility on overexpression in Escherichia coli improve performance by using ensemble classifiers, such as two-stage support vector machine (SVM) based classifiers, together with a number of feature types, such as physicochemical properties and amino acid and dipeptide composition, accompanied by feature selection. A simple and easily interpretable method for predicting protein solubility is desirable as an alternative to these complex SVM-based methods.

Results: This study proposes a novel scoring card method (SCM) that uses dipeptide composition only to estimate solubility scores of sequences for predicting protein solubility. SCM calculates the propensities of the 400 individual dipeptides to be soluble using statistical discrimination between the soluble and insoluble proteins of a training data set. The propensity scores of all dipeptides are then further optimized using an intelligent genetic algorithm. The solubility score of a sequence is the sum of all propensity scores weighted by the dipeptide composition of the sequence. To evaluate SCM by performance comparison, four data sets of different sizes and with different degrees of variation in experimental conditions were used. The results show that the simple SCM, with its interpretable dipeptide propensities, has promising performance compared with existing SVM-based ensemble methods that use a number of feature types. Furthermore, the propensities of dipeptides and the solubility scores of sequences can provide insights into protein solubility. For example, the analysis of dipeptide scores shows that α-helix structures and thermophilic proteins have a high propensity to be soluble.

Conclusions: The propensities of individual dipeptides to be soluble vary for proteins under different experimental conditions. To predict protein solubility accurately using SCM, it is better to customize the score card of dipeptide propensities by using a training data set obtained under the same specified experimental conditions. The proposed SCM, with its solubility scores and dipeptide propensities, can be easily applied to protein function prediction problems in which dipeptide composition features play an important role.

Availability: The datasets used, the source code of SCM, and supplementary files are available at http://iclab.life.nctu.edu.tw/SCM/.
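The scoring idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (their source code is at the URL above), and it leaves out the genetic-algorithm optimization step entirely. The abstract does not give the exact statistic for the initial propensities, so the sketch assumes one: the difference in mean dipeptide composition between soluble and insoluble training sequences, min-max scaled to the range 0 to 1000. All function names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered dipeptides.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(seq):
    """Return the fraction of each of the 400 dipeptides in seq."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        return {d: 0.0 for d in DIPEPTIDES}
    return {d: c / total for d, c in counts.items()}


def mean_composition(seqs):
    """Average the dipeptide composition over a list of sequences."""
    comps = [dipeptide_composition(s) for s in seqs]
    return {d: sum(c[d] for c in comps) / len(comps) for d in DIPEPTIDES}


def initial_score_card(soluble, insoluble):
    """ASSUMPTION: propensity = mean composition difference between the
    two classes, min-max scaled to 0..1000 (not specified in the abstract)."""
    pos = mean_composition(soluble)
    neg = mean_composition(insoluble)
    diff = {d: pos[d] - neg[d] for d in DIPEPTIDES}
    lo, hi = min(diff.values()), max(diff.values())
    span = (hi - lo) or 1.0  # avoid division by zero if all differences are equal
    return {d: 1000.0 * (diff[d] - lo) / span for d in DIPEPTIDES}


def solubility_score(seq, card):
    """Sum of propensity scores weighted by the sequence's dipeptide composition."""
    comp = dipeptide_composition(seq)
    return sum(comp[d] * card[d] for d in DIPEPTIDES)
```

With made-up toy data, `card = initial_score_card(["AAAA"], ["CCCC"])` gives the dipeptide `AA` the highest propensity and `CC` the lowest, and `solubility_score("AACC", card)` falls between the two. A real classifier would also need a decision threshold on the score, chosen on the training data.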