PON-Sol2: Prediction of Effects of Variants on Protein Solubility

Genetic variations have a multitude of effects on proteins. A substantial number of variations affect protein–solvent interactions, either aggregation or solubility. Aggregation is often related to structural alterations, whereas solubilizable proteins in the solid phase can be made soluble again by dilution.

Bibliographic Details
Main Authors: Yang Yang, Lianjie Zeng, Mauno Vihinen
Format: Article
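Language: English
Published: MDPI AG 2021-07-01
Series: International Journal of Molecular Sciences
Subjects: protein solubility prediction, prediction, machine learning, variation interpretation, artificial intelligence, variation
Online Access: https://www.mdpi.com/1422-0067/22/15/8027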
id doaj-1fc233e9b365459b9add20dcd38a2468
record_format Article
spelling doaj-1fc233e9b365459b9add20dcd38a2468
2021-08-06T15:25:10Z
eng
MDPI AG
International Journal of Molecular Sciences
1661-6596
1422-0067
2021-07-01
22
8027
8027
10.3390/ijms22158027
PON-Sol2: Prediction of Effects of Variants on Protein Solubility
Yang Yang (0)
Lianjie Zeng (1)
Mauno Vihinen (2)
(0) School of Computer Science and Technology, Soochow University, Suzhou 215006, China
(1) School of Computer Science and Technology, Soochow University, Suzhou 215006, China
(2) Department of Experimental Medical Science, Lund University, BMC B13, SE-221 84 Lund, Sweden
collection DOAJ
language English
format Article
sources DOAJ
author Yang Yang
Lianjie Zeng
Mauno Vihinen
title PON-Sol2: Prediction of Effects of Variants on Protein Solubility
publisher MDPI AG
series International Journal of Molecular Sciences
issn 1661-6596
1422-0067
publishDate 2021-07-01
description Genetic variations have a multitude of effects on proteins. A substantial number of variations affect protein–solvent interactions, either aggregation or solubility. Aggregation is often related to structural alterations, whereas solubilizable proteins in the solid phase can be made soluble again by dilution. Solubility is a central protein property and, when reduced, can lead to disease. We developed a prediction method, PON-Sol2, to identify amino acid substitutions that increase, decrease, or have no effect on protein solubility. The method is a machine learning tool that uses a gradient boosting algorithm. It was trained on a large dataset of variants with different outcomes, after features were selected from a large number of tested properties. The method is fast and has high performance. In 10-fold cross-validation, the normalized correct prediction rate for three states is 0.656 and the normalized GC2 score is 0.312. The corresponding numbers in the blind test were 0.545 and 0.157. The performance was superior to that of previous methods. The PON-Sol2 predictor is freely available. It can be used to predict the solubility effects of variants for any organism, even in large-scale projects.
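The description reports a normalized correct prediction rate and a normalized GC2 score for three states, but this record does not define either measure. The following is a minimal sketch under stated assumptions, not the PON-Sol2 implementation: it assumes that "normalized" means rescaling each true-class row of the confusion matrix so that all classes carry equal weight, and that GC2 is the generalized squared correlation computed from observed and expected cell counts. The function names and the example counts are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # rows: true class, columns: predicted class


def normalize_rows(confusion: Matrix) -> Matrix:
    """Rescale each true-class row to sum to 1 (assumed meaning of 'normalized')."""
    normalized = []
    for row in confusion:
        total = sum(row)
        if total == 0:
            raise ValueError("each class needs at least one case")
        normalized.append([cell / total for cell in row])
    return normalized


def correct_prediction_rate(confusion: Matrix) -> float:
    """Fraction of all cases that lie on the diagonal."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def gc2(confusion: Matrix) -> float:
    """Generalized squared correlation: chi-square-like sum / (N * (K - 1))."""
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    score = 0.0
    for i in range(k):
        for j in range(k):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip cells for classes that were never predicted
                score += (confusion[i][j] - expected) ** 2 / expected
    return score / (n * (k - 1))


# Hypothetical counts for the three states: decrease, no effect, increase.
example = [
    [60.0, 30.0, 10.0],
    [40.0, 140.0, 20.0],
    [5.0, 15.0, 30.0],
]
balanced = normalize_rows(example)
print("normalized CPR:", round(correct_prediction_rate(balanced), 3))
print("normalized GC2:", round(gc2(balanced), 3))
```

Under these assumptions, a perfect three-state predictor scores 1 on both measures, while a predictor that spreads every class evenly over the three states scores 1/3 for the correct prediction rate and 0 for GC2, which gives some context for the reported values.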
topic protein solubility prediction
prediction
machine learning
variation interpretation
artificial intelligence
variation
url https://www.mdpi.com/1422-0067/22/15/8027