Assessing predictors for new post translational modification sites: A case study on hydroxylation.

Post-translational modification (PTM) sites have become popular for predictor development. However, with the exception of phosphorylation and a handful of other examples, PTMs suffer from a limited number of available training examples and sparsity in protein sequences. Here, proline hydroxylation i...


Bibliographic Details
Main Authors: Damiano Piovesan, Andras Hatos, Giovanni Minervini, Federica Quaglia, Alexander Miguel Monzon, Silvio C E Tosatto
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2020-06-01
Series: PLoS Computational Biology
Online Access: https://doi.org/10.1371/journal.pcbi.1007967
id doaj-622e220bd713406783649d3739009b3a
record_format Article
collection DOAJ
language English
format Article
sources DOAJ
author Damiano Piovesan
Andras Hatos
Giovanni Minervini
Federica Quaglia
Alexander Miguel Monzon
Silvio C E Tosatto
title Assessing predictors for new post translational modification sites: A case study on hydroxylation.
publisher Public Library of Science (PLoS)
series PLoS Computational Biology
issn 1553-734X
1553-7358
publishDate 2020-06-01
description Post-translational modification (PTM) sites have become popular for predictor development. However, with the exception of phosphorylation and a handful of other examples, PTMs suffer from a limited number of available training examples and sparsity in protein sequences. Here, proline hydroxylation is taken as an example to compare different methods and evaluate their performance on new experimentally determined sites. As a guide for effective experimental design, predictors require both high specificity and sensitivity. However, the self-reported performance may often not be indicative of prediction quality and detection of new sites is not guaranteed. We have benchmarked seven published hydroxylation site predictors on two newly constructed independent datasets. The self-reported performance is found to widely overestimate the real accuracy measured on independent datasets. No predictor performs better than random on new examples, indicating the refined models do not sufficiently generalize to detect new sites. The number of false positives is high and precision low, in particular for non-collagen proteins whose motifs are not conserved. As hydroxylation site predictors do not generalize for new data, caution is advised when using PTM predictors in the absence of independent evaluations, in particular for highly specific sites involved in signalling.
url https://doi.org/10.1371/journal.pcbi.1007967