The data representativeness criterion: Predicting the performance of supervised classification based on data set similarity.

In a broad range of fields it may be desirable to reuse a supervised classification algorithm and apply it to a new data set. However, generalization of such an algorithm and thus achieving a similar classification performance is only possible when the training data used to build the algorithm is si...

Bibliographic Details
Main Authors: Evelien Schat, Rens van de Schoot, Wouter M Kouw, Duco Veen, Adriënne M Mendrik
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2020-01-01
Series: PLoS ONE
Online Access: https://doi.org/10.1371/journal.pone.0237009
id doaj-195e8c17af5c4237bd5689ba45eb6eef
record_format Article
spelling doaj-195e8c17af5c4237bd5689ba45eb6eef 2021-03-03T22:03:55Z eng Public Library of Science (PLoS) PLoS ONE 1932-6203 2020-01-01 15 8 e0237009 10.1371/journal.pone.0237009 The data representativeness criterion: Predicting the performance of supervised classification based on data set similarity. Evelien Schat, Rens van de Schoot, Wouter M Kouw, Duco Veen, Adriënne M Mendrik. In a broad range of fields it may be desirable to reuse a supervised classification algorithm and apply it to a new data set. However, generalization of such an algorithm, and thus a similar classification performance, is only possible when the training data used to build the algorithm is similar to the new unseen data one wishes to apply it to. It is often unknown in advance how an algorithm will perform on new unseen data, which is a crucial reason for not deploying an algorithm at all. Therefore, tools are needed to measure the similarity of data sets. In this paper, we propose the Data Representativeness Criterion (DRC) to determine how representative a training data set is of a new unseen data set. We present a proof of principle to see whether the DRC can quantify the similarity of data sets and whether the DRC relates to the performance of a supervised classification algorithm. We compared a number of magnetic resonance imaging (MRI) data sets, ranging from subtle to severe differences in acquisition parameters. Results indicate that, based on the similarity of data sets, the DRC is able to give an indication as to when the performance of a supervised classifier decreases. The strictness of the DRC can be set by the user, depending on what one considers to be an acceptable underperformance. https://doi.org/10.1371/journal.pone.0237009
collection DOAJ
language English
format Article
sources DOAJ
author Evelien Schat
Rens van de Schoot
Wouter M Kouw
Duco Veen
Adriënne M Mendrik
author_sort Evelien Schat
title The data representativeness criterion: Predicting the performance of supervised classification based on data set similarity.
title_sort data representativeness criterion: predicting the performance of supervised classification based on data set similarity.
publisher Public Library of Science (PLoS)
series PLoS ONE
issn 1932-6203
publishDate 2020-01-01
description In a broad range of fields it may be desirable to reuse a supervised classification algorithm and apply it to a new data set. However, generalization of such an algorithm, and thus a similar classification performance, is only possible when the training data used to build the algorithm is similar to the new unseen data one wishes to apply it to. It is often unknown in advance how an algorithm will perform on new unseen data, which is a crucial reason for not deploying an algorithm at all. Therefore, tools are needed to measure the similarity of data sets. In this paper, we propose the Data Representativeness Criterion (DRC) to determine how representative a training data set is of a new unseen data set. We present a proof of principle to see whether the DRC can quantify the similarity of data sets and whether the DRC relates to the performance of a supervised classification algorithm. We compared a number of magnetic resonance imaging (MRI) data sets, ranging from subtle to severe differences in acquisition parameters. Results indicate that, based on the similarity of data sets, the DRC is able to give an indication as to when the performance of a supervised classifier decreases. The strictness of the DRC can be set by the user, depending on what one considers to be an acceptable underperformance.
url https://doi.org/10.1371/journal.pone.0237009
_version_ 1714813547864653824