A new concordant partial AUC and partial c statistic for imbalanced data in the evaluation of machine learning algorithms

Abstract Background In classification and diagnostic testing, the receiver-operator characteristic (ROC) plot and the area under the ROC curve (AUC) describe how an adjustable threshold causes changes in two types of error: false positives and false negatives. Only part of the ROC curve and AUC are...

Bibliographic Details
Main Authors:	André M. Carrington, Paul W. Fieguth, Hammad Qazi, Andreas Holzinger, Helen H. Chen, Franz Mayr, Douglas G. Manuel
Format:	Article
Language:	English
Published:	BMC 2020-01-01
Series:	BMC Medical Informatics and Decision Making
Subjects:	Area under the ROC curve; Receiver operating characteristic; C statistic; Concordance; Partial area index; Imbalanced data
Online Access:	https://doi.org/10.1186/s12911-019-1014-6

id	doaj-8dbdb829787b4c91bfe5658820118fb6
record_format	Article
spelling	doaj-8dbdb829787b4c91bfe5658820118fb6; 2021-01-10T12:53:07Z; eng; BMC; BMC Medical Informatics and Decision Making; ISSN 1472-6947; 2020-01-01; vol. 20, no. 1, pp. 1-12; 10.1186/s12911-019-1014-6; André M. Carrington (Ottawa Hospital Research Institute); Paul W. Fieguth (Faculty of Engineering, University of Waterloo); Hammad Qazi (School of Public Health and Health Systems, University of Waterloo); Andreas Holzinger (Holzinger Group (HCAI), Institute for Medical Informatics/Statistics, Medical University Graz); Helen H. Chen (School of Public Health and Health Systems, University of Waterloo); Franz Mayr (Universidad ORT Uruguay); Douglas G. Manuel (Ottawa Hospital Research Institute)
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	André M. Carrington, Paul W. Fieguth, Hammad Qazi, Andreas Holzinger, Helen H. Chen, Franz Mayr, Douglas G. Manuel
title	A new concordant partial AUC and partial c statistic for imbalanced data in the evaluation of machine learning algorithms
publisher	BMC
series	BMC Medical Informatics and Decision Making
issn	1472-6947
publishDate	2020-01-01
description	Abstract
Background: In classification and diagnostic testing, the receiver-operator characteristic (ROC) plot and the area under the ROC curve (AUC) describe how an adjustable threshold causes changes in two types of error: false positives and false negatives. Only part of the ROC curve and AUC are informative, however, when they are used with imbalanced data. Hence, alternatives to the AUC have been proposed, such as the partial AUC and the area under the precision-recall curve. However, these alternatives cannot be as fully interpreted as the AUC, in part because they ignore some information about actual negatives.
Methods: We derive and propose a new concordant partial AUC and a new partial c statistic for ROC data, as foundational measures and methods to help understand and explain parts of the ROC plot and AUC. Our partial measures are continuous and discrete versions of the same measure, are derived from the AUC and c statistic respectively, are validated as equal to each other, and are validated as equal in summation to whole measures where expected. Our partial measures are tested for validity on a classic ROC example from Fawcett, a variation thereof, and two real-life benchmark data sets in breast cancer: the Wisconsin and Ljubljana data sets. Interpretation of an example is then provided.
Results: Results show the expected equalities between our new partial measures and the existing whole measures. The example interpretation illustrates the need for our newly derived partial measures.
Conclusions: The concordant partial area under the ROC curve was proposed and, unlike previous partial measure alternatives, it maintains the characteristics of the AUC. The first partial c statistic for ROC plots was also proposed as an unbiased interpretation for part of an ROC curve. The expected equalities among and between our newly derived partial measures and their existing full measure counterparts are confirmed. These measures may be used with any data set, but this paper focuses on imbalanced data with low prevalence.
Future work: Future work with our proposed measures may: demonstrate their value for imbalanced data with high prevalence; compare them to other measures not based on areas; and combine them with other ROC measures and techniques.
topic	Area under the ROC curve; Receiver operating characteristic; C statistic; Concordance; Partial area index; Imbalanced data
url	https://doi.org/10.1186/s12911-019-1014-6
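
The abstract relies on the equality between the whole AUC (a continuous area) and the whole c statistic (a discrete count over pairs). The sketch below illustrates only that well-known whole-measure equality; it is not the authors' code and does not implement their concordant partial AUC or partial c statistic. The scores and labels are made-up toy values, and the function names are hypothetical.

```python
from itertools import product


def c_statistic(scores, labels):
    """Share of (positive, negative) pairs where the positive scores higher.

    Ties between a positive and a negative count as half a concordant pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    concordant = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg)
    )
    return concordant / (len(pos) * len(neg))


def roc_points(scores, labels):
    """Empirical ROC points (FPR, TPR), one per distinct score threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every instance tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_auc(points):
    """Area under a piecewise-linear ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy data (assumed for illustration): 4 positives, 4 negatives, one tie at 0.7.
scores = [0.9, 0.8, 0.7, 0.7, 0.5, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

auc = trapezoid_auc(roc_points(scores, labels))
c = c_statistic(scores, labels)
print(auc, c)  # both are 0.84375 on this toy data
```

Counting ties as half a pair is what makes the discrete count agree with the trapezoidal area; the sloped ROC segment produced by the tied scores contributes exactly that half.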

Similar Items