A Data-Driven Method for Selecting Optimal Models Based on Graphical Visualisation of Differences in Sequentially Fitted ROC Model Parameters

Differences in modelling techniques and model performance assessments typically impinge on the quality of knowledge extraction from data. We propose an algorithm for determining optimal patterns in data by separately training and testing three decision tree models in the Pima Indians Diabetes and th...


Bibliographic Details
Main Authors: K S Mwitondi, R E Moustafa, A S Hadi
Format: Article
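Language: English
Published: Ubiquity Press 2013-05-01
Series: Data Science Journal
Subjects: Bayesian error; Data mining; Data visualisation; Decision trees; Domain partitioning; Optimal bandwidth; ROC curves; Visual analytics; Youden Index
Online Access: http://datascience.codata.org/articles/172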
id doaj-bc36c2811c6f49a5a5af9d254c9de1a0
record_format Article
spelling doaj-bc36c2811c6f49a5a5af9d254c9de1a0
2020-11-24T23:13:07Z
eng
Ubiquity Press
Data Science Journal
1683-1470
2013-05-01
12
10.2481/dsj.WDS-045
172
K S Mwitondi: Sheffield Hallam University, Faculty of Arts, Computing, Engineering and Sciences, Sheffield S1 1WB, UK
R E Moustafa: George Washington University, Statistics Department, 2140 Pennsylvania Ave., NW, Washington DC, 20052, USA
A S Hadi: The American University in Cairo, Egypt/Cornell University, 291 Ives Hall, Cornell University, Ithaca, NY 14853-3901, USA
collection DOAJ
language English
format Article
sources DOAJ
author K S Mwitondi
R E Moustafa
A S Hadi
publisher Ubiquity Press
series Data Science Journal
issn 1683-1470
publishDate 2013-05-01
description Differences in modelling techniques and model performance assessments typically impinge on the quality of knowledge extraction from data. We propose an algorithm for determining optimal patterns in data by separately training and testing three decision tree models in the Pima Indians Diabetes and the Bupa Liver Disorders datasets. Model performance is assessed using ROC curves and the Youden Index. Moving differences between sequential fitted parameters are then extracted, and their respective probability density estimations are used to track their variability using an iterative graphical data visualisation technique developed for this purpose. Our results show that the proposed strategy separates the groups more robustly than the plain ROC/Youden approach, eliminates obscurity, and minimizes over-fitting. Further, the algorithm can easily be understood by non-specialists and demonstrates multi-disciplinary compliance.
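The abstract names three building blocks: the Youden Index taken from a ROC curve, moving differences between sequentially fitted parameters, and density estimates of those differences. The record does not include the authors' implementation, so the following is only a minimal sketch of the generic techniques. The function names, the toy scores and the fixed bandwidth are assumptions made for illustration, not values or code from the paper.

```python
import math

def roc_points(scores, labels):
    """Return (threshold, TPR, FPR) for each distinct score used as a cut-off.

    A case is predicted positive when its score is >= the threshold.
    Labels are 1 (positive) or 0 (negative).
    """
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((t, tp / pos, fp / neg))
    return points

def youden_index(scores, labels):
    """Maximise J = sensitivity + specificity - 1 = TPR - FPR over cut-offs."""
    t, tpr, fpr = max(roc_points(scores, labels), key=lambda p: p[1] - p[2])
    return tpr - fpr, t

def moving_differences(values):
    """First differences between consecutive fitted values."""
    return [b - a for a, b in zip(values, values[1:])]

def gaussian_kde(sample, x, bandwidth):
    """Gaussian kernel density estimate at x (bandwidth chosen by the caller)."""
    norm = len(sample) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample) / norm

# Hypothetical toy data, not taken from the Pima or Bupa datasets.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
j, cutoff = youden_index(scores, labels)

# Hypothetical Youden values from four sequential fits.
diffs = moving_differences([0.50, 0.62, 0.60, 0.71])
density_at_zero = gaussian_kde(diffs, 0.0, bandwidth=0.05)
```

In this sketch a density of the differences that is tightly concentrated near zero would indicate that successive fits are stable; how the authors turn such densities into a model-selection rule is described in the full article, not in this record.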
topic Bayesian error
Data mining
Data visualisation
Decision trees
Domain partitioning
Optimal bandwidth
ROC curves
Visual analytics
Youden Index
url http://datascience.codata.org/articles/172