A Data-Driven Method for Selecting Optimal Models Based on Graphical Visualisation of Differences in Sequentially Fitted ROC Model Parameters
Differences in modelling techniques and model performance assessments typically impinge on the quality of knowledge extraction from data. We propose an algorithm for determining optimal patterns in data by separately training and testing three decision tree models in the Pima Indians Diabetes and th...
Main Authors: | K S Mwitondi, R E Moustafa, A S Hadi |
---|---|
Format: | Article |
Language: | English |
Published: | Ubiquity Press, 2013-05-01 |
Series: | Data Science Journal |
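Subjects: | Bayesian error; Data mining; Data visualisation; Decision trees; Domain partitioning; Optimal bandwidth; ROC curves; Visual analytics; Youden Index |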
Online Access: | http://datascience.codata.org/articles/172 |
id |
doaj-bc36c2811c6f49a5a5af9d254c9de1a0 |
---|---|
record_format |
Article |
spelling |
doaj-bc36c2811c6f49a5a5af9d254c9de1a0; 2020-11-24T23:13:07Z; eng; Ubiquity Press; Data Science Journal; 1683-1470; 2013-05-01; 12; 10.2481/dsj.WDS-045; 172; A Data-Driven Method for Selecting Optimal Models Based on Graphical Visualisation of Differences in Sequentially Fitted ROC Model Parameters; K S Mwitondi (Sheffield Hallam University, Faculty of Arts, Computing, Engineering and Sciences, Sheffield S1 1WB, UK); R E Moustafa (George Washington University, Statistics Department, 2140 Pennsylvania Ave., NW, Washington DC, 20052, USA); A S Hadi (The American University in Cairo, Egypt/Cornell University, 291 Ives Hall, Cornell University, Ithaca, NY 14853-3901, USA); Differences in modelling techniques and model performance assessments typically impinge on the quality of knowledge extraction from data. We propose an algorithm for determining optimal patterns in data by separately training and testing three decision tree models in the Pima Indians Diabetes and the Bupa Liver Disorders datasets. Model performance is assessed using ROC curves and the Youden Index. Moving differences between sequential fitted parameters are then extracted, and their respective probability density estimations are used to track their variability using an iterative graphical data visualisation technique developed for this purpose. Our results show that the proposed strategy separates the groups more robustly than the plain ROC/Youden approach, eliminates obscurity, and minimizes over-fitting. Further, the algorithm can easily be understood by non-specialists and demonstrates multi-disciplinary compliance.; http://datascience.codata.org/articles/172; Bayesian error, Data mining, Data visualisation, Decision trees, Domain partitioning, Optimal bandwidth, ROC curves, Visual analytics, Youden Index |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
K S Mwitondi, R E Moustafa, A S Hadi |
spellingShingle |
K S Mwitondi R E Moustafa A S Hadi A Data-Driven Method for Selecting Optimal Models Based on Graphical Visualisation of Differences in Sequentially Fitted ROC Model Parameters Data Science Journal Bayesian error Data mining Data visualisation Decision trees Domain partitioning Optimal bandwidth ROC curves Visual analytics Youden Index |
author_facet |
K S Mwitondi, R E Moustafa, A S Hadi |
author_sort |
K S Mwitondi |
title |
A Data-Driven Method for Selecting Optimal Models Based on Graphical Visualisation of Differences in Sequentially Fitted ROC Model Parameters |
title_sort |
data-driven method for selecting optimal models based on graphical visualisation of differences in sequentially fitted roc model parameters |
publisher |
Ubiquity Press |
series |
Data Science Journal |
issn |
1683-1470 |
publishDate |
2013-05-01 |
description |
Differences in modelling techniques and model performance assessments typically impinge on the quality of knowledge extraction from data. We propose an algorithm for determining optimal patterns in data by separately training and testing three decision tree models in the Pima Indians Diabetes and the Bupa Liver Disorders datasets. Model performance is assessed using ROC curves and the Youden Index. Moving differences between sequential fitted parameters are then extracted, and their respective probability density estimations are used to track their variability using an iterative graphical data visualisation technique developed for this purpose. Our results show that the proposed strategy separates the groups more robustly than the plain ROC/Youden approach, eliminates obscurity, and minimizes over-fitting. Further, the algorithm can easily be understood by non-specialists and demonstrates multi-disciplinary compliance. |
topic |
Bayesian error; Data mining; Data visualisation; Decision trees; Domain partitioning; Optimal bandwidth; ROC curves; Visual analytics; Youden Index |
url |
http://datascience.codata.org/articles/172 |
_version_ |
1725599221031108608 |
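
The description in this record mentions assessing models with ROC curves and the Youden Index and then extracting moving differences between sequential fitted parameters. The following is a minimal sketch of those two ideas only, not the authors' algorithm: the toy scores and labels, the function names, and the choice to difference the Youden values along the threshold sequence are all assumptions made for illustration, and the density-estimation and visualisation steps are omitted.

```python
# Illustrative sketch only; data and function names are hypothetical,
# not taken from the article.

def roc_points(scores, labels):
    """Return (threshold, TPR, FPR) for each distinct score used as a cut-off.

    A case is predicted positive when its score is >= the threshold.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((t, tp / pos, fp / neg))
    return points


def youden(point):
    """Youden Index J = sensitivity + specificity - 1 = TPR - FPR."""
    _, tpr, fpr = point
    return tpr - fpr


def moving_differences(values):
    """First-order differences between consecutive values in a sequence."""
    return [b - a for a, b in zip(values, values[1:])]


# Toy classifier scores with true class labels (1 = positive, 0 = negative).
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1, 1, 1]

points = roc_points(scores, labels)
best = max(points, key=youden)           # cut-off with the largest J
j_values = [youden(p) for p in points]   # J along the threshold sequence
j_diffs = moving_differences(j_values)   # assumed stand-in for "fitted parameters"

print("best threshold:", best[0], "J =", round(youden(best), 3))
print("moving differences of J:", [round(d, 3) for d in j_diffs])
```

In this toy example the sign change in the differences marks where the Youden Index stops improving; the article itself tracks the variability of such differences through their estimated probability densities.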