Predictive Accuracy Measures for Binary Outcomes: Impact of Incidence Rate and Optimization Techniques
Other Authors: | |
---|---|
Format: | Others |
Language: | English |
Published: | Florida State University |
Subjects: | |
Online Access: | http://purl.flvc.org/fsu/fd/FSU_2016SP_Scolnik_fsu_0071E_13146 |
Summary: | Evaluating the performance of models predicting a binary outcome can be done using a variety of measures. While some measures
intend to describe the model's overall fit, others more accurately describe the model's ability to discriminate between the two outcomes.
If a model fits well but doesn't discriminate well, what does that tell us? Given two models, if one discriminates well but has poor fit
while the other fits well but discriminates poorly, which of the two should we choose? The measures of interest for our research include
the area under the ROC curve, Brier Score, discrimination slope, Log-Loss, R-squared and F-score. To examine the underlying relationships
among all of the measures, real data and simulation studies are used. The real data comes from multiple cardiovascular research studies
and the simulation studies are run under general conditions and also for incidence rates ranging from 2% to 50%. The results of these
analyses provide insight into the relationships among the measures and raise concerns about scenarios in which the measures may yield different
conclusions. The impact of incidence rate on the relationships provides a basis for exploring alternative maximization routines to
logistic regression. While most of the measures are easily optimized using the Newton-Raphson algorithm, the maximization of the area
under the ROC curve requires optimization of a non-linear, non-differentiable function. Usage of the Nelder-Mead simplex algorithm and
close connections to economics research yield unique parameter estimates and general asymptotic conditions. Using real and simulated data
to compare optimizing the area under the ROC curve to logistic regression further reveals the impact of incidence rate on the
relationships, significant increases in achievable areas under the ROC curve, and differences in conclusions about including a variable in
a model.
A dissertation submitted to the Department of Statistics in partial fulfillment of the requirements for the degree of Doctor of Philosophy.
Spring Semester 2016. April 8, 2016.
Keywords: AUC, Brier score, incidence rate, logistic regression, optimization.
Includes bibliographical references.
Committee: Daniel McGee, Professor Co-Directing Thesis; Elizabeth Slate, Professor Co-Directing Thesis;
Isaac Eberstein, University Representative; Fred Huffer, Committee Member. |
---|
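
For readers unfamiliar with the measures named in the summary, the following is a minimal sketch of four of them (area under the ROC curve, Brier score, Log-Loss, and discrimination slope) using their common textbook definitions. It is an illustration only, not the dissertation's code: the function names and the toy data are hypothetical, and the dissertation may define or scale these measures differently. R-squared and F-score are omitted because several competing definitions exist and the record does not say which one is used.

```python
import math


def auc(y, p):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen event receives a higher prediction than a randomly chosen
    non-event (ties count as one half)."""
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    non_events = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(
        1.0 if e > n else 0.5 if e == n else 0.0
        for e in events
        for n in non_events
    )
    return wins / (len(events) * len(non_events))


def brier_score(y, p):
    """Mean squared difference between predicted probability and outcome."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)


def log_loss(y, p, eps=1e-15):
    """Average negative Bernoulli log-likelihood; predictions are clipped
    to [eps, 1 - eps] so the logarithm is always defined."""
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1.0 - eps)
        total -= yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
    return total / len(y)


def discrimination_slope(y, p):
    """Mean prediction among events minus mean prediction among non-events."""
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    non_events = [pi for yi, pi in zip(y, p) if yi == 0]
    return sum(events) / len(events) - sum(non_events) / len(non_events)


# Hypothetical toy data: 8 subjects, 3 events (incidence rate 37.5%).
y = [0, 0, 0, 1, 0, 1, 1, 0]
p = [0.1, 0.3, 0.2, 0.8, 0.6, 0.7, 0.4, 0.2]

for name, fn in [
    ("AUC", auc),
    ("Brier score", brier_score),
    ("Log-Loss", log_loss),
    ("Discrimination slope", discrimination_slope),
]:
    print(f"{name}: {fn(y, p):.4f}")
```

The AUC here depends only on the ranking of the predictions, while the Brier score and Log-Loss depend on their actual values, which is one reason a model can discriminate well yet fit poorly, as the summary discusses.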