Logistic Regression, Measures of Explained Variation, and the Base Rate Problem

One of the desirable properties of the coefficient of determination (R2 measure) is that its values for different models should be comparable whether the models differ in one or more predictors, or in the dependent variable, or whether the models are specified as being different for different subsets...

Bibliographic Details
Other Authors: Sharma, Dinesh R. (author)
Format: Others
Language: English
Published: Florida State University
Subjects: Statistics; Probabilities
Online Access: http://purl.flvc.org/fsu/fd/FSU_migr_etd-1789
id ndltd-fsu.edu-oai-fsu.digital.flvc.org-fsu_176270
record_format oai_dc
collection NDLTD
language English
format Others
sources NDLTD
topic Statistics
Probabilities
description One of the desirable properties of the coefficient of determination (R2 measure) is that its values for different models should be comparable, whether the models differ in one or more predictors, differ in the dependent variable, or are specified separately for different subsets of a dataset. This allows researchers to compare the adequacy of models across subgroups of the population or across models with different but related dependent variables. However, the various analogs of the R2 measure used for logistic regression analysis are highly sensitive to the base rate (the proportion of successes in the sample) and thus do not possess this property. An R2 measure that is sensitive to the base rate is not suitable for comparing the same or different models across different datasets, different subsets of a dataset, or different but related dependent variables.
We evaluated 14 R2 measures that have been suggested, or that might be useful, for measuring explained variation in logistic regression models, using three criteria: 1) intuitively reasonable interpretability; 2) numerical consistency with the rho2 of the underlying model; and 3) base rate sensitivity. We carried out a Monte Carlo simulation study to examine the numerical consistency and the base rate dependency of the various R2 measures for logistic regression analysis.
We found all of the parametric R2 measures to be substantially sensitive to the base rate. The magnitude of this sensitivity tends to be further influenced by the rho2 of the underlying model. None of the measures considered in our study performs equally well on all three evaluation criteria. While R2L stands out for its intuitively reasonable interpretability as a measure of explained variation as well as its independence from the base rate, it appears to severely underestimate the underlying rho2. We found R2CS to be numerically most consistent with the underlying rho2, with R2N its nearest competitor. In addition, the base rate sensitivity of these two measures appears to be very close to that of R2L, the most base rate invariant parametric R2 measure. Therefore, we suggest using R2CS and R2N for logistic regression modeling, especially when it is reasonable to believe that an underlying latent variable exists. However, when the latent variable does not exist, comparability with the underlying rho2 is not an issue, and R2L might be a better choice than all the other R2 measures.
A Dissertation Submitted to the Department of Statistics in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy.
Summer Semester, 2006.
June 26, 2006.
Keywords: Logistic Regression, Explained Variation, Base Rate, Base Rate Problem, Coefficient of Determinant, R^2 Statistics, Latent Variable
Includes bibliographical references.
Daniel L. McGee, Sr., Professor Directing Dissertation; Myra Hurt, Outside Committee Member; Xu-Feng Niu, Committee Member; Eric Chicken, Committee Member.
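The abstract names R2L, R2CS, and R2N without defining them. The sketch below assumes they denote the likelihood-ratio (McFadden-type), Cox-Snell, and Nagelkerke measures, as these abbreviations are commonly used; the dissertation's own definitions may differ. The simulation is a hypothetical illustration of the base rate idea, not the author's Monte Carlo design: a latent variable with a chosen rho2 is dichotomized at different thresholds, so the underlying relationship stays the same while the base rate changes. All function names and parameter values are illustrative assumptions.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson fit of a logistic regression; X must contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)                      # Bernoulli variance at each point
        grad = X.T @ (y - p)                   # score vector
        hess = X.T @ (X * w[:, None])          # observed information matrix
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def log_lik(y, p):
    """Bernoulli log-likelihood, with probabilities clipped away from 0 and 1."""
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def r2_measures(y, p_hat):
    """Assumed definitions: likelihood-ratio (R2L), Cox-Snell (R2CS), Nagelkerke (R2N)."""
    n = len(y)
    ll_model = log_lik(y, p_hat)
    ll_null = log_lik(y, np.full(n, y.mean()))   # intercept-only model
    r2_l = 1.0 - ll_model / ll_null
    r2_cs = 1.0 - np.exp(2.0 * (ll_null - ll_model) / n)
    r2_n = r2_cs / (1.0 - np.exp(2.0 * ll_null / n))  # rescale by the maximum of R2CS
    return {"R2L": r2_l, "R2CS": r2_cs, "R2N": r2_n}

def simulate(threshold, n=5000, rho2=0.5, seed=0):
    """Dichotomize a latent y* = b*x + e (logistic e) at `threshold` (hypothetical design)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    err_var = np.pi ** 2 / 3.0                   # variance of a standard logistic error
    b = np.sqrt(rho2 * err_var / (1.0 - rho2))   # gives b^2 / (b^2 + err_var) = rho2
    y = (b * x + rng.logistic(size=n) > threshold).astype(float)
    X = np.column_stack([np.ones(n), x])
    p_hat = 1.0 / (1.0 + np.exp(-X @ fit_logistic(X, y)))
    return {"base_rate": float(y.mean()), **r2_measures(y, p_hat)}

if __name__ == "__main__":
    # Same latent rho2, different cut points, hence different base rates.
    for t in (0.0, 2.0):
        print(t, {k: round(v, 3) for k, v in simulate(t).items()})
```

Comparing the printed measures across thresholds gives a rough sense of how each one moves with the base rate while the latent rho2 is held fixed. A single simulated dataset per threshold is only indicative; a proper study would average over many replications.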
author2 Sharma, Dinesh R. (author)
author_facet Sharma, Dinesh R. (author)
title Logistic Regression, Measures of Explained Variation, and the Base Rate Problem
publisher Florida State University
url http://purl.flvc.org/fsu/fd/FSU_migr_etd-1789
_version_ 1719318052334469120
spelling ndltd-fsu.edu-oai-fsu.digital.flvc.org-fsu_176270 2020-06-05T03:08:28Z
Logistic Regression, Measures of Explained Variation, and the Base Rate Problem
Sharma, Dinesh R. (author); McGee, Daniel L. (professor directing dissertation); Hurt, Myra (outside committee member); Niu, Xu-Feng (committee member); Chicken, Eric (committee member)
Department of Statistics (degree granting department); Florida State University (degree granting institution)
Text; Florida State University; English (eng); 1 online resource; computer; application/pdf
Statistics; Probabilities
FSU_migr_etd-1789 http://purl.flvc.org/fsu/fd/FSU_migr_etd-1789
This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s). The copyright in theses and dissertations completed at Florida State University is held by the students who author them.
http://diginole.lib.fsu.edu/islandora/object/fsu%3A176270/datastream/TN/view/Logistic%20Regression%2C%20Measures%20of%20Explained%20Variation%2C%20and%20the%20Base%20Rate%20Problem.jpg