Predicting the Unobserved : A statistical analysis of missing data techniques for binary classification

The aim of the thesis is to investigate how the classification performance of random forest and logistic regression differs, given an imbalanced data set with MCAR missing data. Performance is measured in terms of accuracy and sensitivity. Two analyses are performed: one with a simulated data set...

Bibliographic Details
Main Author:	Säfström, Stella
Format:	Others
Language:	English
Published:	Uppsala universitet, Statistiska institutionen 2019
Subjects:	Random forest; logistic regression; imputation; classification; MCAR; missing data; imbalanced data; Probability Theory and Statistics; Sannolikhetsteori och statistik
Online Access:	http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-388581

id	ndltd-UPSALLA1-oai-DiVA.org-uu-388581
record_format	oai_dc
collection	NDLTD
language	English
format	Others
sources	NDLTD
topic	Random forest; logistic regression; imputation; classification; MCAR; missing data; imbalanced data; Probability Theory and Statistics; Sannolikhetsteori och statistik
description	The aim of the thesis is to investigate how the classification performance of random forest and logistic regression differs, given an imbalanced data set with MCAR (missing completely at random) missing data. Performance is measured in terms of accuracy and sensitivity. Two analyses are performed: one with a simulated data set and one application using data from the Swedish population registries. The simulated data set is constructed to have the same class imbalance of 1:5. The missing values are handled using three different techniques: complete case analysis, predictive mean matching and mean imputation. The thesis concludes that logistic regression and random forest are on average equally accurate, with some instances of random forest outperforming logistic regression. Logistic regression consistently outperforms random forest with regard to sensitivity. This implies that logistic regression may be the better option for studies where the goal is to accurately predict outcomes in the minority class. None of the missing data techniques stood out in terms of performance.
author	Säfström, Stella
title	Predicting the Unobserved : A statistical analysis of missing data techniques for binary classification
publisher	Uppsala universitet, Statistiska institutionen
publishDate	2019
url	http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-388581

Similar Items