Evaluation of logistic regression and random forest classification based on prediction accuracy and metadata analysis

Model selection is an important part of classification. In this thesis we study the two classification models logistic regression and random forest. They are compared and evaluated based on prediction accuracy and metadata analysis. The models were trained on 25 diverse datasets. We calculated the p...

Bibliographic Details
Main Author: Wålinder, Andreas
Format: Others
Language: English
Published: Linnéuniversitetet, Institutionen för matematik (MA) 2014
Subjects: classification; logistic regression; random forest; metadata
Online Access: http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-35126
id ndltd-UPSALLA1-oai-DiVA.org-lnu-35126
record_format oai_dc
spelling ndltd-UPSALLA1-oai-DiVA.org-lnu-35126 | 2014-06-17T05:04:37Z | Evaluation of logistic regression and random forest classification based on prediction accuracy and metadata analysis | eng | Wålinder, Andreas | Linnéuniversitetet, Institutionen för matematik (MA) | 2014 | classification; logistic regression; random forest; metadata | Student thesis | info:eu-repo/semantics/bachelorThesis | text | http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-35126 | application/pdf | info:eu-repo/semantics/openAccess
collection NDLTD
language English
format Others
sources NDLTD
topic classification
logistic regression
random forest
metadata
description Model selection is an important part of classification. In this thesis we study two classification models, logistic regression and random forest, and compare them on prediction accuracy and metadata analysis. The models were trained on 25 diverse datasets. We calculated the prediction accuracy of both models using RapidMiner. We also collected metadata for the datasets: the number of observations, the number of predictor variables, and the number of classes in the response variable. The prediction accuracies of logistic regression and random forest are significantly correlated, with a correlation of 0.60 and confidence interval [0.29, 0.79]. The models appear to perform similarly across the datasets, with performance influenced more by the choice of dataset than by the choice of model. Random forest, with an average prediction accuracy of 81.66%, performed better on these datasets than logistic regression, with an average prediction accuracy of 73.07%. The difference is, however, not statistically significant, with a p-value of 0.088 for Student's t-test. Multiple linear regression analysis reveals that none of the analysed metadata has a significant linear relationship with logistic regression performance; the regression of logistic regression performance on the metadata has a p-value of 0.66. The results for random forest are similar: the regression of random forest performance on the metadata has a p-value of 0.89, and none of the analysed metadata has a significant linear relationship with random forest performance. We conclude that the prediction accuracies of logistic regression and random forest are correlated. Random forest performed slightly better on the studied datasets, but the difference is not statistically significant. The studied metadata do not appear to have a significant effect on the prediction accuracy of either model.
author Wålinder, Andreas
title Evaluation of logistic regression and random forest classification based on prediction accuracy and metadata analysis
publisher Linnéuniversitetet, Institutionen för matematik (MA)
publishDate 2014
url http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-35126
_version_ 1716670248159019008
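
The abstract in this record reports a correlation with a confidence interval and a Student's t-test on the two models' accuracies. The sketch below illustrates how such quantities could be computed. It is not the author's procedure: the record does not say how the confidence interval was obtained or which variant of the t-test was used, so the Fisher z-transform interval and the pooled-variance two-sample t statistic are assumptions. The accuracy values are hypothetical and are not taken from the thesis, and the function names are invented for this illustration.

```python
import math
from statistics import NormalDist, mean, stdev


def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for a correlation r from n pairs,
    using Fisher's z-transform (an assumed method, requires n > 3)."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    q = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - q * se), math.tanh(z + q * se)


def pooled_t(x, y):
    """Two-sample Student's t statistic with pooled variance.
    Returns the statistic and its degrees of freedom (no p-value)."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    t = (mean(x) - mean(y)) / math.sqrt(sp2 * (1 / nx + 1 / ny))
    return t, nx + ny - 2


# Hypothetical per-dataset accuracies in percent (NOT data from the thesis).
lr_acc = [70.0, 82.5, 65.0, 90.0, 58.0, 77.5]
rf_acc = [78.0, 85.0, 74.0, 93.0, 61.0, 88.0]

r = pearson_r(lr_acc, rf_acc)
ci_low, ci_high = fisher_ci(r, len(lr_acc))
t_stat, dof = pooled_t(rf_acc, lr_acc)

print(f"r = {r:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
print(f"t = {t_stat:.3f} on {dof} degrees of freedom")
```

Turning the t statistic into a p-value needs the t distribution, which the Python standard library does not provide. A statistics package would be required for that step.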