Evaluation of logistic regression and random forest classification based on prediction accuracy and metadata analysis
Model selection is an important part of classification. In this thesis we study the two classification models logistic regression and random forest. They are compared and evaluated based on prediction accuracy and metadata analysis. The models were trained on 25 diverse datasets. We calculated the p...
Main Author: | Wålinder, Andreas |
---|---|
Format: | Others |
Language: | English |
Published: | Linnéuniversitetet, Institutionen för matematik (MA), 2014 |
Subjects: | classification, logistic regression, random forest, metadata |
Online Access: | http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-35126 |
id |
ndltd-UPSALLA1-oai-DiVA.org-lnu-35126 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-UPSALLA1-oai-DiVA.org-lnu-35126; 2014-06-17T05:04:37Z; Evaluation of logistic regression and random forest classification based on prediction accuracy and metadata analysis; eng; Wålinder, Andreas; Linnéuniversitetet, Institutionen för matematik (MA); 2014; classification, logistic regression, random forest, metadata; Model selection is an important part of classification. In this thesis we study the two classification models logistic regression and random forest. They are compared and evaluated based on prediction accuracy and metadata analysis. The models were trained on 25 diverse datasets. We calculated the prediction accuracy of both models using RapidMiner. We also collected metadata for the datasets concerning number of observations, number of predictor variables and number of classes in the response variable. There is a correlation between performance of logistic regression and random forest with significant correlation of 0.60 and confidence interval [0.29 0.79]. The models appear to perform similarly across the datasets with performance more influenced by choice of dataset rather than model selection. Random forest with an average prediction accuracy of 81.66% performed better on these datasets than logistic regression with an average prediction accuracy of 73.07%. The difference is however not statistically significant with a p-value of 0.088 for Student's t-test. Multiple linear regression analysis reveals none of the analysed metadata have a significant linear relationship with logistic regression performance. The regression of logistic regression performance on metadata has a p-value of 0.66. We get similar results with random forest performance. The regression of random forest performance on metadata has a p-value of 0.89. None of the analysed metadata have a significant linear relationship with random forest performance. We conclude that the prediction accuracies of logistic regression and random forest are correlated. Random forest performed slightly better on the studied datasets but the difference is not statistically significant. The studied metadata does not appear to have a significant effect on prediction accuracy of either model.; Student thesis; info:eu-repo/semantics/bachelorThesis; text; http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-35126; application/pdf; info:eu-repo/semantics/openAccess |
collection |
NDLTD |
language |
English |
format |
Others |
sources |
NDLTD |
topic |
classification, logistic regression, random forest, metadata |
spellingShingle |
classification, logistic regression, random forest, metadata; Wålinder, Andreas; Evaluation of logistic regression and random forest classification based on prediction accuracy and metadata analysis |
description |
Model selection is an important part of classification. In this thesis we study the two classification models logistic regression and random forest. They are compared and evaluated based on prediction accuracy and metadata analysis. The models were trained on 25 diverse datasets. We calculated the prediction accuracy of both models using RapidMiner. We also collected metadata for the datasets concerning number of observations, number of predictor variables and number of classes in the response variable. There is a correlation between performance of logistic regression and random forest with significant correlation of 0.60 and confidence interval [0.29 0.79]. The models appear to perform similarly across the datasets with performance more influenced by choice of dataset rather than model selection. Random forest with an average prediction accuracy of 81.66% performed better on these datasets than logistic regression with an average prediction accuracy of 73.07%. The difference is however not statistically significant with a p-value of 0.088 for Student's t-test. Multiple linear regression analysis reveals none of the analysed metadata have a significant linear relationship with logistic regression performance. The regression of logistic regression performance on metadata has a p-value of 0.66. We get similar results with random forest performance. The regression of random forest performance on metadata has a p-value of 0.89. None of the analysed metadata have a significant linear relationship with random forest performance. We conclude that the prediction accuracies of logistic regression and random forest are correlated. Random forest performed slightly better on the studied datasets but the difference is not statistically significant. The studied metadata does not appear to have a significant effect on prediction accuracy of either model. |
author |
Wålinder, Andreas |
author_facet |
Wålinder, Andreas |
author_sort |
Wålinder, Andreas |
title |
Evaluation of logistic regression and random forest classification based on prediction accuracy and metadata analysis |
title_short |
Evaluation of logistic regression and random forest classification based on prediction accuracy and metadata analysis |
title_full |
Evaluation of logistic regression and random forest classification based on prediction accuracy and metadata analysis |
title_fullStr |
Evaluation of logistic regression and random forest classification based on prediction accuracy and metadata analysis |
title_full_unstemmed |
Evaluation of logistic regression and random forest classification based on prediction accuracy and metadata analysis |
title_sort |
evaluation of logistic regression and random forest classification based on prediction accuracy and metadata analysis |
publisher |
Linnéuniversitetet, Institutionen för matematik (MA) |
publishDate |
2014 |
url |
http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-35126 |
work_keys_str_mv |
AT walinderandreas evaluationoflogisticregressionandrandomforestclassificationbasedonpredictionaccuracyandmetadataanalysis |
_version_ |
1716670248159019008 |
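
The abstract reports a correlation of 0.60 between the two models' accuracies across 25 datasets, with confidence interval [0.29 0.79]. This record does not say how that interval was computed. As a rough illustration only, the sketch below applies one common approach, the Fisher z-transformation, to the rounded figures from the abstract. The function name `fisher_ci` and the 95% level are assumptions made here, not details taken from the thesis, so the result need not match the reported interval.

```python
import math
from statistics import NormalDist


def fisher_ci(r: float, n: int, level: float = 0.95) -> tuple[float, float]:
    """Approximate confidence interval for a Pearson correlation.

    Uses the Fisher z-transformation. This is an assumed method for
    illustration; the thesis may have used a different procedure.
    """
    if n <= 3:
        raise ValueError("need more than 3 observations")
    z = math.atanh(r)                  # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)        # approximate standard error of z
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)  # two-sided critical value
    lower = math.tanh(z - crit * se)   # back-transform the bounds to the r scale
    upper = math.tanh(z + crit * se)
    return lower, upper


if __name__ == "__main__":
    # r and n are the rounded values quoted in the abstract.
    low, high = fisher_ci(0.60, 25)
    print(f"{low:.2f} {high:.2f}")
```

With these rounded inputs the sketch gives an interval of roughly 0.27 to 0.80, close to but not identical with the reported one. A gap of that size is consistent with rounding of the correlation or with a different interval method.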