Comparison of multiple imputation methods for missing data : A simulation study
Even in well-designed and controlled studies, missing values are consistently present in research. It is well established that disregarding missingness by analyzing only complete cases reduces statistical power and biases parameter estimates. The existing traditional methods of imputing m...
Main Author: | Schelhaas, Sjoerd |
---|---|
Format: | Others |
Language: | English |
Published: | Umeå universitet, Statistik, 2021 |
Subjects: | Probability Theory and Statistics |
Online Access: | http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-187318 |
id |
ndltd-UPSALLA1-oai-DiVA.org-umu-187318 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-UPSALLA1-oai-DiVA.org-umu-187318; 2021-10-12T05:26:31Z; Comparison of multiple imputation methods for missing data : A simulation study; eng; Schelhaas, Sjoerd; Umeå universitet, Statistik; 2021; Probability Theory and Statistics; Sannolikhetsteori och statistik; Even in well-designed and controlled studies, missing values are consistently present in research. It is well established that disregarding missingness by analyzing only complete cases reduces statistical power and biases parameter estimates. The existing traditional methods of imputing missing data are incapable of accounting for a misleading representation of the data. Research shows that traditional methods such as single imputation often underestimate the variance. This problem can be avoided by imputing each missing value multiple times, which takes the uncertainty of the imputation into consideration. In this thesis, a simulation study is conducted to compare two multiple imputation models. The comparison is between a predefined linear stochastic regression model and a flexible neural network model without a predefined form, in which the validation MSE loss is used to account for the variance in the imputed values. In total, three simulated data sets are sampled from a multiple bivariate linear regression model in which some of the values in Y2 are missing at random (MAR) given the variable Y1. When the neural network is applied a total of 30 times to the data sets with 25, 50 and 75 percent missing values and the results of the regression analysis on the complete data are pooled, almost all confidence intervals for the intercept cover the expected value. The only exception occurs at 75 percent missingness. When multiple imputation by chained equations (MICE) is applied to the data sets, all confidence intervals cover the true intercept. When 25 percent of the data is missing, both models yield unbiased results.
Student thesis; info:eu-repo/semantics/bachelorThesis; text; http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-187318; application/pdf; info:eu-repo/semantics/openAccess |
collection |
NDLTD |
language |
English |
format |
Others |
sources |
NDLTD |
topic |
Probability Theory and Statistics Sannolikhetsteori och statistik |
description |
Even in well-designed and controlled studies, missing values are consistently present in research. It is well established that disregarding missingness by analyzing only complete cases reduces statistical power and biases parameter estimates. The existing traditional methods of imputing missing data are incapable of accounting for a misleading representation of the data. Research shows that traditional methods such as single imputation often underestimate the variance. This problem can be avoided by imputing each missing value multiple times, which takes the uncertainty of the imputation into consideration. In this thesis, a simulation study is conducted to compare two multiple imputation models. The comparison is between a predefined linear stochastic regression model and a flexible neural network model without a predefined form, in which the validation MSE loss is used to account for the variance in the imputed values. In total, three simulated data sets are sampled from a multiple bivariate linear regression model in which some of the values in Y2 are missing at random (MAR) given the variable Y1. When the neural network is applied a total of 30 times to the data sets with 25, 50 and 75 percent missing values and the results of the regression analysis on the complete data are pooled, almost all confidence intervals for the intercept cover the expected value. The only exception occurs at 75 percent missingness. When multiple imputation by chained equations (MICE) is applied to the data sets, all confidence intervals cover the true intercept. When 25 percent of the data is missing, both models yield unbiased results. |
author |
Schelhaas, Sjoerd |
title |
Comparison of multiple imputation methods for missing data : A simulation study |
publisher |
Umeå universitet, Statistik |
publishDate |
2021 |
url |
http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-187318 |
work_keys_str_mv |
AT schelhaassjoerd comparisonofmultipleimputationmethodsformissingdataasimulationstudy |
_version_ |
1719489383533379584 |
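
The procedure summarized in the description field (simulate bivariate data, delete part of Y2 depending on Y1, impute several times, pool the intercept) can be illustrated with a minimal sketch. This is not the thesis code and reproduces neither its neural network nor MICE: the true parameters (intercept 1.0, slope 2.0), the sample size, the logistic weighting used to create MAR missingness, the bootstrap step that stands in for parameter uncertainty, and the function names `ols`, `simulate_mar` and `impute_and_pool` are all assumptions made for illustration. The imputation is a plain stochastic regression of Y2 on Y1, and the pooling follows Rubin's rules.

```python
import math
import random
import statistics


def ols(x, y):
    """Fit y = a + b*x by least squares; return (a, b, var_a, sigma2)."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    sigma2 = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    var_a = sigma2 * (1.0 / n + mx ** 2 / sxx)  # sampling variance of the intercept
    return a, b, var_a, sigma2


def simulate_mar(n, frac_missing, rng):
    """Draw (Y1, Y2) and delete a fixed share of Y2, more often where Y1 is large."""
    y1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y2 = [1.0 + 2.0 * v + rng.gauss(0.0, 1.0) for v in y1]  # assumed true intercept: 1.0
    # Weighted sampling without replacement (Efraimidis-Spirakis keys).
    # The weights depend only on the fully observed Y1, so Y2 is MAR given Y1.
    weights = [1.0 / (1.0 + math.exp(-v)) for v in y1]
    keys = [rng.random() ** (1.0 / w) for w in weights]
    k = round(frac_missing * n)
    missing = set(sorted(range(n), key=keys.__getitem__, reverse=True)[:k])
    return y1, [None if i in missing else v for i, v in enumerate(y2)]


def impute_and_pool(y1, y2, m, rng):
    """Impute Y2 m times by stochastic regression; pool the intercept (Rubin's rules)."""
    observed = [i for i, v in enumerate(y2) if v is not None]
    estimates, variances = [], []
    for _ in range(m):
        # Assumption: refitting on a bootstrap sample of the observed cases
        # approximates the uncertainty in the imputation model's parameters.
        boot = [rng.choice(observed) for _ in observed]
        a, b, _, s2 = ols([y1[i] for i in boot], [y2[i] for i in boot])
        sd = math.sqrt(s2)
        completed = [
            v if v is not None else a + b * x + rng.gauss(0.0, sd)
            for x, v in zip(y1, y2)
        ]
        q, _, u, _ = ols(y1, completed)  # analysis model: Y2 regressed on Y1
        estimates.append(q)
        variances.append(u)
    q_bar = statistics.fmean(estimates)       # pooled point estimate
    u_bar = statistics.fmean(variances)       # within-imputation variance
    b_var = statistics.variance(estimates)    # between-imputation variance
    total_var = u_bar + (1.0 + 1.0 / m) * b_var
    return q_bar, total_var


if __name__ == "__main__":
    rng = random.Random(2021)
    for frac in (0.25, 0.50, 0.75):
        y1, y2 = simulate_mar(400, frac, rng)
        q_bar, total_var = impute_and_pool(y1, y2, 30, rng)
        print(f"{frac:.0%} missing: intercept {q_bar:.3f}, std. error {math.sqrt(total_var):.3f}")
```

The total variance combines the average within-imputation variance with the between-imputation variance inflated by (1 + 1/m), which is how multiple imputation avoids the variance underestimation that the abstract attributes to single imputation.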