Comparison of multiple imputation methods for missing data : A simulation study

Even in well-designed and controlled studies, missing values are consistently present in research. It is well established that disregarding missingness by analyzing complete cases only reduces statistical power and biases parameter estimates. The existing traditional methods of imputing m...


Bibliographic Details
Main Author:	Schelhaas, Sjoerd
Format:	Others
Language:	English
Published:	Umeå universitet, Statistik 2021
Subjects:	Probability Theory and Statistics Sannolikhetsteori och statistik
Online Access:	http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-187318

id	ndltd-UPSALLA1-oai-DiVA.org-umu-187318
record_format	oai_dc
spelling	ndltd-UPSALLA1-oai-DiVA.org-umu-187318 2021-10-12T05:26:31Z Comparison of multiple imputation methods for missing data : A simulation study eng Schelhaas, Sjoerd Umeå universitet, Statistik 2021 Probability Theory and Statistics Sannolikhetsteori och statistik Student thesis info:eu-repo/semantics/bachelorThesis text http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-187318 application/pdf info:eu-repo/semantics/openAccess
collection	NDLTD
language	English
format	Others
sources	NDLTD
topic	Probability Theory and Statistics Sannolikhetsteori och statistik
spellingShingle	Probability Theory and Statistics Sannolikhetsteori och statistik Schelhaas, Sjoerd Comparison of multiple imputation methods for missing data : A simulation study
description	Even in well-designed and controlled studies, missing values are consistently present in research. It is well established that disregarding missingness by analyzing complete cases only reduces statistical power and biases parameter estimates. The traditional methods of imputing missing data cannot account for the resulting misleading representation of the data. Research shows that traditional methods such as single imputation often underestimate the variance. This problem can be bypassed by imputing each missing value multiple times, which takes the uncertainty about the imputed values into consideration. In this thesis a simulation study is conducted to compare two multiple imputation models: a pre-specified linear stochastic regression model and a flexible neural network model with no pre-specified form, in which the validation MSE loss is used to account for variance in the imputed values. In total, three data sets are simulated from a multiple bivariate linear regression model in which some of the values in Y2 are missing at random (MAR) given the Y1 variable. When the neural network is applied 30 times to each of the data sets with 25, 50 and 75 percent missing values and the results of the regression analysis on the completed data are pooled, almost all confidence intervals for the intercept cover the expected value; the only exception is the case of 75 percent missingness. When multiple imputation by chained equations is applied to the data sets, the true intercept is covered by all confidence intervals. When 25 percent of the data is missing, both models yield unbiased results.
author	Schelhaas, Sjoerd
author_facet	Schelhaas, Sjoerd
author_sort	Schelhaas, Sjoerd
title	Comparison of multiple imputation methods for missing data : A simulation study
title_short	Comparison of multiple imputation methods for missing data : A simulation study
title_full	Comparison of multiple imputation methods for missing data : A simulation study
title_fullStr	Comparison of multiple imputation methods for missing data : A simulation study
title_full_unstemmed	Comparison of multiple imputation methods for missing data : A simulation study
title_sort	comparison of multiple imputation methods for missing data : a simulation study
publisher	Umeå universitet, Statistik
publishDate	2021
url	http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-187318
work_keys_str_mv	AT schelhaassjoerd comparisonofmultipleimputationmethodsformissingdataasimulationstudy
_version_	1719489383533379584
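
The description above mentions imputing each missing value multiple times and pooling the regression results. A minimal sketch of that general idea follows; it is not the thesis's code. The sample size, true coefficients, logistic missingness mechanism, and the bootstrap-based stochastic regression imputer are all assumptions chosen for illustration, and the function names are hypothetical. Pooling uses Rubin's rules.

```python
import numpy as np


def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors (Rubin's rules)."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    q_bar = q.mean()            # pooled point estimate
    within = u.mean()           # average within-imputation variance
    between = q.var(ddof=1)     # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, total


def ols(x, y):
    """Simple linear regression: coefficients, their covariance, residual variance."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - 2)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta, cov, sigma2


def simulate_and_impute(n=500, m=30, seed=0):
    """Simulate Y2 missing at random given Y1, impute m times, pool the intercept."""
    rng = np.random.default_rng(seed)
    y1 = rng.normal(size=n)
    y2 = 1.0 + 2.0 * y1 + rng.normal(size=n)    # assumed true intercept: 1.0

    # MAR: the probability that Y2 is missing depends only on the observed Y1.
    miss = rng.random(n) < 1.0 / (1.0 + np.exp(-y1))
    observed = np.flatnonzero(~miss)

    estimates, variances = [], []
    for _ in range(m):
        # Bootstrap the observed cases so parameter uncertainty is propagated.
        idx = rng.choice(observed, size=observed.size, replace=True)
        beta, _, sigma2 = ols(y1[idx], y2[idx])

        # Stochastic regression imputation: prediction plus random noise.
        y2_imp = y2.copy()
        noise = rng.normal(scale=np.sqrt(sigma2), size=miss.sum())
        y2_imp[miss] = beta[0] + beta[1] * y1[miss] + noise

        # Analysis model on the completed data: regress Y2 on Y1.
        b, cov, _ = ols(y1, y2_imp)
        estimates.append(b[0])
        variances.append(cov[0, 0])

    return pool_rubin(estimates, variances)


if __name__ == "__main__":
    intercept, total_var = simulate_and_impute()
    print("pooled intercept:", intercept, "pooled standard error:", np.sqrt(total_var))
```

The between-imputation term is what single imputation leaves out, which is why single imputation tends to understate the variance.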

Similar Items