Comparison of multiple imputation methods for missing data : A simulation study

Even in a well-designed and controlled study, missing values are consistently present in research. It is well established that when missingness is disregarded by analyzing complete cases only, statistical power is reduced and parameter estimates may be biased. The existing traditional methods of imputing m...

Bibliographic Details
Main Author: Schelhaas, Sjoerd
Format: Others
Language: English
Published: Umeå universitet, Statistik 2021
Subjects: Probability Theory and Statistics
Online Access: http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-187318
id ndltd-UPSALLA1-oai-DiVA.org-umu-187318
record_format oai_dc
spelling ndltd-UPSALLA1-oai-DiVA.org-umu-187318 | 2021-10-12T05:26:31Z | Comparison of multiple imputation methods for missing data : A simulation study | eng | Schelhaas, Sjoerd | Umeå universitet, Statistik | 2021 | Probability Theory and Statistics | Sannolikhetsteori och statistik | Student thesis | info:eu-repo/semantics/bachelorThesis | text | http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-187318 | application/pdf | info:eu-repo/semantics/openAccess
collection NDLTD
language English
format Others
sources NDLTD
topic Probability Theory and Statistics
Sannolikhetsteori och statistik
description Even in a well-designed and controlled study, missing values are consistently present in research. It is well established that when missingness is disregarded by analyzing complete cases only, statistical power is reduced and parameter estimates may be biased. Traditional methods of imputing missing data are unable to account for a misleading representation of the data. Research shows that traditional methods such as single imputation often underestimate the variance. This problem can be avoided by imputing each missing value multiple times, which takes the uncertainty of the imputation into account. In this thesis, a simulation study is conducted to compare two multiple imputation models: a specified linear stochastic regression model and an unspecified, flexible neural network model in which the validation MSE loss is used to account for variance in the imputed values. In total, three simulated data sets are sampled from a multiple bivariate linear regression model in which some of the values in Y2 are missing at random (MAR) given the Y1 variable. When a neural network is applied a total of 30 times to the data sets with 25, 50 and 75 percent missing values and the results of the regression analyses on the completed data are pooled, almost all confidence intervals for the intercept cover the expected value. The only exception is the case of 75 percent missingness. When multiple imputation by chained equations (MICE) is applied to the data sets, the true intercept is covered by all confidence intervals. When 25 percent of the data is missing, both models yield unbiased results.
author Schelhaas, Sjoerd
title Comparison of multiple imputation methods for missing data : A simulation study
publisher Umeå universitet, Statistik
publishDate 2021
url http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-187318
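
The workflow summarized in the description (values in Y2 missing at random given Y1, repeated stochastic regression imputation, pooled regression results for the intercept) can be illustrated with a minimal sketch. This is not the thesis's code. The sample size, the true coefficients, the logistic missingness mechanism, the seed, and the helper names `rubin_pool`, `ols_intercept` and `simulate_and_impute` are all assumptions made for illustration. The sketch also simplifies in two ways: it does not redraw the regression parameters for each imputation, as a fully proper multiple imputation procedure would, and it uses a normal approximation for the interval instead of the t reference distribution.

```python
import numpy as np


def rubin_pool(estimates, variances):
    """Pool m point estimates and their squared standard errors with Rubin's rules."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    q_bar = q.mean()                 # pooled point estimate
    within = u.mean()                # average within-imputation variance
    between = q.var(ddof=1)          # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, total


def ols_intercept(x, y):
    """Return the OLS intercept of y ~ 1 + x and its estimated sampling variance."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - 2)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[0], cov[0, 0]


def simulate_and_impute(n=500, m=30, beta0=1.0, beta1=2.0, seed=0):
    """Simulate (Y1, Y2), delete Y2 under MAR, impute m times, pool the intercept."""
    rng = np.random.default_rng(seed)
    y1 = rng.normal(size=n)
    y2 = beta0 + beta1 * y1 + rng.normal(size=n)

    # MAR (assumed mechanism): the chance that Y2 is missing depends only on Y1.
    p_miss = 1.0 / (1.0 + np.exp(-y1))
    missing = rng.random(n) < p_miss

    # Fit the imputation model on the complete cases.
    X_obs = np.column_stack([np.ones(np.sum(~missing)), y1[~missing]])
    coef, *_ = np.linalg.lstsq(X_obs, y2[~missing], rcond=None)
    resid = y2[~missing] - X_obs @ coef
    sigma = np.sqrt(resid @ resid / (len(resid) - 2))

    estimates, variances = [], []
    for _ in range(m):
        completed = y2.copy()
        # Stochastic regression imputation: prediction plus random residual noise.
        noise = rng.normal(scale=sigma, size=missing.sum())
        completed[missing] = coef[0] + coef[1] * y1[missing] + noise
        est, var = ols_intercept(y1, completed)
        estimates.append(est)
        variances.append(var)
    return rubin_pool(estimates, variances)


q_bar, total_var = simulate_and_impute()
half_width = 1.96 * np.sqrt(total_var)  # normal approximation (assumption)
print(f"pooled intercept: {q_bar:.3f} +/- {half_width:.3f}")
```

The total variance adds the between-imputation spread to the average within-imputation variance, which is how multiple imputation avoids the variance underestimation attributed to single imputation in the description.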