How handling missing data may impact conclusions: A comparison of six different imputation methods for categorical questionnaire data

Objectives: Missing data is a recurrent issue in many fields of medical research, particularly in questionnaires. The aim of this article is to describe and compare six conceptually different multiple imputation methods, alongside the commonly used complete case analysis, and to explore whether the...

Bibliographic Details
Main Authors:	Marianne Riksheim Stavseth, Thomas Clausen, Jo Røislien
Format:	Article
Language:	English
Published:	SAGE Publishing 2019-01-01
Series:	SAGE Open Medicine
Online Access:	https://doi.org/10.1177/2050312118822912

id	doaj-3c5221fcc6cb490c90ce934c137944af
record_format	Article
spelling	doaj-3c5221fcc6cb490c90ce934c137944af | 2020-11-25T03:40:31Z | eng | SAGE Publishing | SAGE Open Medicine | 2050-3121 | 2019-01-01 | 7 | 10.1177/2050312118822912 | How handling missing data may impact conclusions: A comparison of six different imputation methods for categorical questionnaire data | Marianne Riksheim Stavseth (0), Thomas Clausen (1), Jo Røislien (2) | (0) Norwegian Centre for Addiction Research, Institute of Clinical Medicine, University of Oslo, Oslo, Norway; (1) Norwegian Centre for Addiction Research, Institute of Clinical Medicine, University of Oslo, Oslo, Norway; (2) Faculty of Health Sciences, University of Stavanger, Stavanger, Norway | Objectives: Missing data is a recurrent issue in many fields of medical research, particularly in questionnaires. The aim of this article is to describe and compare six conceptually different multiple imputation methods, alongside the commonly used complete case analysis, and to explore whether the choice of methodology for handling missing data might impact clinical conclusions drawn from a regression model when data are categorical. Methods: In addition to the commonly used complete case analysis, we tested the following six imputation methods: multiple imputation using expectation–maximization with bootstrapping, multiple imputation using multiple correspondence analysis, multiple imputation using latent class analysis, multiple hot deck imputation, and multivariate imputation by chained equations with two different model specifications: logistic regression and random forests. The methods were tested on real data from a questionnaire-based study in the Norwegian opioid maintenance treatment programme. Results: All methods performed relatively well when the sample size was large (n = 1000). For a smaller sample size (n = 200), the regression estimates depended heavily on the level of missing data. When the amount of missing data was ⩾20%, complete case analysis, hot deck and random forests in particular had biased estimates with too low coverage. Multiple imputation using multiple correspondence analysis had the best overall performance. Conclusion: The choice of methodology for handling missing data has a significant impact on the clinical interpretation of the accompanying statistical analyses. With missing data, the choice of whether to impute or not, and the choice of imputation method, can influence the clinical conclusions drawn from a regression model and should therefore be given sufficient consideration. | https://doi.org/10.1177/2050312118822912
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Marianne Riksheim Stavseth Thomas Clausen Jo Røislien
author_sort	Marianne Riksheim Stavseth
title	How handling missing data may impact conclusions: A comparison of six different imputation methods for categorical questionnaire data
publisher	SAGE Publishing
series	SAGE Open Medicine
issn	2050-3121
publishDate	2019-01-01
description	Objectives: Missing data is a recurrent issue in many fields of medical research, particularly in questionnaires. The aim of this article is to describe and compare six conceptually different multiple imputation methods, alongside the commonly used complete case analysis, and to explore whether the choice of methodology for handling missing data might impact clinical conclusions drawn from a regression model when data are categorical. Methods: In addition to the commonly used complete case analysis, we tested the following six imputation methods: multiple imputation using expectation–maximization with bootstrapping, multiple imputation using multiple correspondence analysis, multiple imputation using latent class analysis, multiple hot deck imputation, and multivariate imputation by chained equations with two different model specifications: logistic regression and random forests. The methods were tested on real data from a questionnaire-based study in the Norwegian opioid maintenance treatment programme. Results: All methods performed relatively well when the sample size was large (n = 1000). For a smaller sample size (n = 200), the regression estimates depended heavily on the level of missing data. When the amount of missing data was ⩾20%, complete case analysis, hot deck and random forests in particular had biased estimates with too low coverage. Multiple imputation using multiple correspondence analysis had the best overall performance. Conclusion: The choice of methodology for handling missing data has a significant impact on the clinical interpretation of the accompanying statistical analyses. With missing data, the choice of whether to impute or not, and the choice of imputation method, can influence the clinical conclusions drawn from a regression model and should therefore be given sufficient consideration.
url	https://doi.org/10.1177/2050312118822912
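
Illustrative note: the abstract contrasts complete case analysis with multiple hot deck imputation for categorical questionnaire data. The sketch below is not the authors' implementation; it is a minimal illustration under stated assumptions. The toy `survey` data, the function names, and the choice of a simple random hot deck (donors drawn from all observed answers to the same question, without matching on other answers) are all hypothetical. It shows how complete case analysis discards respondents, while repeated hot deck draws keep every respondent and yield several completed data sets whose estimates can be averaged.

```python
import random


def complete_case(rows):
    """Complete case analysis: keep only respondents with no missing answers."""
    return [r for r in rows if None not in r]


def hot_deck_impute(rows, rng):
    """Random hot deck: replace each missing answer with the observed answer
    of a randomly chosen donor for the same question.
    Assumes every question has at least one observed answer."""
    n_items = len(rows[0])
    donors = [[r[j] for r in rows if r[j] is not None] for j in range(n_items)]
    return [
        tuple(v if v is not None else rng.choice(donors[j]) for j, v in enumerate(r))
        for r in rows
    ]


def multiple_hot_deck(rows, m, seed=0):
    """Create m completed data sets by repeating the hot deck draw."""
    rng = random.Random(seed)
    return [hot_deck_impute(rows, rng) for _ in range(m)]


def proportion(rows, item, category):
    """Share of respondents answering `category` on question `item`."""
    return sum(r[item] == category for r in rows) / len(rows)


# Hypothetical toy questionnaire: two categorical items, None marks a missing answer.
survey = [
    ("yes", "often"),
    ("no", None),
    ("yes", "never"),
    (None, "often"),
    ("no", "never"),
    ("yes", None),
]

cc = complete_case(survey)
imputed = multiple_hot_deck(survey, m=5)
# Pool the point estimate by averaging over the completed data sets.
pooled = sum(proportion(d, 0, "yes") for d in imputed) / len(imputed)

print(len(cc), "complete cases out of", len(survey))
print("P(yes), complete cases:", round(proportion(cc, 0, "yes"), 3))
print("P(yes), pooled over imputations:", round(pooled, 3))
```

In this toy example half of the respondents are lost to complete case analysis, which is the situation in which the abstract reports biased estimates for small samples. A full multiple imputation analysis would also pool the variances across the completed data sets, which this sketch omits.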

Similar Items