Generating Synthetic Missing Data: A Review by Missing Mechanism
The performance evaluation of imputation algorithms often involves the generation of missing values. Missing values can be inserted in only one feature (univariate configuration) or in several features (multivariate configuration) at different percentages (missing rates) and according to distinct mi...
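Main Authors: | Miriam Seoane Santos, Ricardo Cardoso Pereira, Adriana Fonseca Costa, Jastin Pompeu Soares, Joao Santos, Pedro Henriques Abreu |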
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2019-01-01 |
Series: | IEEE Access |
Subjects: | Data preprocessing; missing data; missing data generation; missing data mechanisms |
Online Access: | https://ieeexplore.ieee.org/document/8605316/ |
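The abstract of this record distinguishes three missing mechanisms (missing completely at random, missing at random, and missing not at random), applied at a chosen missing rate to one feature or to several. The sketch below illustrates that distinction for the univariate configuration only. It is not taken from the article: the function names (`mcar_mask`, `lowest_values_mask`, `ampute`) and the rule "remove the rows with the lowest values" are assumptions chosen for brevity, and the reviewed literature may use quite different procedures.

```python
import numpy as np


def mcar_mask(n_rows, missing_rate, rng):
    """MCAR: rows are chosen uniformly at random, independent of the data."""
    n_missing = int(round(missing_rate * n_rows))
    mask = np.zeros(n_rows, dtype=bool)
    mask[rng.choice(n_rows, size=n_missing, replace=False)] = True
    return mask


def lowest_values_mask(driver, missing_rate):
    """Select the rows holding the lowest values of `driver` (illustrative rule)."""
    n_missing = int(round(missing_rate * len(driver)))
    mask = np.zeros(len(driver), dtype=bool)
    mask[np.argsort(driver, kind="stable")[:n_missing]] = True
    return mask


def ampute(X, target, missing_rate, mechanism, observed=None, seed=0):
    """Return a copy of X with NaN inserted in column `target`.

    MCAR: missingness ignores the data.
    MAR:  missingness depends on another, fully observed column (`observed`).
    MNAR: missingness depends on the values of the target column itself.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    if mechanism == "MCAR":
        mask = mcar_mask(X.shape[0], missing_rate, rng)
    elif mechanism == "MAR":
        if observed is None:
            raise ValueError("MAR needs the index of an observed column")
        mask = lowest_values_mask(X[:, observed], missing_rate)
    elif mechanism == "MNAR":
        mask = lowest_values_mask(X[:, target], missing_rate)
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    X_missing = X.copy()
    X_missing[mask, target] = np.nan
    return X_missing


if __name__ == "__main__":
    # Hypothetical toy data: 100 rows, 3 quantitative features.
    X = np.random.default_rng(42).normal(size=(100, 3))
    for mechanism in ("MCAR", "MAR", "MNAR"):
        X_missing = ampute(X, target=0, missing_rate=0.2,
                           mechanism=mechanism, observed=1)
        print(mechanism, int(np.isnan(X_missing[:, 0]).sum()), "missing values")
```

A multivariate configuration could be approximated by calling `ampute` once per target column. The sketch covers quantitative features only. It does not address qualitative features, which the abstract identifies as the most challenging case for the missing at random and missing not at random mechanisms.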
id |
doaj-771d91a8bf8d49af8faeed66e034b03f |
---|---|
record_format |
Article |
spelling |
doaj-771d91a8bf8d49af8faeed66e034b03f; 2021-03-29T22:02:40Z; eng; IEEE; IEEE Access; 2169-3536; 2019-01-01; 7; 11651; 11667; 10.1109/ACCESS.2019.2891360; 8605316; Generating Synthetic Missing Data: A Review by Missing Mechanism; Miriam Seoane Santos (0; https://orcid.org/0000-0002-5912-963X); Ricardo Cardoso Pereira (1); Adriana Fonseca Costa (2); Jastin Pompeu Soares (3); Joao Santos (4); Pedro Henriques Abreu (5); Department of Informatics Engineering, Centre for Informatics and Systems, University of Coimbra, Coimbra, Portugal; Department of Informatics Engineering, Centre for Informatics and Systems, University of Coimbra, Coimbra, Portugal; Department of Informatics Engineering, Centre for Informatics and Systems, University of Coimbra, Coimbra, Portugal; Department of Informatics Engineering, Centre for Informatics and Systems, University of Coimbra, Coimbra, Portugal; Medical Physics, Radiobiology and Radiation Protection Group, IPO Porto Research Center (CI-IPOP), Porto, Portugal; Department of Informatics Engineering, Centre for Informatics and Systems, University of Coimbra, Coimbra, Portugal; The performance evaluation of imputation algorithms often involves the generation of missing values. Missing values can be inserted in only one feature (univariate configuration) or in several features (multivariate configuration) at different percentages (missing rates) and according to distinct missing mechanisms, namely, missing completely at random, missing at random, and missing not at random. Since the missing data generation process defines the basis for the imputation experiments (configuration, missing rate, and missing mechanism), it is essential that it is appropriately applied; otherwise, conclusions derived from ill-defined setups may be invalid. The goal of this paper is to review the different approaches to synthetic missing data generation found in the literature and discuss their practical details, elaborating on their strengths and weaknesses. Our analysis revealed that creating missing at random and missing not at random scenarios in datasets comprising qualitative features is the most challenging issue in the related work and, therefore, should be the focus of future work in the field.; https://ieeexplore.ieee.org/document/8605316/; Data preprocessing; missing data; missing data generation; missing data mechanisms |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Miriam Seoane Santos Ricardo Cardoso Pereira Adriana Fonseca Costa Jastin Pompeu Soares Joao Santos Pedro Henriques Abreu |
author_sort |
Miriam Seoane Santos |
title |
Generating Synthetic Missing Data: A Review by Missing Mechanism |
publisher |
IEEE |
series |
IEEE Access |
issn |
2169-3536 |
publishDate |
2019-01-01 |
description |
The performance evaluation of imputation algorithms often involves the generation of missing values. Missing values can be inserted in only one feature (univariate configuration) or in several features (multivariate configuration) at different percentages (missing rates) and according to distinct missing mechanisms, namely, missing completely at random, missing at random, and missing not at random. Since the missing data generation process defines the basis for the imputation experiments (configuration, missing rate, and missing mechanism), it is essential that it is appropriately applied; otherwise, conclusions derived from ill-defined setups may be invalid. The goal of this paper is to review the different approaches to synthetic missing data generation found in the literature and discuss their practical details, elaborating on their strengths and weaknesses. Our analysis revealed that creating missing at random and missing not at random scenarios in datasets comprising qualitative features is the most challenging issue in the related work and, therefore, should be the focus of future work in the field. |
topic |
Data preprocessing missing data missing data generation missing data mechanisms |
url |
https://ieeexplore.ieee.org/document/8605316/ |