Generating Synthetic Missing Data: A Review by Missing Mechanism

The performance evaluation of imputation algorithms often involves the generation of missing values. Missing values can be inserted in only one feature (univariate configuration) or in several features (multivariate configuration) at different percentages (missing rates) and according to distinct mi...


Bibliographic Details
Main Authors: Miriam Seoane Santos, Ricardo Cardoso Pereira, Adriana Fonseca Costa, Jastin Pompeu Soares, Joao Santos, Pedro Henriques Abreu
Format: Article
Language: English
Published: IEEE 2019-01-01
Series: IEEE Access
Subjects: Data preprocessing; missing data; missing data generation; missing data mechanisms
Online Access: https://ieeexplore.ieee.org/document/8605316/
id doaj-771d91a8bf8d49af8faeed66e034b03f
record_format Article
spelling doaj-771d91a8bf8d49af8faeed66e034b03f; 2021-03-29T22:02:40Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2019-01-01; vol. 7; pp. 11651-11667; DOI 10.1109/ACCESS.2019.2891360; 8605316
Generating Synthetic Missing Data: A Review by Missing Mechanism
Miriam Seoane Santos (0; https://orcid.org/0000-0002-5912-963X), Ricardo Cardoso Pereira (1), Adriana Fonseca Costa (2), Jastin Pompeu Soares (3), Joao Santos (4), Pedro Henriques Abreu (5)
0, 1, 2, 3, 5: Department of Informatics Engineering, Centre for Informatics and Systems, University of Coimbra, Coimbra, Portugal
4: Medical Physics, Radiobiology and Radiation Protection Group, IPO Porto Research Center (CI-IPOP), Porto, Portugal
The performance evaluation of imputation algorithms often involves the generation of missing values. Missing values can be inserted in only one feature (univariate configuration) or in several features (multivariate configuration) at different percentages (missing rates) and according to distinct missing mechanisms, namely, missing completely at random, missing at random, and missing not at random. Since the missing data generation process defines the basis for the imputation experiments (configuration, missing rate, and missing mechanism), it is essential that it is appropriately applied; otherwise, conclusions derived from ill-defined setups may be invalid. The goal of this paper is to review the different approaches to synthetic missing data generation found in the literature and discuss their practical details, elaborating on their strengths and weaknesses. Our analysis revealed that creating missing at random and missing not at random scenarios in datasets comprising qualitative features is the most challenging issue in the related work and, therefore, should be the focus of future work in the field.
https://ieeexplore.ieee.org/document/8605316/
Data preprocessing; missing data; missing data generation; missing data mechanisms
collection DOAJ
language English
format Article
sources DOAJ
author Miriam Seoane Santos
Ricardo Cardoso Pereira
Adriana Fonseca Costa
Jastin Pompeu Soares
Joao Santos
Pedro Henriques Abreu
title Generating Synthetic Missing Data: A Review by Missing Mechanism
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2019-01-01
description The performance evaluation of imputation algorithms often involves the generation of missing values. Missing values can be inserted in only one feature (univariate configuration) or in several features (multivariate configuration) at different percentages (missing rates) and according to distinct missing mechanisms, namely, missing completely at random, missing at random, and missing not at random. Since the missing data generation process defines the basis for the imputation experiments (configuration, missing rate, and missing mechanism), it is essential that it is appropriately applied; otherwise, conclusions derived from ill-defined setups may be invalid. The goal of this paper is to review the different approaches to synthetic missing data generation found in the literature and discuss their practical details, elaborating on their strengths and weaknesses. Our analysis revealed that creating missing at random and missing not at random scenarios in datasets comprising qualitative features is the most challenging issue in the related work and, therefore, should be the focus of future work in the field.
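The description above names three missing mechanisms and a univariate configuration with a chosen missing rate. The following is a minimal sketch of how such a setup could be generated, not the procedure of the reviewed paper: the function name `amputate_univariate`, its parameters, and the rank-based rule (delete the target where a feature takes its lowest values) are assumptions made for illustration, and the rule is only one of many possible variants.

```python
import numpy as np


def amputate_univariate(X, target, rate, mechanism="MCAR", driver=None, seed=0):
    """Return a float copy of X with about `rate` of column `target` set to NaN.

    Hypothetical helper for illustration only (not from the reviewed paper).
    """
    rng = np.random.default_rng(seed)
    X_miss = np.array(X, dtype=float)  # copy, so the input stays complete
    n = X_miss.shape[0]
    n_missing = int(round(rate * n))

    if mechanism == "MCAR":
        # Missing completely at random: rows are chosen uniformly,
        # independent of any observed or unobserved value.
        rows = rng.choice(n, size=n_missing, replace=False)
    elif mechanism == "MAR":
        # Missing at random: missingness depends on another, fully observed
        # feature. Assumed rule: rows with the lowest `driver` values.
        if driver is None:
            raise ValueError("MAR needs a `driver` column index")
        rows = np.argsort(X_miss[:, driver], kind="stable")[:n_missing]
    elif mechanism == "MNAR":
        # Missing not at random: missingness depends on the target's own
        # values. Assumed rule: rows with the lowest `target` values.
        rows = np.argsort(X_miss[:, target], kind="stable")[:n_missing]
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")

    X_miss[rows, target] = np.nan
    return X_miss


if __name__ == "__main__":
    X = np.arange(20, dtype=float).reshape(10, 2)
    for mech in ("MCAR", "MAR", "MNAR"):
        out = amputate_univariate(X, target=0, rate=0.3, mechanism=mech, driver=1)
        print(mech, int(np.isnan(out[:, 0]).sum()), "missing values in feature 0")
```

A multivariate configuration could be approximated by calling the same helper once per target feature, although how the missing rate is then distributed across features is a separate design decision that this sketch does not address.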
topic Data preprocessing
missing data
missing data generation
missing data mechanisms
url https://ieeexplore.ieee.org/document/8605316/
_version_ 1724192285593174016