Evaluation of different approaches for missing data imputation on features associated to genomic data

Background: Missing data is a common issue in different fields, such as electronics, image processing, medical records and genomics. It can limit or even bias subsequent analysis. The data collection process can lead to different distributions, frequencies, and structures of missing data points. Th...

Bibliographic Details
Main Authors:	Lopez-Bello, F. (Author), Naya, H. (Author), Petrazzini, B.O. (Author), Spangenberg, L. (Author), Vazquez, G. (Author)
Format:	Article
Language:	English
Published:	BioMed Central Ltd 2021
Subjects:	article; genomics; human; imputation; Machine learning; missing data; pathogenic variants; profit; random forest; screening test
Online Access:	View Fulltext in Publisher


LEADER	02401nam a2200313Ia 4500
001	10.1186-s13040-021-00274-7
008	220427s2021 CNT 000 0 und d
020			\|a 17560381 (ISSN)
245	1	0	\|a Evaluation of different approaches for missing data imputation on features associated to genomic data
260		0	\|b BioMed Central Ltd \|c 2021
856			\|z View Fulltext in Publisher \|u https://doi.org/10.1186/s13040-021-00274-7
520	3		\|a Background: Missing data is a common issue in different fields, such as electronics, image processing, medical records and genomics. It can limit or even bias subsequent analysis. The data collection process can lead to different distributions, frequencies, and structures of missing data points. Missing data can be classified into four categories: Structurally Missing Data (SMD), Missing Completely At Random (MCAR), Missing At Random (MAR) and Missing Not At Random (MNAR). For the latter three, and in the context of genomic data (especially non-coding data), we discuss six imputation approaches using 31,245 variants collected from ClinVar and annotated with 13 genome-wide features. Results: Random Forest and kNN algorithms showed the best performance in the evaluated dataset. Additionally, some features show robust imputation regardless of the algorithm (e.g. conservation scores phyloP7 and phyloP20), while other features show poor imputation across algorithms (e.g. PhasCons). We also developed an R package that helps to test which imputation method is best for a particular data set. Conclusions: We found that Random Forest and kNN are the best imputation methods for genomics data, including non-coding variants. Since Random Forest is computationally more demanding, kNN remains a more realistic approach. Future work on variant prioritization through genomic screening tests could profit greatly from this methodology. © 2021, The Author(s).
650	0	4	\|a article
650	0	4	\|a genomics
650	0	4	\|a human
650	0	4	\|a imputation
650	0	4	\|a Machine learning
650	0	4	\|a missing data
650	0	4	\|a pathogenic variants
650	0	4	\|a profit
650	0	4	\|a random forest
650	0	4	\|a screening test
700	1		\|a Lopez-Bello, F. \|e author
700	1		\|a Naya, H. \|e author
700	1		\|a Petrazzini, B.O. \|e author
700	1		\|a Spangenberg, L. \|e author
700	1		\|a Vazquez, G. \|e author
773			\|t BioData Mining

Similar Items