RIDDLE: Race and ethnicity Imputation from Disease history with Deep LEarning.

Anonymized electronic medical records are an increasingly popular source of research data. However, these datasets often lack race and ethnicity information. This creates problems for researchers modeling human disease, as race and ethnicity are powerful confounders for many health exposures and tre...

Bibliographic Details
Main Authors: Ji-Sung Kim, Xin Gao, Andrey Rzhetsky
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2018-04-01
Series: PLoS Computational Biology
Online Access: http://europepmc.org/articles/PMC5940243?pdf=render
id doaj-26cee1abe029436ebfc395331d19c120
record_format Article
spelling doaj-26cee1abe029436ebfc395331d19c120; 2020-11-25T01:53:40Z; eng; Public Library of Science (PLoS); PLoS Computational Biology; ISSN 1553-734X, 1553-7358; 2018-04-01; vol. 14, no. 4, e1006106; doi:10.1371/journal.pcbi.1006106; RIDDLE: Race and ethnicity Imputation from Disease history with Deep LEarning.; Ji-Sung Kim, Xin Gao, Andrey Rzhetsky; Anonymized electronic medical records are an increasingly popular source of research data. However, these datasets often lack race and ethnicity information. This creates problems for researchers modeling human disease, as race and ethnicity are powerful confounders for many health exposures and treatment outcomes; race and ethnicity are closely linked to population-specific genetic variation. We showed that deep neural networks generate more accurate estimates for missing racial and ethnic information than competing methods (e.g., logistic regression, random forest, support vector machines, and gradient-boosted decision trees). RIDDLE yielded significantly better classification performance across all metrics that were considered: accuracy, cross-entropy loss (error), precision, recall, and area under the curve for receiver operating characteristic plots (all p < 10^-9). We made specific efforts to interpret the trained neural network models to identify, quantify, and visualize medical features which are predictive of race and ethnicity. We used these characterizations of informative features to perform a systematic comparison of differential disease patterns by race and ethnicity. The fact that clinical histories are informative for imputing race and ethnicity could reflect (1) a skewed distribution of blue- and white-collar professions across racial and ethnic groups, (2) uneven accessibility and subjective importance of prophylactic health, (3) possible variation in lifestyle, such as dietary habits, and (4) differences in background genetic variation which predispose to diseases.; http://europepmc.org/articles/PMC5940243?pdf=render
collection DOAJ
language English
format Article
sources DOAJ
author Ji-Sung Kim
Xin Gao
Andrey Rzhetsky
spellingShingle Ji-Sung Kim
Xin Gao
Andrey Rzhetsky
RIDDLE: Race and ethnicity Imputation from Disease history with Deep LEarning.
PLoS Computational Biology
author_facet Ji-Sung Kim
Xin Gao
Andrey Rzhetsky
author_sort Ji-Sung Kim
title RIDDLE: Race and ethnicity Imputation from Disease history with Deep LEarning.
title_sort riddle: race and ethnicity imputation from disease history with deep learning.
publisher Public Library of Science (PLoS)
series PLoS Computational Biology
issn 1553-734X
1553-7358
publishDate 2018-04-01
description Anonymized electronic medical records are an increasingly popular source of research data. However, these datasets often lack race and ethnicity information. This creates problems for researchers modeling human disease, as race and ethnicity are powerful confounders for many health exposures and treatment outcomes; race and ethnicity are closely linked to population-specific genetic variation. We showed that deep neural networks generate more accurate estimates for missing racial and ethnic information than competing methods (e.g., logistic regression, random forest, support vector machines, and gradient-boosted decision trees). RIDDLE yielded significantly better classification performance across all metrics that were considered: accuracy, cross-entropy loss (error), precision, recall, and area under the curve for receiver operating characteristic plots (all p < 10^-9). We made specific efforts to interpret the trained neural network models to identify, quantify, and visualize medical features which are predictive of race and ethnicity. We used these characterizations of informative features to perform a systematic comparison of differential disease patterns by race and ethnicity. The fact that clinical histories are informative for imputing race and ethnicity could reflect (1) a skewed distribution of blue- and white-collar professions across racial and ethnic groups, (2) uneven accessibility and subjective importance of prophylactic health, (3) possible variation in lifestyle, such as dietary habits, and (4) differences in background genetic variation which predispose to diseases.
url http://europepmc.org/articles/PMC5940243?pdf=render
work_keys_str_mv AT jisungkim riddleraceandethnicityimputationfromdiseasehistorywithdeeplearning
AT xingao riddleraceandethnicityimputationfromdiseasehistorywithdeeplearning
AT andreyrzhetsky riddleraceandethnicityimputationfromdiseasehistorywithdeeplearning
_version_ 1724989897863856128