Implementing Vertical Federated Learning Using Autoencoders: Practical Application, Generalizability, and Utility Study

Background: Machine learning (ML) is now widely deployed in our everyday lives. Building robust ML models requires a massive amount of data for training. Traditional ML algorithms require training data centralization, which raises privacy and data governance issues. Federated l...

Bibliographic Details
Main Authors:	Cha, Dongchul; Sung, MinDong; Park, Yu-Rang
Format:	Article
Language:	English
Published:	JMIR Publications 2021-06-01
Series:	JMIR Medical Informatics
Online Access:	https://medinform.jmir.org/2021/6/e26598

id	doaj-30e9aca1873c4d7eaab8d9443dd0fb17
record_format	Article
spelling	doaj-30e9aca1873c4d7eaab8d9443dd0fb17; 2021-06-09T12:31:39Z; eng; JMIR Publications; JMIR Medical Informatics; 2291-9694; 2021-06-01; 9; 6; e26598; 10.2196/26598; Implementing Vertical Federated Learning Using Autoencoders: Practical Application, Generalizability, and Utility Study; Cha, Dongchul; Sung, MinDong; Park, Yu-Rang; https://medinform.jmir.org/2021/6/e26598
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Cha, Dongchul; Sung, MinDong; Park, Yu-Rang
spellingShingle	Cha, Dongchul; Sung, MinDong; Park, Yu-Rang; Implementing Vertical Federated Learning Using Autoencoders: Practical Application, Generalizability, and Utility Study; JMIR Medical Informatics
author_facet	Cha, Dongchul; Sung, MinDong; Park, Yu-Rang
author_sort	Cha, Dongchul
title	Implementing Vertical Federated Learning Using Autoencoders: Practical Application, Generalizability, and Utility Study
publisher	JMIR Publications
series	JMIR Medical Informatics
issn	2291-9694
publishDate	2021-06-01
description	Background: Machine learning (ML) is now widely deployed in our everyday lives. Building robust ML models requires a massive amount of data for training. Traditional ML algorithms require training data centralization, which raises privacy and data governance issues. Federated learning (FL) is an approach to overcome this issue. We focused on applying FL on vertically partitioned data, in which an individual’s record is scattered among different sites. Objective: The aim of this study was to perform FL on vertically partitioned data to achieve performance comparable to that of centralized models without exposing the raw data. Methods: We used three different datasets (Adult income, Schwannoma, and eICU datasets) and vertically divided each dataset into different pieces. Following the vertical division of data, overcomplete autoencoder-based model training was performed for each site. Following training, each site’s data were transformed into latent data, which were aggregated for training. A tabular neural network model with categorical embedding was used for training. A centrally based model was used as a baseline model, which was compared to that of FL in terms of accuracy and area under the receiver operating characteristic curve (AUROC). Results: The autoencoder-based network successfully transformed the original data into latent representations with no domain knowledge applied. These altered data were different from the original data in terms of the feature space and data distributions, indicating appropriate data security. The loss of performance was minimal when using an overcomplete autoencoder; accuracy loss was 1.2%, 8.89%, and 1.23%, and AUROC loss was 1.1%, 0%, and 1.12% in the Adult income, Schwannoma, and eICU dataset, respectively. Conclusions: We proposed an autoencoder-based ML model for vertically incomplete data. Since our model is based on unsupervised learning, no domain-specific knowledge is required in individual sites. Under the circumstances where direct data sharing is not available, our approach may be a practical solution enabling both data protection and building a robust model.
url	https://medinform.jmir.org/2021/6/e26598

Similar Items