Using machine learning for predicting cervical cancer from Swedish electronic health records by mining hierarchical representations.

Electronic health records (EHRs) contain rich documentation regarding disease symptoms and progression, but EHR data is challenging to use for diagnosis prediction due to its high dimensionality, relative scarcity, and substantial level of noise.

Bibliographic Details
Main Authors: Rebecka Weegar, Karin Sundström
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2020-01-01
Series: PLoS ONE
Online Access: https://doi.org/10.1371/journal.pone.0237911
id doaj-25146c0741594b09873adca29c9cc8bd
record_format Article
spelling PLoS ONE, vol. 15, no. 8, e0237911 (2020-01-01); ISSN 1932-6203; DOI 10.1371/journal.pone.0237911; record updated 2021-03-03T22:05:03Z
collection DOAJ
sources DOAJ
issn 1932-6203
description Electronic health records (EHRs) contain rich documentation regarding disease symptoms and progression, but EHR data is challenging to use for diagnosis prediction due to its high dimensionality, relative scarcity, and substantial level of noise. We investigated how best to represent EHR data for predicting cervical cancer, a serious disease for which early detection improves the outcome of treatment. A case group of 1321 patients with cervical cancer was matched to ten times as many controls, and several types of events were extracted from the EHRs of both groups. These events included clinical codes, lab results, and the contents of free-text notes retrieved using an LSTM neural network. Clinical events are described with great variation in EHR texts, which leads to a very large feature space. Therefore, an event hierarchy inferred from the textual events was created to represent the clinical texts. Overall, the events extracted from free-text notes contributed the most to the final prediction, and the hierarchy of textual events further improved performance. Four classifiers were evaluated for predicting a future cancer diagnosis; Random Forest achieved the best results, with an AUC ranging from 0.70 one year before diagnosis to 0.97 one day before diagnosis. We conclude that our approach is sound and achieved excellent discrimination at diagnosis, but only modest discrimination before that point. Since our objective was to predict the disease earlier than this, we propose that further work consider extending patient histories, for example by integrating primary health records that precede referral to hospital.