Using machine learning for predicting cervical cancer from Swedish electronic health records by mining hierarchical representations.
Electronic health records (EHRs) contain rich documentation regarding disease symptoms and progression, but EHR data is challenging to use for diagnosis prediction due to its high dimensionality, relative scarcity, and substantial level of noise. We investigated how to best represent EHR data for pr...
Main Authors: | Rebecka Weegar, Karin Sundström |
---|---|
Format: | Article |
Language: | English |
Published: | Public Library of Science (PLoS) 2020-01-01 |
Series: | PLoS ONE |
Online Access: | https://doi.org/10.1371/journal.pone.0237911 |
id |
doaj-25146c0741594b09873adca29c9cc8bd |
---|---|
record_format |
Article |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Rebecka Weegar Karin Sundström |
title |
Using machine learning for predicting cervical cancer from Swedish electronic health records by mining hierarchical representations. |
publisher |
Public Library of Science (PLoS) |
series |
PLoS ONE |
issn |
1932-6203 |
publishDate |
2020-01-01 |
description |
Electronic health records (EHRs) contain rich documentation regarding disease symptoms and progression, but EHR data is challenging to use for diagnosis prediction due to its high dimensionality, relative scarcity, and substantial level of noise. We investigated how to best represent EHR data for predicting cervical cancer, a serious disease where early detection is beneficial for the outcome of treatment. A case group of 1321 patients with cervical cancer was matched to ten times as many controls, and for both groups several types of events were extracted from their EHRs. These events included clinical codes, lab results, and contents of free text notes retrieved using an LSTM neural network. Clinical events are described with great variation in EHR texts, leading to a very large feature space. Therefore, an event hierarchy inferred from the textual events was created to represent the clinical texts. Overall, the events extracted from free text notes contributed the most to the final prediction, and the hierarchy of textual events further improved performance. Four classifiers were evaluated for predicting a future cancer diagnosis; Random Forest achieved the best results, with an AUC ranging from 0.70 one year before diagnosis to 0.97 one day before diagnosis. We conclude that our approach is sound and had excellent discrimination at diagnosis, but only modest discrimination capacity before this point. Since our study objective was disease prediction earlier than this, we propose that further work should consider extending patient histories through, e.g., the integration of primary health records preceding referral to hospital. |
url |
https://doi.org/10.1371/journal.pone.0237911 |