Machine Learning Electronic Health Record Identification of Patients with Rheumatoid Arthritis: Algorithm Pipeline Development and Validation Study

Background: Financial codes are often used to extract diagnoses from electronic health records. This approach is prone to false positives. Alternatively, queries are constructed, but these are highly center and language specific. A tantalizing alternative is the automatic ident...

Bibliographic Details
Main Authors: Maarseveen, Tjardo D, Meinderink, Timo, Reinders, Marcel J T, Knitza, Johannes, Huizinga, Tom W J, Kleyer, Arnd, Simon, David, van den Akker, Erik B, Knevel, Rachel
Format: Article
Language: English
Published: JMIR Publications 2020-11-01
Series: JMIR Medical Informatics
Online Access: http://medinform.jmir.org/2020/11/e23930/
id doaj-622f8649116840d89c01d5f489b68dae
record_format Article
spelling doaj-622f8649116840d89c01d5f489b68dae | 2021-05-03T02:53:32Z | eng | JMIR Publications | JMIR Medical Informatics | 2291-9694 | 2020-11-01 | 8 | 11 | e23930 | 10.2196/23930 | http://medinform.jmir.org/2020/11/e23930/
collection DOAJ
language English
format Article
sources DOAJ
author Maarseveen, Tjardo D
Meinderink, Timo
Reinders, Marcel J T
Knitza, Johannes
Huizinga, Tom W J
Kleyer, Arnd
Simon, David
van den Akker, Erik B
Knevel, Rachel
title Machine Learning Electronic Health Record Identification of Patients with Rheumatoid Arthritis: Algorithm Pipeline Development and Validation Study
publisher JMIR Publications
series JMIR Medical Informatics
issn 2291-9694
publishDate 2020-11-01
description Background: Financial codes are often used to extract diagnoses from electronic health records. This approach is prone to false positives. Alternatively, queries are constructed, but these are highly center and language specific. A tantalizing alternative is the automatic identification of patients by employing machine learning on format-free text entries.
Objective: The aim of this study was to develop an easily implementable workflow that builds a machine learning algorithm capable of accurately identifying patients with rheumatoid arthritis from format-free text fields in electronic health records.
Methods: Two electronic health record data sets were employed: Leiden (n=3000) and Erlangen (n=4771). Using a portion of the Leiden data (n=2000), we compared 6 different machine learning methods and a naïve word-matching algorithm using 10-fold cross-validation. Performances were compared using the area under the receiver operating characteristic curve (AUROC) and the area under the precision recall curve (AUPRC), and F1 score was used as the primary criterion for selecting the best method to build a classifying algorithm. We selected the optimal threshold of positive predictive value for case identification based on the output of the best method in the training data. This validation workflow was subsequently applied to a portion of the Erlangen data (n=4293). For testing, the best performing methods were applied to remaining data (Leiden n=1000; Erlangen n=478) for an unbiased evaluation.
Results: For the Leiden data set, the word-matching algorithm demonstrated mixed performance (AUROC 0.90; AUPRC 0.33; F1 score 0.55), and 4 methods significantly outperformed word-matching, with support vector machines performing best (AUROC 0.98; AUPRC 0.88; F1 score 0.83). Applying this support vector machine classifier to the test data resulted in a similarly high performance (F1 score 0.81; positive predictive value [PPV] 0.94), and with this method, we could identify 2873 patients with rheumatoid arthritis in less than 7 seconds out of the complete collection of 23,300 patients in the Leiden electronic health record system. For the Erlangen data set, gradient boosting performed best (AUROC 0.94; AUPRC 0.85; F1 score 0.82) in the training set, and applied to the test data, resulted once again in good results (F1 score 0.67; PPV 0.97).
Conclusions: We demonstrate that machine learning methods can extract the records of patients with rheumatoid arthritis from electronic health record data with high precision, allowing research on very large populations for limited costs. Our approach is language and center independent and could be applied to any type of diagnosis. We have developed our pipeline into a universally applicable and easy-to-implement workflow to equip centers with their own high-performing algorithm. This allows the creation of observational studies of unprecedented size covering different countries for low cost from already available data in electronic health record systems.
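The threshold-selection step in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy labels and scores, and the rule "take the lowest cutoff whose training PPV reaches a target" are all assumptions made for illustration, since the abstract does not state how the optimal threshold was defined.

```python
from typing import Optional, Sequence, Tuple


def confusion(y_true: Sequence[int], scores: Sequence[float],
              threshold: float) -> Tuple[int, int, int]:
    """Count true positives, false positives, and false negatives at a cutoff."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted_case = score >= threshold
        if predicted_case and label == 1:
            tp += 1
        elif predicted_case and label == 0:
            fp += 1
        elif not predicted_case and label == 1:
            fn += 1
    return tp, fp, fn


def ppv_recall_f1(y_true: Sequence[int], scores: Sequence[float],
                  threshold: float) -> Tuple[float, float, float]:
    """PPV (precision), recall, and F1 score at a cutoff; 0.0 when undefined."""
    tp, fp, fn = confusion(y_true, scores, threshold)
    ppv = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * ppv * recall / (ppv + recall) if ppv + recall else 0.0
    return ppv, recall, f1


def pick_threshold(y_true: Sequence[int], scores: Sequence[float],
                   target_ppv: float) -> Optional[float]:
    """Hypothetical rule: lowest observed score whose PPV meets the target.

    Returns None when no cutoff reaches the target.
    """
    for threshold in sorted(set(scores)):
        ppv, _, _ = ppv_recall_f1(y_true, scores, threshold)
        if ppv >= target_ppv:
            return threshold
    return None


# Toy training data (invented): 1 = rheumatoid arthritis, 0 = other diagnosis.
train_labels = [1, 1, 0, 1, 0, 0, 1, 0]
train_scores = [0.95, 0.80, 0.70, 0.65, 0.40, 0.30, 0.85, 0.10]

cutoff = pick_threshold(train_labels, train_scores, target_ppv=0.9)
train_ppv, train_recall, train_f1 = ppv_recall_f1(train_labels, train_scores, cutoff)
```

In a workflow like the one described, the cutoff would be fixed on the training data and then applied unchanged to the held-out test records. PPV is not guaranteed to rise monotonically with the cutoff, so a real pipeline may prefer a different selection rule.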
url http://medinform.jmir.org/2020/11/e23930/