Medical record linkage in health information systems by approximate string matching and clustering

Abstract. Background: Multiplication of data sources within heterogeneous healthcare information systems always results in redundant information, split among multiple databases. Our objective is to detect exact and approximate duplicates within identity r...

Bibliographic Details
Main Authors:	Buemi Antoine, Paumier Jean-Philippe, Sauleau Erik A
Format:	Article
Language:	English
Published:	BMC 2005-10-01
Series:	BMC Medical Informatics and Decision Making
Online Access:	http://www.biomedcentral.com/1472-6947/5/32

id	doaj-687384e5a67c414bbf7f00d9a9ae6c44
record_format	Article
spelling	doaj-687384e5a67c414bbf7f00d9a9ae6c44; 2020-11-25T02:28:17Z; eng; BMC; BMC Medical Informatics and Decision Making; 1472-6947; 2005-10-01; 5; 1; 32; 10.1186/1472-6947-5-32; Medical record linkage in health information systems by approximate string matching and clustering; Buemi Antoine; Paumier Jean-Philippe; Sauleau Erik A; http://www.biomedcentral.com/1472-6947/5/32
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Buemi Antoine, Paumier Jean-Philippe, Sauleau Erik A
author_sort	Buemi Antoine
title	Medical record linkage in health information systems by approximate string matching and clustering
publisher	BMC
series	BMC Medical Informatics and Decision Making
issn	1472-6947
publishDate	2005-10-01
description	Background: Multiplication of data sources within heterogeneous healthcare information systems always results in redundant information, split among multiple databases. Our objective is to detect exact and approximate duplicates within identity records, in order to attain a better quality of information and to permit cross-linkage among stand-alone and clustered databases. Furthermore, we need to assist human decision making by computing a value reflecting identity proximity. Methods: The proposed method has three steps. The first step is to standardise and index elementary identity fields, using blocking variables, in order to speed up information analysis. The second is to match pairs of similar records, relying on a global similarity value taken from the Porter-Jaro-Winkler algorithm. The third is to create clusters of coherent related records, using graph drawing, agglomerative clustering methods and partitioning methods. Results: The batch analysis of 300,000 "supposedly" distinct identities isolates 240,000 true unique records, 24,000 duplicates (clusters composed of 2 records) and 3,000 clusters whose size is greater than or equal to 3 records. Conclusion: Duplicate-free databases, used in conjunction with relevant indexes and similarity values, allow immediate (i.e., real-time) proximity detection when inserting a new identity.
url	http://www.biomedcentral.com/1472-6947/5/32
_version_	1724839352972869632
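
The three steps named in the abstract (standardise and block, score record pairs, cluster) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It uses the standard Jaro-Winkler similarity, not the "Porter-Jaro-Winkler" variant the abstract mentions, and it groups matched pairs by connected components, which is a simpler stand-in for the agglomerative and partitioning methods of the paper. The record layout (surname, first name, birth year), the blocking key, the 0.9 threshold and all function names are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def jaro(s, t):
    """Standard Jaro similarity between two strings, in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters out of order count as half a transposition each.
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s, t, scaling=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)


def standardise(name):
    """Step 1a (assumed rule): upper-case and keep letters only."""
    return "".join(ch for ch in name.upper() if ch.isalpha())


def link_records(records, threshold=0.9):
    """records: list of (surname, first_name, birth_year) tuples.
    Returns clusters as sorted lists of record indexes."""
    # Step 1b: blocking, so only records sharing a key are compared.
    # The key (surname initial, birth year) is a hypothetical choice.
    blocks = defaultdict(list)
    for idx, (surname, _first, year) in enumerate(records):
        blocks[(standardise(surname)[:1], year)].append(idx)

    parent = list(range(len(records)))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Step 2: score pairs within each block; the global value here is an
    # unweighted mean of the two name similarities (an assumption).
    for members in blocks.values():
        for a, b in combinations(members, 2):
            sa, fa, _ = records[a]
            sb, fb, _ = records[b]
            score = (jaro_winkler(standardise(sa), standardise(sb))
                     + jaro_winkler(standardise(fa), standardise(fb))) / 2
            if score >= threshold:
                parent[find(a)] = find(b)

    # Step 3: connected components of the match graph form the clusters.
    clusters = defaultdict(list)
    for idx in range(len(records)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())


if __name__ == "__main__":
    sample = [("Dupont", "Jean", 1970),
              ("Dupond", "Jean", 1970),
              ("Martin", "Claire", 1982)]
    print(link_records(sample))
```

Blocking keeps the number of comparisons well below the full quadratic pair count, at the cost of missing duplicates whose blocking fields differ. A production system would typically run several blocking passes with different keys and weight the fields when combining similarities.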

Similar Items