An efficient record linkage scheme using graphical analysis for identifier error detection

Abstract Background Integration of information on individuals (record linkage) is a key problem in healthcare delivery, epidemiology, and "business intelligence" applications. It is now common to be required to link very large numbers of recor...

Bibliographic Details
Main Authors:	Peto Tim EA, Walker A, Finney John M, Wyllie David H
Format:	Article
Language:	English
Published:	BMC 2011-02-01
Series:	BMC Medical Informatics and Decision Making
Online Access:	http://www.biomedcentral.com/1472-6947/11/7

id	doaj-3e37f170b56a42c8be8dc50634701e16
record_format	Article
doi	10.1186/1472-6947-11-7
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Peto Tim EA Walker A Finney John M Wyllie David H
title	An efficient record linkage scheme using graphical analysis for identifier error detection
publisher	BMC
series	BMC Medical Informatics and Decision Making
issn	1472-6947
publishDate	2011-02-01
description	Abstract
Background: Integration of information on individuals (record linkage) is a key problem in healthcare delivery, epidemiology, and "business intelligence" applications. It is now common to be required to link very large numbers of records, often containing various combinations of theoretically unique identifiers, such as NHS numbers, which are both incomplete and error-prone.
Methods: We describe a two-step record linkage algorithm in which identifiers with high cardinality are identified or generated, and used to perform an initial exact match based linkage. Subsequently, the resulting clusters are studied and, if appropriate, partitioned using a graph based algorithm detecting erroneous identifiers.
Results: The system was used to cluster over 250 million health records from five data sources within a large UK hospital group. Linkage, which was completed in about 30 minutes, yielded 3.6 million clusters of which about 99.8% contain, with high likelihood, records from one patient. Although computationally efficient, the algorithm's requirement for exact matching of at least one identifier of each record to another for cluster formation may be a limitation in some databases containing records of low identifier quality.
Conclusions: The technique described offers a simple, fast and highly efficient two-step method for large scale initial linkage for records commonly found in the UK's National Health Service.
url	http://www.biomedcentral.com/1472-6947/11/7
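
The two-step scheme summarised in the Methods paragraph of the abstract can be illustrated with a minimal Python sketch. Step 1 clusters records that share at least one exactly matching identifier value, using a union-find structure. Step 2 here is only a stand-in heuristic, assumed for illustration and not the authors' published graph algorithm: it flags identifier values that are the sole bridge holding a cluster together, as candidates for review. The field names `nhs` and `mrn`, the function names, and the sample records are all hypothetical.

```python
from collections import defaultdict


def link_exact(records):
    """Step 1: cluster records that share at least one identifier value."""
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # (field, value) -> index of first record carrying it
    for idx, rec in enumerate(records):
        for field, value in rec.items():
            if value is None:  # missing identifiers never link records
                continue
            key = (field, value)
            if key in first_seen:
                parent[find(idx)] = find(first_seen[key])
            else:
                first_seen[key] = idx

    clusters = defaultdict(list)
    for idx in range(len(records)):
        clusters[find(idx)].append(idx)
    return sorted(sorted(c) for c in clusters.values())


def suspect_identifiers(records, cluster):
    """Step 2 (assumed heuristic): values whose removal splits the cluster."""
    keys = {(f, v) for i in cluster for f, v in records[i].items() if v is not None}
    suspects = []
    for key in sorted(keys):
        # Re-link the cluster with this one identifier value blanked out.
        reduced = [
            {f: v for f, v in records[i].items() if (f, v) != key}
            for i in cluster
        ]
        if len(link_exact(reduced)) > 1:
            suspects.append(key)
    return suspects


# Hypothetical sample: record 4 mixes identifiers of two different patients.
records = [
    {"nhs": "111", "mrn": "A1"},
    {"nhs": "111", "mrn": "A1"},
    {"nhs": "222", "mrn": "B7"},
    {"nhs": "222", "mrn": "B7"},
    {"nhs": "111", "mrn": "B7"},
    {"nhs": "333", "mrn": "C9"},
]
clusters = link_exact(records)
flagged = suspect_identifiers(records, clusters[0])
```

In this sample the first cluster merges two patients through record 4, and the heuristic flags the two identifier values on that record as bridge candidates. It cannot say which of the two is wrong, and it re-links the cluster once per identifier value, so it is suitable only for small clusters.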

Similar Items