High quality linkage using Multibit Trees for privacy-preserving blocking

ABSTRACT Objectives As privacy-preserving record linkage (PPRL) emerges as a method for linking sensitive data, efficient blocking techniques that help maintain high levels of linkage quality are required. This research looks at the use of a Q-gram Fingerprinting blocking technique, with Multibit...


Bibliographic Details
Main Authors: Adrian Brown, Christian Borgs, Sean Randall, Rainer Schnell
Format: Article
Language: English
Published: Swansea University 2017-04-01
Series: International Journal of Population Data Science
Online Access: https://ijpds.org/article/view/149
id doaj-9a84606e654e4a75909f89c4f191b96e
record_format Article
spelling doaj-9a84606e654e4a75909f89c4f191b96e 2020-11-25T01:23:29Z eng
volume 1
issue 1
article 149
doi 10.23889/ijpds.v1i1.149
affiliation Adrian Brown (Curtin University)
Christian Borgs (University of Duisburg-Essen)
Sean Randall (Curtin University)
Rainer Schnell (University of Duisburg-Essen)
collection DOAJ
language English
format Article
sources DOAJ
author Adrian Brown
Christian Borgs
Sean Randall
Rainer Schnell
title High quality linkage using Multibit Trees for privacy-preserving blocking
publisher Swansea University
series International Journal of Population Data Science
issn 2399-4908
publishDate 2017-04-01
description ABSTRACT
Objectives: As privacy-preserving record linkage (PPRL) emerges as a method for linking sensitive data, efficient blocking techniques that help maintain high levels of linkage quality are required. This research looks at the use of a Q-gram Fingerprinting blocking technique, with Multibit Trees, and applies this method to real-world datasets.
Approach: Data comprised ten years of hospital and mortality records from several Australian states, totalling over 25 million records. Each record contained a linkage key, as defined by the jurisdiction, which was used as a 'gold standard' to assess quality. Different parameter sets were defined for the linkage tests, with a privacy-preserved file created for each parameter set. Each file contained the jurisdictional linkage key and a Cryptographic Long-term Key (the CLK is a Bloom filter comprising all fields in the parameter set). Each file was run through an implementation of the Q-gram Fingerprinting blocking algorithm as a deduplication technique, using different similarity thresholds. The quality metrics of precision, recall and F-measure were calculated.
Results: Linkage quality varied across parameter sets. Adding suburb and postcode reduced the linkage quality. The best parameter set returned an F-measure of 0.951. In general, precision was high in all settings, but recall fell as more fields were added to the CLK. We will report details for all parameter settings and their corresponding results.
Conclusion: The Q-gram Fingerprinting blocking technique shows promise for maintaining high quality linkage in reasonable time. Choosing which fields to include in the CLK, and selecting optimal similarity thresholds, is important for maximising linkage quality on specific datasets. Developing new technology is important for progressing the implementation of PPRL in real-world settings.
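The abstract describes encoding all fields of a record into one Bloom filter (the CLK), comparing filters against a similarity threshold, and scoring the result with precision, recall and F-measure. The following is a minimal Python sketch of that pipeline, not the authors' implementation. Everything not stated in the abstract is an assumption made for illustration: bigrams (q = 2), a filter length of 1000 bits, 10 hash functions built by double hashing with keyed HMACs, the Dice coefficient as the similarity measure, the key `KEY`, and all function names. A brute-force pairwise comparison stands in for the Multibit Tree search, which exists to avoid exactly that quadratic cost.

```python
import hashlib
import hmac
from itertools import combinations

KEY = b"hypothetical-shared-secret"  # assumption: any agreed secret key


def qgrams(value, q=2):
    """Split a padded, lower-cased string into overlapping q-grams."""
    padded = "_" * (q - 1) + value.lower() + "_" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def clk(fields, length=1000, k=10):
    """Hash the q-grams of every field into one Bloom filter.

    The filter is represented as the frozenset of bit positions set to 1.
    """
    bits = set()
    for value in fields:
        for gram in qgrams(value):
            msg = gram.encode("utf-8")
            h1 = int.from_bytes(hmac.new(KEY, msg, hashlib.sha1).digest(), "big")
            h2 = int.from_bytes(hmac.new(KEY, msg, hashlib.md5).digest(), "big")
            for i in range(k):  # double hashing: position_i = h1 + i * h2
                bits.add((h1 + i * h2) % length)
    return frozenset(bits)


def dice(a, b):
    """Dice coefficient of two Bloom filters given as sets of set bits."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def candidate_pairs(filters, threshold):
    """Brute-force deduplication: index pairs (i, j), i < j, at or above threshold."""
    return {
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(filters), 2)
        if dice(a, b) >= threshold
    }


def quality(predicted, truth):
    """Precision, recall and F-measure of predicted pairs against gold-standard pairs."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    # Made-up records for illustration only; records 0 and 1 are the same person.
    records = [("Smith", "John", "1970"), ("Smyth", "John", "1970"), ("Garcia", "Maria", "1985")]
    filters = [clk(r) for r in records]
    pairs = candidate_pairs(filters, threshold=0.8)
    print(pairs, quality(pairs, truth={(0, 1)}))
```

In this sketch, every additional field adds more q-grams to the same filter, so a discrepancy in any one field lowers the Dice score of a true match. That is one plausible reading of why recall might fall as fields are added, but the abstract itself does not give the cause.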
url https://ijpds.org/article/view/149