High quality linkage using Multibit Trees for privacy-preserving blocking
ABSTRACT Objectives As privacy-preserving record linkage (PPRL) emerges as a method for linking sensitive data, efficient blocking techniques that help maintain high levels of linkage quality are required. This research looks at the use of a Q-gram Fingerprinting blocking technique, with Multibit...
Main Authors: | Adrian Brown, Christian Borgs, Sean Randall, Rainer Schnell |
---|---|
Format: | Article |
Language: | English |
Published: | Swansea University 2017-04-01 |
Series: | International Journal of Population Data Science |
Online Access: | https://ijpds.org/article/view/149 |
id |
doaj-9a84606e654e4a75909f89c4f191b96e |
---|---|
record_format |
Article |
spelling |
doaj-9a84606e654e4a75909f89c4f191b96e; 2020-11-25T01:23:29Z; eng; Swansea University; International Journal of Population Data Science; ISSN 2399-4908; 2017-04-01; vol. 1, issue 1; doi:10.23889/ijpds.v1i1.149; 149; High quality linkage using Multibit Trees for privacy-preserving blocking; Adrian Brown (Curtin University), Christian Borgs (University of Duisburg-Essen), Sean Randall (Curtin University), Rainer Schnell (University of Duisburg-Essen); https://ijpds.org/article/view/149 |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Adrian Brown, Christian Borgs, Sean Randall, Rainer Schnell |
author_sort |
Adrian Brown |
title |
High quality linkage using Multibit Trees for privacy-preserving blocking |
publisher |
Swansea University |
series |
International Journal of Population Data Science |
issn |
2399-4908 |
publishDate |
2017-04-01 |
description |
ABSTRACT
Objectives
As privacy-preserving record linkage (PPRL) emerges as a method for linking sensitive data, efficient blocking techniques that help maintain high levels of linkage quality are required. This research examines a Q-gram Fingerprinting blocking technique, combined with Multibit Trees, and applies it to real-world datasets.
Approach
Data comprised ten years of hospital and mortality records from several Australian states, totalling over 25 million records. Each record contained a linkage key, as defined by the jurisdiction, which was used as a ‘gold standard’ to assess linkage quality. Different parameter sets were defined for the linkage tests, and a privacy-preserved file was created for each parameter set. Each file contained the jurisdictional linkage key and a Cryptographic Long-term Key (CLK), a Bloom filter encoding all fields in the parameter set.
Each file was run through an implementation of the Q-gram Fingerprinting blocking algorithm as a deduplication technique, using different similarity thresholds. Precision, recall and F-measure were calculated as quality metrics.
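The encoding and thresholding steps described in the Approach can be illustrated with a minimal sketch. It is not the authors' implementation: the filter length, number of hash functions, bigram padding, secret key, double-hashing scheme and the choice of the Tanimoto coefficient are all assumptions made for illustration, and the pairwise comparison below is a brute-force stand-in for the Multibit Tree search, which exists precisely to avoid comparing every pair.

```python
import hashlib
from itertools import combinations

FILTER_BITS = 1000  # assumed CLK length (not stated in the abstract)
NUM_HASHES = 10     # assumed number of hash functions per q-gram


def qgrams(value, q=2):
    """Split a field into overlapping q-grams, padded with spaces."""
    padded = " " * (q - 1) + value.strip().lower() + " " * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def clk(fields, secret="demo-key"):
    """Encode all fields of one record into one Bloom filter.

    The filter is represented as the set of bit positions set to 1.
    Double hashing with SHA-1 and SHA-256 is an illustrative choice.
    """
    bits = set()
    for field in fields:
        for gram in qgrams(field):
            data = (secret + gram).encode("utf-8")
            h1 = int.from_bytes(hashlib.sha1(data).digest(), "big")
            h2 = int.from_bytes(hashlib.sha256(data).digest(), "big")
            for i in range(NUM_HASHES):
                bits.add((h1 + i * h2) % FILTER_BITS)
    return frozenset(bits)


def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient of two sets of bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def deduplicate(records, threshold):
    """Return all record-id pairs whose CLKs meet the similarity threshold.

    records maps a record id to the list of field values in the parameter set.
    """
    encoded = {rid: clk(fields) for rid, fields in records.items()}
    return {
        (x, y)
        for x, y in combinations(sorted(encoded), 2)
        if tanimoto(encoded[x], encoded[y]) >= threshold
    }
```

Because every field in the parameter set is hashed into the same filter, each added field contributes more bits that must agree before a pair clears the threshold, which is one plausible reading of why field selection matters.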
Results
Linkage quality varied across parameter sets. Adding suburb and postcode reduced linkage quality. The best parameter set returned an F-measure of 0.951. In general, precision was high in all settings, but recall fell as more fields were added to the CLK. We will report details for all parameter settings and their corresponding results.
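The quality metrics can be computed by comparing the pairs returned by deduplication with the pairs implied by the jurisdictional linkage key. The sketch below assumes pairwise evaluation, where two records are a true match when they share a linkage key; the function names and the toy data in the usage note are hypothetical and not taken from the study.

```python
from collections import defaultdict
from itertools import combinations


def true_pairs(linkage_keys):
    """Derive gold-standard pairs from a mapping of record id to linkage key."""
    groups = defaultdict(list)
    for rid, key in linkage_keys.items():
        groups[key].append(rid)
    return {
        pair
        for ids in groups.values()
        for pair in combinations(sorted(ids), 2)
    }


def quality(predicted, truth):
    """Return (precision, recall, F-measure) for predicted vs. true pairs."""
    tp = len(predicted & truth)  # pairs found that are real matches
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

For example, with linkage keys `{"r1": "A", "r2": "A", "r3": "A", "r4": "B"}` there are three true pairs; predicting `{("r1", "r2"), ("r3", "r4")}` finds one of them, giving a precision of 0.5 and a recall of one third.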
Conclusion
The Q-gram Fingerprinting blocking technique shows promise for maintaining high-quality linkage in reasonable time. Choosing which fields to include in the CLK for a specific dataset, and selecting optimal similarity thresholds, are both important for maximising linkage quality. Developing new technology is important for progressing the implementation of PPRL in real-world settings. |
url |
https://ijpds.org/article/view/149 |