Research Ready Data Lakes: Protecting Privacy in Relatable Datasets

Background with rationale The Georgia Policy Labs’ mission is to improve outcomes for children and families by producing rigorous research with long-term government partners. A key component of this model is having secure access to research-ready, individual level data from multiple sources to answ...


Bibliographic Details
Main Authors: Robert McMillan, Maggie Reeves
Format: Article
Language: English
Published: Swansea University 2019-11-01
Series: International Journal of Population Data Science
Online Access: https://ijpds.org/article/view/1266
id doaj-7c8147addcad42bca8608a371cef954b
record_format Article
spelling doaj-7c8147addcad42bca8608a371cef954b; 2020-11-25T02:21:30Z; eng; Swansea University; International Journal of Population Data Science; ISSN 2399-4908; 2019-11-01; vol. 4, no. 3; doi 10.23889/ijpds.v4i3.1266; Research Ready Data Lakes: Protecting Privacy in Relatable Datasets; Robert McMillan (Georgia State University, Andrew Young School of Policy Studies, Georgia Policy Labs); Maggie Reeves (Georgia State University, Andrew Young School of Policy Studies, Georgia Policy Labs); https://ijpds.org/article/view/1266
collection DOAJ
language English
format Article
sources DOAJ
author Robert McMillan
Maggie Reeves
title Research Ready Data Lakes: Protecting Privacy in Relatable Datasets
publisher Swansea University
series International Journal of Population Data Science
issn 2399-4908
publishDate 2019-11-01
description Background with rationale: The Georgia Policy Labs’ mission is to improve outcomes for children and families by producing rigorous research with long-term government partners. A key component of this model is having secure access to research-ready, individual-level data from multiple sources to answer government agencies’ questions within policy windows. Obtaining sensitive data from our partners requires significant relationship building, demonstrations of value, and assurances of our ability to mitigate all security and privacy concerns.
Objectives:
• Securely transfer and de-identify disparate individual-level datasets with personally identifiable information (PII) from public entities.
• Clean the data and store them in a pristine data lake that is available for fast-turnaround research.
• Ensure individual records can be matched across disparate organizations’ datasets.
Approach: Our practices, infrastructure, data-sharing agreements, and security are built to support the intersection of data availability for researchers and security standards that put our partners at ease. We highlight two solutions that address security concerns while supporting our researchers, both of which other researchers working with sensitive data can adopt. First, we discuss our multiple tiers of transfer and access that remove risk from identifiable data. Second, we share the double hash solution created for a partner who was not willing to share PII. We share the source code for our SHA3-512 double hash solution, which allows for matching of records across disparate datasets without receiving sensitive PII elements.
Results: We created reliable matching values without the need for the actual social security numbers or other PII values on our side, enabling a large school district to share its student-level data with us.
Conclusion: The balance of security and easy access for researchers is a common area of friction. Our security set-up and hashing solution allow others to remove this barrier for applied policy research.
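The abstract names a SHA3-512 double hash but does not describe its mechanics, and the authors' source code is not reproduced in this record. The following is only a minimal sketch of one plausible reading, not the authors' implementation: the data owner hashes a normalized identifier with a secret salt before transfer, and the receiving lab hashes that value again with its own salt. The function names, the salts, the digit-only normalization, and the requirement that every data owner use the same first-stage salt are all assumptions made for illustration.

```python
import hashlib


def normalize_ssn(raw: str) -> str:
    """Keep digits only so '123-45-6789' and '123456789' hash identically.
    (Hypothetical normalization rule; not taken from the source.)"""
    return "".join(ch for ch in raw if ch.isdigit())


def sha3_512_hex(salt: bytes, value: bytes) -> str:
    """Return the SHA3-512 digest of salt + value as 128 hex characters."""
    return hashlib.sha3_512(salt + value).hexdigest()


def partner_hash(raw_ssn: str, partner_salt: bytes) -> str:
    """First hash, run on the data owner's side; the raw SSN is never sent."""
    return sha3_512_hex(partner_salt, normalize_ssn(raw_ssn).encode("utf-8"))


def lab_hash(first_hash: str, lab_salt: bytes) -> str:
    """Second hash, run on the receiving side over the already-hashed value."""
    return sha3_512_hex(lab_salt, first_hash.encode("ascii"))


# Illustrative salts only; real salts would be long random secrets.
PARTNER_SALT = b"hypothetical-partner-salt"
LAB_SALT = b"hypothetical-lab-salt"

# Two data owners hold the same person under differently formatted SSNs.
key_a = lab_hash(partner_hash("123-45-6789", PARTNER_SALT), LAB_SALT)
key_b = lab_hash(partner_hash("123456789", PARTNER_SALT), LAB_SALT)
key_c = lab_hash(partner_hash("987-65-4321", PARTNER_SALT), LAB_SALT)
```

In this sketch `key_a` and `key_b` are equal, so the two records can be joined, while the lab only ever sees hashed values. The second hash means the stored key differs from anything the data owner holds, so a leak of one side's table does not directly expose the other's.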
url https://ijpds.org/article/view/1266