De-identifying a public use microdata file from the Canadian national discharge abstract database

Abstract. Background: The Canadian Institute for Health Information (CIHI) collects hospital discharge abstract data (DAD) from Canadian provinces and territories. There are many demands for the disclosure of this data for research and analysis to inform...


Bibliographic Details
Main Authors:	Paton David, Emam Khaled El, Dankar Fida, Koru Gunes
Format:	Article
Language:	English
Published:	BMC 2011-08-01
Series:	BMC Medical Informatics and Decision Making
Online Access:	http://www.biomedcentral.com/1472-6947/11/53

id	doaj-1831eb3c80c9406a9a8e8a2ef48fda4d
record_format	Article
spelling	doaj-1831eb3c80c9406a9a8e8a2ef48fda4d 2020-11-25T00:38:29Z eng BMC BMC Medical Informatics and Decision Making 1472-6947 2011-08-01 11 1 53 10.1186/1472-6947-11-53 De-identifying a public use microdata file from the Canadian national discharge abstract database Paton David Emam Khaled El Dankar Fida Koru Gunes http://www.biomedcentral.com/1472-6947/11/53
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Paton David Emam Khaled El Dankar Fida Koru Gunes
spellingShingle	Paton David Emam Khaled El Dankar Fida Koru Gunes De-identifying a public use microdata file from the Canadian national discharge abstract database BMC Medical Informatics and Decision Making
author_facet	Paton David Emam Khaled El Dankar Fida Koru Gunes
author_sort	Paton David
title	De-identifying a public use microdata file from the Canadian national discharge abstract database
title_short	De-identifying a public use microdata file from the Canadian national discharge abstract database
title_full	De-identifying a public use microdata file from the Canadian national discharge abstract database
title_fullStr	De-identifying a public use microdata file from the Canadian national discharge abstract database
title_full_unstemmed	De-identifying a public use microdata file from the Canadian national discharge abstract database
title_sort	de-identifying a public use microdata file from the canadian national discharge abstract database
publisher	BMC
series	BMC Medical Informatics and Decision Making
issn	1472-6947
publishDate	2011-08-01
description	Abstract. Background: The Canadian Institute for Health Information (CIHI) collects hospital discharge abstract data (DAD) from Canadian provinces and territories. There are many demands for the disclosure of this data for research and analysis to inform policy making. To expedite the disclosure of data for some of these purposes, the construction of a DAD public use microdata file (PUMF) was considered. Such purposes include: confirming some published results, providing broader feedback to CIHI to improve data quality, training students and fellows, providing an easily accessible data set for researchers to prepare for analyses on the full DAD data set, and serving as a large health data set for computer scientists and statisticians to evaluate analysis and data mining techniques. The objective of this study was to measure the probability of re-identification for records in a PUMF, and to de-identify a national DAD PUMF consisting of 10% of records. Methods: Plausible attacks on a PUMF were evaluated. Based on these attacks, the 2008-2009 national DAD was de-identified. A new algorithm was developed to minimize the amount of suppression while maximizing the precision of the data. The acceptable threshold for the probability of correct re-identification of a record was set at between 0.04 and 0.05. Information loss was measured in terms of the extent of suppression and entropy. Results: Two different PUMF files were produced, one with geographic information, and one with no geographic information but more clinical information. At a threshold of 0.05, the maximum proportion of records with the diagnosis code suppressed was 20%, but these suppressions represented only 8-9% of all values in the DAD. Our suppression algorithm has less information loss than a more traditional approach to suppression. Smaller regions, patients with longer stays, and age groups that are infrequently admitted to hospitals tend to be the ones with the highest rates of suppression. Conclusions: The strategies we used to maximize data utility and minimize information loss can result in a PUMF that would be useful for the specific purposes noted earlier. However, to create a more detailed file with less information loss suitable for more complex health services research, the risk would need to be mitigated by requiring the data recipient to commit to a data sharing agreement.
url	http://www.biomedcentral.com/1472-6947/11/53
work_keys_str_mv	AT patondavid deidentifyingapublicusemicrodatafilefromthecanadiannationaldischargeabstractdatabase AT emamkhaledel deidentifyingapublicusemicrodatafilefromthecanadiannationaldischargeabstractdatabase AT dankarfida deidentifyingapublicusemicrodatafilefromthecanadiannationaldischargeabstractdatabase AT korugunes deidentifyingapublicusemicrodatafilefromthecanadiannationaldischargeabstractdatabase
_version_	1725297267869483008
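
Illustrative note on the abstract's Methods: the record gives a re-identification threshold of 0.04 to 0.05 but does not describe the attack model or the suppression algorithm. The sketch below is therefore not the authors' method. It assumes a common simplification in which a record's risk is 1 divided by the number of records sharing its quasi-identifier values, so a threshold of 0.05 corresponds to groups of at least 20 records. The field names (`age_group`, `region`, `dx`) and function names are hypothetical.

```python
from collections import Counter

SUPPRESSED = None  # marker used here for a suppressed value (an assumption)


def reid_risk(records, quasi_identifiers):
    """Per-record risk, modelled as 1 / size of the record's equivalence class."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    sizes = Counter(keys)
    return [1.0 / sizes[k] for k in keys]


def suppress_high_risk(records, quasi_identifiers, field, threshold=0.05):
    """Naive single pass: blank `field` wherever the risk exceeds `threshold`.

    Returns new dicts; the input records are not modified.
    """
    risks = reid_risk(records, quasi_identifiers)
    out = []
    for rec, risk in zip(records, risks):
        rec = dict(rec)
        if risk > threshold:
            rec[field] = SUPPRESSED
        out.append(rec)
    return out


def suppression_rate(records, field):
    """Fraction of records whose `field` has been suppressed."""
    return sum(r[field] is SUPPRESSED for r in records) / len(records)
```

A single pass like this only marks values; a real procedure would recompute the risk after suppression and would try to choose which values to suppress so that information loss stays low, which is the trade-off the abstract reports on.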


Similar Items