Anonymization Framework for Securing Protected Health Information in a Complex Dataset of Medical Narratives

In the medical domain, it is imperative that the protection of information does not overlook any individual. The research community encourages the use of real-time datasets for research purposes. These datasets contain structured and unstructured (natural language free tex...


Bibliographic Details
Main Authors:	Saman Hina, Raheela Asif, Syed Abbas Ali
Format:	Article
Language:	English
Published:	Mehran University of Engineering and Technology 2020-07-01
Series:	Mehran University Research Journal of Engineering and Technology
Online Access:	https://publications.muet.edu.pk/index.php/muetrj/article/view/1704

id	doaj-e31157b2e2f04937b4992e57a3992ab8
record_format	Article
spelling	doaj-e31157b2e2f04937b4992e57a3992ab8; 2020-11-25T02:36:36Z; eng; Mehran University of Engineering and Technology; Mehran University Research Journal of Engineering and Technology; ISSN 0254-7821, 2413-7219; 2020-07-01; vol. 39, no. 3, pp. 612-624; DOI 10.22581/muet1982.2003.16; 1704; Anonymization Framework for Securing Protected Health Information in a Complex Dataset of Medical Narratives; Saman Hina (0), Raheela Asif (1), Syed Abbas Ali (2); (0) Department of Computer Science and Information Technology, NED University of Engineering and Technology, Karachi, Pakistan; (1) Department of Software Engineering, NED University of Engineering and Technology, Karachi, Pakistan; (2) Department of Computer and Information Systems Engineering, NED University of Engineering and Technology, Karachi, Pakistan; https://publications.muet.edu.pk/index.php/muetrj/article/view/1704
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Saman Hina; Raheela Asif; Syed Abbas Ali
title	Anonymization Framework for Securing Protected Health Information in a Complex Dataset of Medical Narratives
publisher	Mehran University of Engineering and Technology
series	Mehran University Research Journal of Engineering and Technology
issn	0254-7821 2413-7219
publishDate	2020-07-01
description	In the medical domain, it is imperative that the protection of information does not overlook any individual. The research community encourages the use of real-time datasets for research purposes. These datasets contain structured and unstructured (natural language free text) information that can be useful to researchers in various disciplines, including computational linguistics. However, they cannot be distributed without anonymization of Protected Health Information (PHI), because disclosing PHI (such as name, age, address, etc.) that can identify an individual is unethical. Therefore, we present a rule-based Natural Language Processing (NLP) anonymization system developed on a challenging corpus containing medical narratives and ICD-10 codes (medical codes). This anonymization module can be used for pre-processing a corpus containing identifiable information. The corpus used in this research contains 2534 PHIs in 1984 medical records. Of the labelled corpus, 15% was used to improve the guidelines for identifying and classifying PHI groups, and 85% was held out for evaluation. Our anonymization system follows a two-step process: (1) identification and cataloging of PHIs into four PHI categories ('Patients Name', 'Doctors Name', 'Other Name [Names other than patients and doctors]', 'Place Name'); (2) anonymization of PHIs by replacing the identified PHIs with their respective PHI categories. Our method uses basic language processing, dictionaries, rules and heuristics to identify, classify and anonymize PHIs. We use standard metrics for evaluation: measured against a human-annotated gold standard, our system achieves an F-measure of 100%, an increase of 39% over the baseline results, which demonstrates the reliability of the anonymized data for research use.
url	https://publications.muet.edu.pk/index.php/muetrj/article/view/1704
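
The two-step process described in the abstract (identify and categorize PHIs, then replace each one with its category label) and the F-measure evaluation can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the record does not give the authors' dictionaries, rules or heuristics, so the cue words, the toy place dictionary, the regular expressions and the function names `identify_phi`, `anonymize` and `f_measure` are hypothetical. The 'Other Name' category is left out for brevity, and overlapping matches are not resolved.

```python
import re
from typing import List, Tuple

# Hypothetical toy dictionary; the paper's real dictionaries are not given here.
PLACE_DICTIONARY = {"Karachi", "Lahore"}

# A PHI span: (start offset, end offset, PHI category).
Span = Tuple[int, int, str]


def identify_phi(text: str) -> List[Span]:
    """Step 1: find PHI spans and assign each one a PHI category."""
    spans: List[Span] = []
    # Illustrative rule: a doctor title followed by a capitalized word.
    for m in re.finditer(r"\b(?:Dr|Doctor)\.?\s+([A-Z][a-z]+)", text):
        spans.append((m.start(1), m.end(1), "Doctors Name"))
    # Illustrative rule: a patient cue word followed by a capitalized word.
    for m in re.finditer(r"\b(?:[Pp]atient|[Pp]t)\.?\s+([A-Z][a-z]+)", text):
        spans.append((m.start(1), m.end(1), "Patients Name"))
    # Dictionary lookup for place names.
    for place in PLACE_DICTIONARY:
        for m in re.finditer(r"\b" + re.escape(place) + r"\b", text):
            spans.append((m.start(), m.end(), "Place Name"))
    return sorted(spans)


def anonymize(text: str, spans: List[Span]) -> str:
    """Step 2: replace every identified PHI with its category label."""
    # Work from right to left so earlier offsets stay valid after each edit.
    for start, end, category in sorted(spans, reverse=True):
        text = text[:start] + "[" + category + "]" + text[end:]
    return text


def f_measure(predicted: List[Span], gold: List[Span]) -> float:
    """Balanced F-measure of predicted spans against gold-standard spans."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    note = "Patient Ahmed was seen by Dr. Khan in Karachi."
    found = identify_phi(note)
    print(anonymize(note, found))
    print(f_measure(found, found))
```

In this sketch a span counts as correct only if its offsets and its category both match the gold annotation exactly; whether the authors score exact or partial matches is not stated in this record.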


Similar Items