Anonymization Framework for Securing Protected Health Information in a Complex Dataset of Medical Narratives
In the medical domain, it is imperative that the protection of an individual's information is not overlooked. The research community encourages the use of real-time datasets for research purposes. These real-time datasets contain structured and unstructured (natural language free tex...
Main Authors: | Saman Hina, Raheela Asif, Syed Abbas Ali |
---|---|
Format: | Article |
Language: | English |
Published: | Mehran University of Engineering and Technology, 2020-07-01 |
Series: | Mehran University Research Journal of Engineering and Technology |
Online Access: | https://publications.muet.edu.pk/index.php/muetrj/article/view/1704 |
id | doaj-e31157b2e2f04937b4992e57a3992ab8 |
---|---|
record_format | Article |
spelling | doaj-e31157b2e2f04937b4992e57a3992ab8; 2020-11-25T02:36:36Z; eng; Mehran University of Engineering and Technology; Mehran University Research Journal of Engineering and Technology; 0254-7821; 2413-7219; 2020-07-01; 39; 3; 612; 624; 10.22581/muet1982.2003.16; 1704; Anonymization Framework for Securing Protected Health Information in a Complex Dataset of Medical Narratives; Saman Hina [0]; Raheela Asif [1]; Syed Abbas Ali [2]; Department of Computer Science and Information Technology, NED University of Engineering and Technology, Karachi, Pakistan; Department of Software Engineering, NED University of Engineering and Technology, Karachi, Pakistan; Department of Computer and Information Systems Engineering, NED University of Engineering and Technology, Karachi, Pakistan; https://publications.muet.edu.pk/index.php/muetrj/article/view/1704 |
collection | DOAJ |
language | English |
format | Article |
sources | DOAJ |
author | Saman Hina, Raheela Asif, Syed Abbas Ali |
title | Anonymization Framework for Securing Protected Health Information in a Complex Dataset of Medical Narratives |
publisher | Mehran University of Engineering and Technology |
series | Mehran University Research Journal of Engineering and Technology |
issn | 0254-7821, 2413-7219 |
publishDate | 2020-07-01 |
description | In the medical domain, it is imperative that the protection of an individual's information is not overlooked. The research community encourages the use of real-time datasets for research purposes. These real-time datasets contain structured and unstructured (natural language free text) information that can be useful to researchers in various disciplines, including computational linguistics. However, these datasets cannot be distributed without anonymization of Protected Health Information (PHI). Disclosing PHI (such as name, age, and address) that can identify an individual is unethical. We therefore present a rule-based Natural Language Processing (NLP) anonymization system, developed on a challenging corpus containing medical narratives and ICD-10 codes (medical codes). This anonymization module can be used to pre-process a corpus containing identifiable information. The corpus used in this research contains 2534 PHIs in 1984 medical records in total. 15% of the labelled corpus was used to improve the guidelines for identifying and classifying PHI groups, and 85% was held out for evaluation. Our anonymization system follows a two-step process: (1) identification and cataloguing of PHIs into four PHI categories ('Patients Name', 'Doctors Name', 'Other Name' [names other than patients and doctors], 'Place Name'); (2) anonymization of PHIs by replacing the identified PHIs with their respective PHI categories. Our method uses basic language processing, dictionaries, rules and heuristics to identify, classify and anonymize PHIs. We use standard metrics for evaluation; measured against a human-annotated gold standard, our system achieves an F-measure of 100%, an increase of 39% over the baseline results, which demonstrates the reliability of the data for research use. |
url | https://publications.muet.edu.pk/index.php/muetrj/article/view/1704 |
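
The two-step process described in the abstract (identify PHIs by category, then replace each one with its category label) and the F-measure evaluation can be illustrated with a minimal sketch. This is not the authors' system: the title lists, the place dictionary, the regular-expression rules, and the function names `identify`, `anonymize`, and `f_measure` are all hypothetical, chosen only to show the general shape of a dictionary-and-rule approach. The 'Other Name' category is left out because this record gives no hint of how the paper detects it.

```python
import re

# Hypothetical dictionaries (assumptions for illustration only; the
# paper's actual lexicons and rules are not given in this record).
DOCTOR_TITLES = ("Dr", "Doctor")
PATIENT_TITLES = ("Mr", "Mrs", "Ms")
PLACE_NAMES = ("Karachi", "Lahore")

# One or more capitalized words, e.g. "Ahmed Khan".
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Each rule pairs a PHI category with a pattern whose group 1 is the PHI.
RULES = [
    ("Doctors Name",
     re.compile(r"\b(?:%s)\.? (%s)" % ("|".join(DOCTOR_TITLES), NAME))),
    ("Patients Name",
     re.compile(r"\b(?:%s)\.? (%s)" % ("|".join(PATIENT_TITLES), NAME))),
    ("Place Name",
     re.compile(r"\b(%s)\b" % "|".join(PLACE_NAMES))),
]


def identify(text):
    """Step 1: return sorted (start, end, category) spans for PHI mentions."""
    spans = []
    for category, pattern in RULES:
        for match in pattern.finditer(text):
            spans.append((match.start(1), match.end(1), category))
    return sorted(spans)


def anonymize(text):
    """Step 2: replace each identified span with its category label."""
    pieces, last = [], 0
    for start, end, category in identify(text):
        if start < last:
            continue  # skip a span that overlaps one already replaced
        pieces.append(text[last:start])
        pieces.append("[%s]" % category)
        last = end
    pieces.append(text[last:])
    return "".join(pieces)


def f_measure(predicted, gold):
    """Balanced F-measure of predicted spans against gold-standard spans."""
    true_positives = len(set(predicted) & set(gold))
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    note = "Dr. Ahmed Khan examined Mr. Bilal in Karachi."
    print(anonymize(note))
```

In this sketch the titles ("Dr.", "Mr.") are kept and only the names that follow them are replaced, which is one possible design choice; a real system would need far richer dictionaries and heuristics than the three toy rules shown here.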