Summary: | A dissertation submitted in partial fulfillment of the requirements for the degree of Master of Science, Faculty of Science, University of the Witwatersrand, 2020 === Every visit to a clinician or healthcare practitioner generates medical records for the patient in question. These records create an opportunity to make digital medical information increasingly usable for complementary applications, such as biomedical studies, which could enable quicker and more participatory research while safeguarding the confidentiality of healthcare recipients through adequate and effective mechanisms for protecting protected health information (PHI). However, because of legal constraints intended to safeguard patient confidentiality, only a few clinical investigators can access medical notes, and only in PHI-free form. In the United States, the Health Insurance Portability and Accountability Act (HIPAA) provides a general framework for patient confidentiality and specifies 18 types of PHI that must be “scrubbed” from patient notes before those notes and records can be accessed by other individuals or organizations. In South Africa, the disclosure of protected health information is governed by the Protection of Personal Information Act (POPI Act or POPIA). The Protection of Personal Information Act, No. 4 of 2013, promotes the protection of personally identifiable information held by public and private institutions and aims to prevent its unauthorized disclosure. The South African government ratified POPIA on 19 November and released Notice 37067 in the Government Gazette on 26 November 2013. Shared de-identified (“PHI-free”) medical records could help improve care, estimate medical care costs, and support policies on human health. 
Numerous examples support the need to use de-identified records. Clinical text de-identification offers these advantages; when performed manually, however, it is tedious and prone to errors. Deep learning methods combined with Natural Language Processing (NLP) techniques could mitigate these difficulties and make the process smoother. However, the further use of automatically de-identified medical narratives has seldom been investigated. A dependable automated “PHI removal and sanitizing” system will therefore be of great importance, because it would make the vast amount of data contained in medical notes accessible for further research and other uses. In this research, I studied machine learning methods for the de-identification of textual clinical notes. I examined various options for developing methods, based on gradient-based algorithms and/or deep learning, that achieve accurate and robust de-identification of clinical notes. In the process, I developed the LSTM_CRFs de-identification model, which makes full use of a clinical corpus developed for the 2014 i2b2 challenge and the 2016 CEGS N-GRID challenge. Furthermore, I evaluated the model on the collected clinical notes and compared its performance. The results obtained in this study are further compared with those obtained using five different word embeddings trained on general English text, de-identified clinical text, and biomedical literature. The developed LSTM_CRFs model for de-identification achieved superior performance compared with other ML-based methods. === CK2021
|