P2P Watch: Personal Health Information Detection in Peer-to-Peer File-Sharing Networks

Background: Users of peer-to-peer (P2P) file-sharing networks risk the inadvertent disclosure of personal health information (PHI). In addition to potentially causing harm to the affected individuals, this can heighten the risk of data breaches for health information custodians...


Bibliographic Details
Main Authors: Sokolova, Marina, El Emam, Khaled, Arbuckle, Luk, Neri, Emilio, Rose, Sean, Jonker, Elizabeth
Format: Article
Language: English
Published: JMIR Publications 2012-07-01
Series: Journal of Medical Internet Research
Online Access: http://www.jmir.org/2012/4/e95/
id doaj-54826c2236b44489a7c4acd32278f926
record_format Article
spelling doaj-54826c2236b44489a7c4acd32278f926; 2021-04-02T20:02:54Z; eng; JMIR Publications; Journal of Medical Internet Research; ISSN 1438-8871; 2012-07-01; vol. 14, no. 4, e95; doi:10.2196/jmir.1898; http://www.jmir.org/2012/4/e95/
collection DOAJ
language English
format Article
sources DOAJ
author Sokolova, Marina
El Emam, Khaled
Arbuckle, Luk
Neri, Emilio
Rose, Sean
Jonker, Elizabeth
author_sort Sokolova, Marina
title P2P Watch: Personal Health Information Detection in Peer-to-Peer File-Sharing Networks
publisher JMIR Publications
series Journal of Medical Internet Research
issn 1438-8871
publishDate 2012-07-01
description
Background: Users of peer-to-peer (P2P) file-sharing networks risk the inadvertent disclosure of personal health information (PHI). In addition to potentially causing harm to the affected individuals, this can heighten the risk of data breaches for health information custodians. Automated PHI detection tools that crawl the P2P networks can identify PHI and alert custodians. While there has been previous work on the detection of personal information in electronic health records, there has been a dearth of research on the automated detection of PHI in heterogeneous user files.
Objective: To build a system that accurately detects PHI in files sent through P2P file-sharing networks. The system, which we call P2P Watch, uses a pipeline of text processing techniques to automatically detect PHI in files exchanged through P2P networks. P2P Watch processes unstructured texts regardless of the file format, document type, and content.
Methods: We developed P2P Watch to extract and analyze PHI in text files exchanged on P2P networks. We labeled texts as PHI if they contained identifiable information about a person (eg, name and date of birth) and specifics of the person’s health (eg, diagnosis, prescriptions, and medical procedures). We evaluated the system’s performance through its efficiency and effectiveness on 3924 files gathered from three P2P networks.
Results: P2P Watch successfully processed 3924 P2P files of unknown content. A manual examination of 1578 randomly selected files marked by the system as non-PHI confirmed that these files indeed did not contain PHI, making the false-negative detection rate equal to zero. Of 57 files marked by the system as PHI, all contained both personally identifiable information and health information: 11 files were PHI disclosures, and 46 files contained organizational materials such as unfilled insurance forms, job applications by medical professionals, and essays.
Conclusions: PHI can be successfully detected in free-form textual files exchanged through P2P networks. Once the files with PHI are detected, affected individuals or data custodians can be alerted to take remedial action.
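The labeling criterion in the Methods (a text counts as PHI only when it contains both identifying information and health specifics) can be illustrated with a minimal sketch. This is not the P2P Watch pipeline, whose components the abstract does not describe: the keyword patterns and the names `label_text` and `disclosure_share` are hypothetical and chosen only for illustration. The second function simply works out, from the counts reported in the Results, what share of the flagged files were actual disclosures.

```python
import re

# Hypothetical cue patterns, for illustration only; the abstract does not
# list the features or techniques that P2P Watch actually uses.
IDENTIFIER_PATTERNS = [
    re.compile(r"\b(?:date of birth|dob)\b", re.IGNORECASE),
    re.compile(r"\bname\s*:", re.IGNORECASE),
]
HEALTH_PATTERNS = [
    re.compile(r"\b(?:diagnos(?:is|ed)|prescription|prescribed|surgery)\b",
               re.IGNORECASE),
]


def has_match(patterns, text):
    """Return True if any of the compiled patterns occurs in the text."""
    return any(p.search(text) for p in patterns)


def label_text(text):
    """Label a text "PHI" only if identifier AND health cues both appear."""
    if has_match(IDENTIFIER_PATTERNS, text) and has_match(HEALTH_PATTERNS, text):
        return "PHI"
    return "non-PHI"


def disclosure_share(disclosures, flagged):
    """Fraction of flagged files that were actual PHI disclosures."""
    return disclosures / flagged


if __name__ == "__main__":
    print(label_text("Name: Jane Doe, date of birth 1970-01-01, diagnosed with asthma"))
    print(label_text("Essay on asthma diagnosis and prescription costs"))
    # Counts reported in the Results: 11 disclosures among 57 flagged files.
    print(round(disclosure_share(11, 57), 3))
```

With the reported counts, 11 of 57 flagged files (about 19%) were actual disclosures. A rule of this kind also flags a blank form such as "Name: ____ Diagnosis: ____", which matches the Results: the other 46 flagged files contained both kinds of information without being disclosures.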
url http://www.jmir.org/2012/4/e95/