Automated Extraction and Presentation of Data Practices in Privacy Policies

Privacy policies are documents required by law and regulations that notify users of the collection, use, and sharing of their personal information on services or applications. While the extraction of personal data objects and their usage thereon is one of the fundamental steps in their automated ana...

Bibliographic Details
Main Authors:	Bui Duc, Shin Kang G., Choi Jong-Min, Shin Junbum
Format:	Article
Language:	English
Published:	Sciendo 2021-04-01
Series:	Proceedings on Privacy Enhancing Technologies
Subjects:	privacy policy; dataset; presentation; annotation; user study; usability
Online Access:	https://doi.org/10.2478/popets-2021-0019

id	doaj-30ed648299104698ac34c4e490f1c170
record_format	Article
spelling	doaj-30ed648299104698ac34c4e490f1c170 | 2021-09-05T14:01:11Z | eng | Sciendo | Proceedings on Privacy Enhancing Technologies | 2299-0984 | 2021-04-01 | 2021 | 2 | 88 | 110 | 10.2478/popets-2021-0019 | Automated Extraction and Presentation of Data Practices in Privacy Policies | Bui Duc [0]; Shin Kang G. [1]; Choi Jong-Min [2]; Shin Junbum [3] | University of Michigan; University of Michigan; Samsung Research; Samsung Research | https://doi.org/10.2478/popets-2021-0019 | privacy policy; dataset; presentation; annotation; user study; usability
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Bui Duc Shin Kang G. Choi Jong-Min Shin Junbum
title	Automated Extraction and Presentation of Data Practices in Privacy Policies
publisher	Sciendo
series	Proceedings on Privacy Enhancing Technologies
issn	2299-0984
publishDate	2021-04-01
description	Privacy policies are documents required by law and regulations that notify users of the collection, use, and sharing of their personal information on services or applications. While the extraction of personal data objects and their usage thereon is one of the fundamental steps in their automated analysis, it remains challenging due to the complex policy statements written in legal (vague) language. Prior work is limited by small/generated datasets and manually created rules. We formulate the extraction of fine-grained personal data phrases and the corresponding data collection or sharing practices as a sequence-labeling problem that can be solved by an entity-recognition model. We create a large dataset with 4.1k sentences (97k tokens) and 2.6k annotated fine-grained data practices from 30 real-world privacy policies to train and evaluate neural networks. We present a fully automated system, called PI-Extract, which accurately extracts privacy practices by a neural model and outperforms, by a large margin, strong rule-based baselines. We conduct a user study on the effects of data practice annotation which highlights and describes the data practices extracted by PI-Extract to help users better understand privacy-policy documents. Our experimental evaluation results show that the annotation significantly improves the users’ reading comprehension of policy texts, as indicated by a 26.6% increase in the average total reading score.
topic	privacy policy; dataset; presentation; annotation; user study; usability
url	https://doi.org/10.2478/popets-2021-0019

Similar Items