Automated Extraction and Presentation of Data Practices in Privacy Policies


Bibliographic Details
Main Authors: Bui Duc, Shin Kang G., Choi Jong-Min, Shin Junbum
Format: Article
Language: English
Published: Sciendo 2021-04-01
Series: Proceedings on Privacy Enhancing Technologies
Subjects: privacy policy, dataset, presentation, annotation, user study, usability
Online Access: https://doi.org/10.2478/popets-2021-0019
id doaj-30ed648299104698ac34c4e490f1c170
record_format Article
author_affiliation Bui Duc, University of Michigan
Shin Kang G., University of Michigan
Choi Jong-Min, Samsung Research
Shin Junbum, Samsung Research
volume 2021
issue 2
pages 88-110
collection DOAJ
language English
format Article
sources DOAJ
author Bui Duc
Shin Kang G.
Choi Jong-Min
Shin Junbum
title Automated Extraction and Presentation of Data Practices in Privacy Policies
publisher Sciendo
series Proceedings on Privacy Enhancing Technologies
issn 2299-0984
publishDate 2021-04-01
description Privacy policies are documents required by laws and regulations that notify users of the collection, use, and sharing of their personal information by services or applications. Extracting personal data objects and how they are used is one of the fundamental steps in the automated analysis of privacy policies, but it remains challenging because policy statements are complex and written in vague legal language. Prior work is limited by small or generated datasets and manually created rules. We formulate the extraction of fine-grained personal data phrases and the corresponding data collection or sharing practices as a sequence-labeling problem that can be solved by an entity-recognition model. We create a large dataset with 4.1k sentences (97k tokens) and 2.6k annotated fine-grained data practices from 30 real-world privacy policies to train and evaluate neural networks. We present a fully automated system, called PI-Extract, which accurately extracts privacy practices using a neural model and outperforms strong rule-based baselines by a large margin. We conduct a user study on the effects of data practice annotation, which highlights and describes the data practices extracted by PI-Extract to help users better understand privacy-policy documents. Our experimental evaluation shows that the annotation significantly improves users’ reading comprehension of policy texts, as indicated by a 26.6% increase in the average total reading score.
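The sequence-labeling formulation mentioned in the description can be illustrated with a minimal sketch. It assumes a standard BIO tagging scheme; the label names COLLECT and SHARE, the example sentence, and the helper bio_to_spans are hypothetical and are not taken from PI-Extract. The tags are written by hand here, whereas in the described system a trained neural model would predict them.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (label, phrase) pairs.

    Hypothetical helper for illustration only. An I- tag that does not
    continue the currently open span is treated like O.
    """
    spans: List[Tuple[str, str]] = []
    current_label = None
    current_tokens: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new phrase begins; close any phrase that is still open.
            if current_label is not None:
                spans.append((current_label, " ".join(current_tokens)))
            current_label = tag[2:]
            current_tokens = [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continue the open phrase.
            current_tokens.append(token)
        else:
            # Outside any phrase; close the open one, if any.
            if current_label is not None:
                spans.append((current_label, " ".join(current_tokens)))
            current_label = None
            current_tokens = []
    if current_label is not None:
        spans.append((current_label, " ".join(current_tokens)))
    return spans


# Hand-written example sentence and tags (assumed, not from the dataset).
tokens = ["We", "collect", "your", "email", "address", "and",
          "share", "your", "location", "with", "advertisers", "."]
tags = ["O", "O", "B-COLLECT", "I-COLLECT", "I-COLLECT", "O",
        "O", "B-SHARE", "I-SHARE", "O", "O", "O"]

practices = bio_to_spans(tokens, tags)
print(practices)
```

Each token receives exactly one tag, so the data phrase and the practice it belongs to are recovered together from a single pass over the sentence.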
topic privacy policy
dataset
presentation
annotation
user study
usability
url https://doi.org/10.2478/popets-2021-0019