Automated Extraction and Presentation of Data Practices in Privacy Policies
Privacy policies are documents required by law and regulations that notify users of the collection, use, and sharing of their personal information on services or applications. While the extraction of personal data objects and their usage thereon is one of the fundamental steps in their automated ana...
Main Authors: | Bui Duc, Shin Kang G., Choi Jong-Min, Shin Junbum |
---|---|
Format: | Article |
Language: | English |
Published: | Sciendo 2021-04-01 |
Series: | Proceedings on Privacy Enhancing Technologies |
Subjects: | privacy policy, dataset, presentation, annotation, user study, usability |
Online Access: | https://doi.org/10.2478/popets-2021-0019 |
id |
doaj-30ed648299104698ac34c4e490f1c170 |
---|---|
record_format |
Article |
spelling |
doaj-30ed648299104698ac34c4e490f1c170; 2021-09-05T14:01:11Z; eng; Sciendo; Proceedings on Privacy Enhancing Technologies; 2299-0984; 2021-04-01; 2021(2):88-110; 10.2478/popets-2021-0019; Automated Extraction and Presentation of Data Practices in Privacy Policies; Bui Duc (0); Shin Kang G. (1); Choi Jong-Min (2); Shin Junbum (3); (0) University of Michigan; (1) University of Michigan; (2) Samsung Research; (3) Samsung Research; https://doi.org/10.2478/popets-2021-0019; privacy policy; dataset; presentation; annotation; user study; usability |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Bui Duc; Shin Kang G.; Choi Jong-Min; Shin Junbum |
author_sort |
Bui Duc |
title |
Automated Extraction and Presentation of Data Practices in Privacy Policies |
publisher |
Sciendo |
series |
Proceedings on Privacy Enhancing Technologies |
issn |
2299-0984 |
publishDate |
2021-04-01 |
description |
Privacy policies are documents required by law and regulations that notify users of the collection, use, and sharing of their personal information on services or applications. While the extraction of personal data objects and their usage thereon is one of the fundamental steps in their automated analysis, it remains challenging due to the complex policy statements written in legal (vague) language. Prior work is limited by small/generated datasets and manually created rules. We formulate the extraction of fine-grained personal data phrases and the corresponding data collection or sharing practices as a sequence-labeling problem that can be solved by an entity-recognition model. We create a large dataset with 4.1k sentences (97k tokens) and 2.6k annotated fine-grained data practices from 30 real-world privacy policies to train and evaluate neural networks. We present a fully automated system, called PI-Extract, which accurately extracts privacy practices by a neural model and outperforms, by a large margin, strong rule-based baselines. We conduct a user study on the effects of data practice annotation which highlights and describes the data practices extracted by PI-Extract to help users better understand privacy-policy documents. Our experimental evaluation results show that the annotation significantly improves the users’ reading comprehension of policy texts, as indicated by a 26.6% increase in the average total reading score. |
topic |
privacy policy; dataset; presentation; annotation; user study; usability |
url |
https://doi.org/10.2478/popets-2021-0019 |