IR · CR · Apr 23, 2020

Privacy at Scale: Introducing the PrivaSeer Corpus of Web Privacy Policies

arXiv:2004.11131v2714 citations
AI Analysis

This work addresses the problem of analyzing and simplifying privacy policies for users and researchers, but it is incremental as it primarily provides a new dataset without novel methodological breakthroughs.

The authors tackled the lack of large-scale privacy policy corpora for NLP analysis by creating PrivaSeer, a corpus of over one million English privacy policies, which is significantly larger than previous datasets, and they investigated its composition through readability tests and topic modeling.

Organisations disclose their privacy practices by posting privacy policies on their websites. Even though users often care about their digital privacy, they often don't read privacy policies, since reading them requires a significant investment of time and effort. Although natural language processing can help in privacy policy understanding, there has been a lack of large-scale privacy policy corpora that could be used to analyse, understand, and simplify privacy policies. Thus, we create PrivaSeer, a corpus of over one million English-language website privacy policies, which is significantly larger than any previously available corpus. We design a corpus creation pipeline that consists of crawling the web followed by filtering documents using language detection, document classification, duplicate and near-duplicate removal, and content extraction. We investigate the composition of the corpus, show results from readability tests, document similarity, and keyphrase extraction, and explore the corpus through topic modeling.
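To make the duplicate and near-duplicate removal stage of the pipeline more concrete, here is a minimal sketch of one common way such a filter can work. The abstract does not say which technique the authors used, so everything here is an assumption for illustration: the function names (`normalize`, `shingles`, `jaccard`, `deduplicate`), the word-shingle size `k`, and the 0.8 similarity threshold are hypothetical choices, not values from the paper.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of k-word shingles of a normalized document."""
    words = normalize(text).split()
    if len(words) < k:
        # Very short documents become a single shingle (or none if empty).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two shingle sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs: list[str], threshold: float = 0.8, k: int = 3) -> list[str]:
    """Keep the first occurrence of each document; drop exact and near duplicates.

    The threshold and k are illustrative assumptions, not values from the paper.
    """
    seen_hashes: set[str] = set()
    kept: list[str] = []
    kept_shingles: list[set[str]] = []
    for doc in docs:
        # Exact duplicates: identical hash after normalization.
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        # Near duplicates: high shingle overlap with any document already kept.
        doc_shingles = shingles(doc, k)
        if any(jaccard(doc_shingles, other) >= threshold for other in kept_shingles):
            continue
        seen_hashes.add(digest)
        kept.append(doc)
        kept_shingles.append(doc_shingles)
    return kept


if __name__ == "__main__":
    policies = [
        "We collect your email address and name when you register for an account on our website.",
        "We collect your  email address and name when you register for an account on our website.",
        "We collect your email address and name when you register for an account on our site.",
        "Cookies help us remember your preferences across visits.",
    ]
    for policy in deduplicate(policies):
        print(policy)
```

This sketch compares each document against every document already kept, so its cost grows quadratically with corpus size. At the scale of a million policies, an approximate scheme such as MinHash with locality-sensitive hashing would be a more realistic choice; the pairwise version is shown only because it makes the idea easy to follow.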

