CL · LG · Apr 9, 2024

Extractive text summarisation of Privacy Policy documents using machine learning approaches

arXiv:2404.08686v1
Originality: Synthesis-oriented
AI Analysis

This work addresses the problem of efficiently extracting essential sentences for GDPR compliance in privacy policies, but it is incremental as it builds on existing clustering methods.

The paper tackles the summarisation of Privacy Policy documents by developing two extractive models, one based on K-means clustering and one on Pre-determined Centroid (PDC) clustering. The PDC model outperforms the K-means model by 27% in SSD and 24% in ROUGE scores.

This work demonstrates two Privacy Policy (PP) summarisation models based on two different clustering algorithms: K-means clustering and Pre-determined Centroid (PDC) clustering. K-means was chosen for the first model after an extensive evaluation of ten commonly used clustering algorithms. The summariser model based on the PDC clustering algorithm summarises PP documents by grouping individual sentences according to the Euclidean distance from each sentence to the pre-defined cluster centres. The cluster centres are defined according to the 14 essential topics that the General Data Protection Regulation (GDPR) requires in any privacy notice. The PDC model outperformed the K-means model on two evaluation measures, Sum of Squared Distance (SSD) and ROUGE, by 27% and 24% respectively. This result contrasts with the K-means model's better performance in the general clustering of sentence vectors before the task-specific evaluation was run. This indicates the effectiveness of applying task-specific fine-tuning measures to unsupervised machine-learning models. The summarisation mechanisms implemented in this paper demonstrate how to efficiently extract the essential sentences that should be included in any PP document. The summariser models could be further developed into an application that tests PP documents for compliance with GDPR (or any other data privacy legislation).
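The PDC idea described in the abstract, assigning each sentence to the nearest of a fixed set of cluster centres by Euclidean distance and scoring the result with SSD, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names (`pdc_assign`, `summarise`, `ssd`), the 2-D toy vectors, the three toy centres (the paper uses 14 GDPR topics), and the rule of keeping the single closest sentence per centre are all assumptions made for illustration.

```python
import math


def pdc_assign(sentence_vecs, centres):
    """Label each sentence vector with the index of its nearest fixed centre."""
    labels = []
    for vec in sentence_vecs:
        dists = [math.dist(vec, centre) for centre in centres]
        labels.append(dists.index(min(dists)))
    return labels


def summarise(sentences, sentence_vecs, centres):
    """Keep, for each centre, the sentence closest to it (assumed selection rule)."""
    labels = pdc_assign(sentence_vecs, centres)
    best = {}  # centre index -> (distance, sentence)
    for i, k in enumerate(labels):
        d = math.dist(sentence_vecs[i], centres[k])
        if k not in best or d < best[k][0]:
            best[k] = (d, sentences[i])
    # Centres with no assigned sentence are simply absent from the summary.
    return {k: sentence for k, (_, sentence) in sorted(best.items())}


def ssd(sentence_vecs, centres, labels):
    """Sum of squared Euclidean distances from each vector to its assigned centre."""
    return sum(math.dist(v, centres[k]) ** 2 for v, k in zip(sentence_vecs, labels))


# Toy data: hypothetical 2-D embeddings and three hypothetical topic centres.
centres = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
sentences = [
    "We collect your email address.",
    "We collect data you provide.",
    "You may request erasure of your data.",
    "You have rights over your data.",
]
sentence_vecs = [(1.0, 0.0), (0.5, 0.5), (9.0, 1.0), (8.0, 0.0)]

labels = pdc_assign(sentence_vecs, centres)
summary = summarise(sentences, sentence_vecs, centres)
score = ssd(sentence_vecs, centres, labels)
```

Because the centres are fixed in advance rather than learned, a centre that attracts no sentence simply contributes nothing to the summary. In a compliance-checking application of the kind the abstract suggests, such an empty centre could plausibly be read as a sign that a topic is missing, though the abstract does not specify this.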

