CL | May 22, 2025

PIIvot: A Lightweight NLP Anonymization Framework for Question-Anchored Tutoring Dialogues

arXiv:2505.16931v1 | 6 citations | h-index: 6 | Has Code | EMNLP
Originality: Incremental advance
AI Analysis

This work lowers the PII anonymization barrier that researchers face when sharing educational dialogue data, though it appears incremental because it builds on existing PII identification methods.

The authors address PII anonymization for open-science data sharing by developing PIIvot, a lightweight framework that uses knowledge of the data context to simplify PII detection, and contribute QATD-2k, which they describe as the largest open-source real-world tutoring dataset of its kind.

Personally identifiable information (PII) anonymization is a high-stakes task that poses a barrier to many open-science data sharing initiatives. While PII identification has made large strides in recent years, in practice, error thresholds and the recall/precision trade-off still limit the uptake of these anonymization pipelines. We present PIIvot, a lighter-weight framework for PII anonymization that leverages knowledge of the data context to simplify the PII detection problem. To demonstrate its effectiveness, we also contribute QATD-2k, the largest open-source real-world tutoring dataset of its kind, to support the demand for quality educational dialogue data.

