PIIvot: A Lightweight NLP Anonymization Framework for Question-Anchored Tutoring Dialogues
PIIvot addresses the barrier that PII anonymization poses for researchers sharing educational dialogue data, though the contribution appears incremental because it builds on existing PII identification methods.
The authors tackle PII anonymization for open-science data sharing by developing PIIvot, a lightweight framework that uses knowledge of the data context to simplify PII detection, and contribute QATD-2k, which they describe as the largest open-source real-world tutoring dataset of its kind.
Personally identifiable information (PII) anonymization is a high-stakes task that poses a barrier to many open-science data sharing initiatives. While PII identification has made large strides in recent years, in practice, error thresholds and the recall/precision trade-off still limit the uptake of these anonymization pipelines. We present PIIvot, a lighter-weight framework for PII anonymization that leverages knowledge of the data context to simplify the PII detection problem. To demonstrate its effectiveness, we also contribute QATD-2k, the largest open-source real-world tutoring dataset of its kind, to support the demand for quality educational dialogue data.
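The recall/precision trade-off mentioned above can be made concrete with a small sketch. This is not PIIvot's code: the function name, the confidence scores, and the labels below are all hypothetical, chosen only to show how moving a detection threshold trades missed PII (lower recall) against over-redaction (lower precision).

```python
from typing import List, Tuple


def precision_recall(
    scores: List[float], labels: List[bool], threshold: float
) -> Tuple[float, float]:
    """Precision and recall when tokens scoring >= threshold are flagged as PII."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))        # PII correctly flagged
    fp = sum(p and not y for p, y in zip(predicted, labels))    # non-PII flagged (over-redaction)
    fn = sum((not p) and y for p, y in zip(predicted, labels))  # PII missed (a leak)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Hypothetical detector confidences and gold labels (True = token is PII).
scores = [0.95, 0.80, 0.60, 0.40, 0.20]
labels = [True, True, False, True, False]

strict = precision_recall(scores, labels, threshold=0.7)   # fewer flags, one PII token missed
lenient = precision_recall(scores, labels, threshold=0.3)  # all PII caught, one false alarm
print(strict, lenient)
```

In an anonymization setting a missed PII token is usually the costlier error, which is one reason a pipeline might favor the lenient threshold and accept some over-redaction.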