De-identification of Privacy-related Entities in Job Postings
This work addresses privacy-preserving data handling in job postings, an incremental extension of de-identification from the medical domain to new domains.
The paper tackles de-identification of privacy-related entities in job postings by creating the JobStack corpus and experimenting with LSTM, Transformer, and BERT models. It finds that auxiliary data improves performance and that vanilla BERT outperforms a domain-specific BERT model.
De-identification is the task of detecting privacy-related entities in text, such as person names, emails and contact data. It has been well-studied within the medical domain. The need for de-identification technology is increasing, as privacy-preserving data handling is in high demand in many domains. In this paper, we focus on job postings. We present JobStack, a new corpus for de-identification of personal data in job vacancies on Stackoverflow. We introduce baselines, comparing Long Short-Term Memory (LSTM) and Transformer models. To improve upon these baselines, we experiment with contextualized embeddings and distantly related auxiliary data via multi-task learning. Our results show that auxiliary data improves de-identification performance. Surprisingly, vanilla BERT turned out to be more effective than a BERT model trained on other portions of Stackoverflow.
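To make the task definition concrete, the following is a minimal sketch of de-identification framed as token-level tagging. It is not the paper's method: the learned LSTM, Transformer, or BERT tagger is replaced here by a toy regular expression that detects only email addresses. The label name `CONTACT`, the BIO tag format, and the helper names `tag_tokens` and `mask` are assumptions made for illustration.

```python
import re

# Hypothetical pattern: a deliberately simple email matcher, for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def tag_tokens(text):
    """Return (token, tag) pairs with BIO-style tags; only emails are detected."""
    pairs = []
    for token in text.split():
        # Strip surrounding punctuation so "jobs@example.com." still matches.
        core = token.strip(".,;:()<>")
        # "B-CONTACT" is an assumed label name, not taken from the corpus.
        tag = "B-CONTACT" if EMAIL_RE.fullmatch(core) else "O"
        pairs.append((token, tag))
    return pairs


def mask(text, placeholder="[CONTACT]"):
    """Replace every token tagged as an entity with a placeholder."""
    return " ".join(
        placeholder if tag != "O" else token for token, tag in tag_tokens(text)
    )
```

Calling `mask("Send your CV to jobs@example.com today")` replaces the email token with the placeholder and leaves the other tokens unchanged. A trained sequence labeller would play the role of `tag_tokens` and cover further entity types such as person names.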