AI

Oct 8, 2025

An Evaluation Study of Hybrid Methods for Multilingual PII Detection

Harshit Rajgarhia, Suryam Gupta, Asif Shaik, Gulipalli Praveen Kumar, Y Santhoshraj, Sanka Nithya Tanvy Nishitha, Abhishek Mukherji

arXiv:2510.07551v1

5.82 citations

h-index: 1

Originality: Incremental advance

AI Analysis

The paper provides a scalable solution for privacy compliance in low-resource language applications, though the advance is incremental because it builds on existing methods.

The paper tackled the problem of detecting Personally Identifiable Information (PII) in low-resource languages by developing RECAP, a hybrid framework combining regular expressions with large language models. The result showed that RECAP outperformed fine-tuned NER models by 82% and zero-shot LLMs by 17% in weighted F1-score across 13 locales.

The detection of Personally Identifiable Information (PII) is critical for privacy compliance but remains challenging in low-resource languages due to linguistic diversity and limited annotated data. We present RECAP, a hybrid framework that combines deterministic regular expressions with context-aware large language models (LLMs) for scalable PII detection across 13 low-resource locales. RECAP's modular design supports over 300 entity types without retraining, using a three-phase refinement pipeline for disambiguation and filtering. Benchmarked with nervaluate, our system outperforms fine-tuned NER models by 82% and zero-shot LLMs by 17% in weighted F1-score. This work offers a scalable and adaptable solution for efficient PII detection in compliance-focused applications.
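The hybrid idea in the abstract, deterministic regular expressions proposing candidates and a context-aware model deciding which to keep, can be illustrated with a minimal Python sketch. This is not the authors' RECAP implementation and does not reproduce its three-phase refinement pipeline. The patterns, the `Candidate` type, and the function names are all hypothetical, and the LLM call is replaced by a simple keyword heuristic so that the example runs without any external service.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Candidate:
    entity_type: str
    text: str
    start: int
    end: int


# Hypothetical, deliberately loose patterns (assumption: not taken from the paper).
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def regex_candidates(text: str) -> List[Candidate]:
    """Stage 1: high-recall candidate spans from deterministic patterns."""
    found = []
    for entity_type, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append(Candidate(entity_type, m.group(), m.start(), m.end()))
    return sorted(found, key=lambda c: c.start)


def keyword_context_check(text: str, cand: Candidate, window: int = 30) -> bool:
    """Stage 2 stand-in: a real system would ask an LLM whether the span is
    PII in context. Here, a PHONE candidate is kept only if a cue word
    appears in the `window` characters before it (assumption for the demo)."""
    if cand.entity_type != "PHONE":
        return True
    left = text[max(0, cand.start - window):cand.start].lower()
    return any(cue in left for cue in ("call", "phone", "tel", "mobile"))


def detect_pii(
    text: str,
    context_check: Callable[[str, Candidate], bool] = keyword_context_check,
) -> List[Candidate]:
    """Keep only the regex candidates that the context check accepts."""
    return [c for c in regex_candidates(text) if context_check(text, c)]


sample = ("Call me at +91 98765 43210 or mail a.b@example.com. "
          "Order 1234567890 shipped.")
for c in detect_pii(sample):
    print(c.entity_type, c.text)
```

On the sample sentence, the phone pattern also matches the order number `1234567890`, and the context check filters it out because no phone-related cue precedes it. Swapping `keyword_context_check` for a function that queries an LLM would keep the same interface, which is the kind of modularity the abstract describes.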
