From Automation to Collaboration: Human-in-the-Loop Methods for Safe and Trustworthy NLP
For NLP practitioners and researchers, this survey identifies critical gaps and future directions in human-in-the-loop methods for safety and trustworthiness. As a review, it reports no novel experimental results.
This survey examines human-in-the-loop methods for safe and trustworthy NLP, highlighting gaps in scalable probing, sustainable robustness benchmarks, low-resource settings, and governance of private systems, and outlines research directions for adaptive auditing, collaborative evaluation, and accountable deployment.
Large language models are widely deployed in high-stakes NLP tasks, yet risks such as bias, hallucination, adversarial vulnerability, and unreliable generalization remain. Probe-based auditing reveals inconsistencies in model behavior. Adversarial text generation uncovers robustness gaps, especially in lower-resourced languages with limited benchmarks. Enterprise text-to-SQL settings expose the difficulty of validating outputs over large, private databases. Human supervision is essential for probe validation, adversarial verification, and domain-specific annotation, but it is costly and hard to scale. This survey examines recent human-in-the-loop methods that shift NLP from automation toward collaboration for safety and trustworthiness. We review how human expertise supports auditing, robustness evaluation, data construction, and model steering. Our review highlights gaps in scalable probing, sustainable robustness benchmarks, low-resource settings, and governance of private systems. We outline practical research directions for adaptive auditing, collaborative evaluation, and accountable deployment.