CLCYJun 18, 2024

ToxiCloakCN: Evaluating Robustness of Offensive Language Detection in Chinese with Cloaking Perturbations

arXiv:2406.12223v135 citations
Originality Incremental advance
AI Analysis

This work addresses the challenge of robust offensive language detection for Chinese digital platforms, highlighting incremental improvements in evaluating model vulnerabilities.

The study tackled the problem of detecting offensive language in Chinese by testing state-of-the-art large language models on a perturbed dataset, revealing that models significantly underperform when faced with homophonic substitutions and emoji transformations.

Detecting hate speech and offensive language is essential for maintaining a safe and respectful digital environment. This study examines the limitations of state-of-the-art large language models (LLMs) in identifying offensive content within systematically perturbed data, with a focus on Chinese, a language particularly susceptible to such perturbations. We introduce \textsf{ToxiCloakCN}, an enhanced dataset derived from ToxiCN, augmented with homophonic substitutions and emoji transformations, to test the robustness of LLMs against these cloaking perturbations. Our findings reveal that existing models significantly underperform in detecting offensive content when these perturbations are applied. We provide an in-depth analysis of how different types of offensive content are affected by these perturbations and explore the alignment between human and model explanations of offensiveness. Our work highlights the urgent need for more advanced techniques in offensive language detection to combat the evolving tactics used to evade detection mechanisms.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes