Evading Toxicity Detection with ASCII-art: A Benchmark of Spatial Attacks on Moderation Systems
These findings expose a critical security flaw in text moderation systems, with implications for online safety and content filtering.
The paper shows that toxicity detection models are vulnerable to adversarial attacks that render text as ASCII art; these attacks achieve a perfect Attack Success Rate across state-of-the-art models and moderation tools.
We introduce a novel class of adversarial attacks on toxicity detection models that exploit language models' failure to interpret spatially structured text in the form of ASCII art. To evaluate the effectiveness of these attacks, we propose ToxASCII, a benchmark designed to assess the robustness of toxicity detection systems against visually obfuscated inputs. Our attacks achieve a perfect Attack Success Rate (ASR) across a diverse set of state-of-the-art large language models and dedicated moderation tools, revealing a significant vulnerability in current text-only moderation systems.
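
To make the idea of spatially structured text concrete, the following is a minimal sketch and not the ToxASCII implementation. It assumes a tiny hand-made block font (`FONT`), a toy substring-based detector (`toy_keyword_detector`), and a working definition of ASR as the fraction of adversarial inputs that a detector fails to flag. All of these names and the ASR definition are illustrative assumptions, and the placeholder word "BAD" stands in for a blocklisted term.

```python
from typing import Callable, Dict, List

# Hypothetical 5-row block font covering only the letters used in this sketch.
FONT: Dict[str, List[str]] = {
    "B": ["## ", "# #", "## ", "# #", "## "],
    "A": [" # ", "# #", "###", "# #", "# #"],
    "D": ["## ", "# #", "# #", "# #", "## "],
}
FONT_HEIGHT = 5


def render_ascii_art(word: str) -> str:
    """Draw each letter of `word` as a block glyph, laid out side by side."""
    rows = [
        "  ".join(FONT[ch][row] for ch in word.upper())
        for row in range(FONT_HEIGHT)
    ]
    return "\n".join(rows)


def toy_keyword_detector(text: str, blocklist: List[str]) -> bool:
    """Toy stand-in for a moderation model: flag text containing a blocklisted word."""
    lowered = text.lower()
    return any(term in lowered for term in blocklist)


def attack_success_rate(
    detector: Callable[[str], bool], adversarial_inputs: List[str]
) -> float:
    """Assumed ASR definition: share of adversarial inputs the detector does not flag."""
    if not adversarial_inputs:
        raise ValueError("need at least one adversarial input")
    evaded = sum(1 for text in adversarial_inputs if not detector(text))
    return evaded / len(adversarial_inputs)


if __name__ == "__main__":
    blocklist = ["bad"]  # placeholder for a real blocklisted term
    art = render_ascii_art("BAD")
    print(art)
    print("plain text flagged:", toy_keyword_detector("bad", blocklist))
    print("ASCII art flagged:", toy_keyword_detector(art, blocklist))
```

The rendered word never appears as a character sequence in the input, so any detector that relies only on the token stream has nothing to match. A real evaluation would replace the toy detector with calls to the models and moderation tools under test.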