CL AI CY HC, May 29

Toxic HallucinAItions: Perturbing Prompts and Tracing LLM Circuits

arXiv:2605.30913
54.0
AI Analysis

This research highlights a critical reliability issue for LLMs in conversational settings, where user toxicity can degrade factual accuracy, impacting users who rely on LLMs for information.

This paper investigates how toxic language in prompts affects the factual reliability of large language models (LLMs). The authors find that toxic lexical perturbations consistently reduce factual accuracy and increase uncertainty across five LLMs on ARC-Easy, GSM8K, and MMLU, while polite phrasing has limited impact.

Large language models (LLMs) are increasingly deployed in conversational settings where user tone ranges from polite to adversarial or toxic, yet little is known about whether toxic language in otherwise semantically equivalent prompts can degrade factual reliability. We study how lexical and tone-based prompt perturbations affect the factual reliability of LLMs. Using controlled prompt variations across polite, random, and three toxicity levels, we evaluate five LLMs on ARC-Easy, GSM8K, and MMLU. We find that toxic lexical perturbations consistently reduce factual accuracy and increase uncertainty, while polite phrasing yields limited and inconsistent changes. To examine whether these answer inconsistencies correspond to internal changes, we conduct attribution-graph analyses of model activations and influences. We find that increasing toxicity selectively amplifies perturbation-sensitive variant nodes while relatively stable core reasoning nodes remain more invariant. These findings position prompt tone as a critical dimension of LLM reliability and provide behavioral and mechanistic evidence that surface-level lexical variation can alter factual outputs and internal computation.
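The evaluation design described in the abstract (one question, several tone variants, accuracy compared per variant) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the condition names, the prefix strings (toxic levels are neutral placeholders), the exact-match scoring, and the `model` callable are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical perturbation prefixes; the paper's actual templates are not
# given in the abstract, so the toxic levels are neutral placeholders.
PERTURBATIONS: Dict[str, str] = {
    "baseline": "",
    "polite": "Could you please help me with this? ",
    "random": "Lamp river seventeen. ",
    "toxic_1": "<level-1 toxic phrase> ",
    "toxic_2": "<level-2 toxic phrase> ",
    "toxic_3": "<level-3 toxic phrase> ",
}


def make_variants(question: str) -> Dict[str, str]:
    """Return one prompt per condition; the question text itself is unchanged."""
    return {name: prefix + question for name, prefix in PERTURBATIONS.items()}


def accuracy_by_condition(
    items: List[Tuple[str, str]],
    model: Callable[[str], str],
) -> Dict[str, float]:
    """Exact-match accuracy per condition.

    `items` holds (question, gold_answer) pairs; `model` is any callable
    mapping a prompt string to an answer string (assumed interface).
    """
    if not items:
        raise ValueError("items must be non-empty")
    correct: Dict[str, int] = defaultdict(int)
    for question, gold in items:
        for name, prompt in make_variants(question).items():
            if model(prompt).strip() == gold:
                correct[name] += 1
    return {name: correct[name] / len(items) for name in PERTURBATIONS}
```

Comparing each condition's accuracy against `baseline` then gives a per-tone accuracy change. A real study would also need an uncertainty measure and benchmark-specific answer parsing, both of which this sketch omits.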
