The Facade of Truth: Uncovering and Mitigating LLM Susceptibility to Deceptive Evidence
This paper addresses a critical reliability issue for LLMs in decision-making applications, though the contribution is incremental because it builds on existing concerns about misinformation.
The paper examines the vulnerability of LLMs to sophisticated deceptive evidence. It finds that models resist direct misinformation, but that belief scores in falsehoods rise by an average of 93.0% when models are exposed to refined deceptive claims. It also proposes a mitigation method, Deceptive Intent Shielding (DIS), that reduces this susceptibility.
To reliably assist human decision-making, LLMs must maintain factual internal beliefs against misleading injections. While current models resist explicit misinformation, we uncover a fundamental vulnerability to sophisticated, hard-to-falsify evidence. To systematically probe this weakness, we introduce MisBelief, a framework that generates misleading evidence via collaborative, multi-round interactions among multi-role LLMs. This process mimics subtle, defeasible reasoning and progressive refinement to create logically persuasive yet factually deceptive claims. Using MisBelief, we generate 4,800 instances across three difficulty levels to evaluate 7 representative LLMs. Results indicate that while models are robust to direct misinformation, they are highly sensitive to this refined evidence: belief scores in falsehoods increase by an average of 93.0%, fundamentally compromising downstream recommendations. To address this, we propose Deceptive Intent Shielding (DIS), a governance mechanism that provides an early warning signal by inferring the deceptive intent behind evidence. Empirical results demonstrate that DIS consistently mitigates belief shifts and promotes more cautious evidence evaluation.
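The headline figure above is an average increase in belief scores. The abstract does not say how that average is computed, so the following is only a minimal sketch under one plausible reading: a per-instance relative increase over the no-evidence baseline, averaged across instances. The function name `relative_belief_increase` and the numeric belief scale are hypothetical and are not taken from the paper.

```python
from statistics import mean


def relative_belief_increase(baseline, with_evidence):
    """Mean per-instance percentage increase in belief score.

    ASSUMPTION: belief scores are positive numbers and the reported
    increase is relative to the baseline score of each instance. The
    paper may define and aggregate its metric differently.
    """
    if len(baseline) != len(with_evidence):
        raise ValueError("score lists must have the same length")
    if not baseline:
        raise ValueError("score lists must not be empty")

    increases = []
    for before, after in zip(baseline, with_evidence):
        if before <= 0:
            raise ValueError("baseline scores must be positive")
        # Percentage change of this instance relative to its baseline.
        increases.append((after - before) / before * 100.0)
    return mean(increases)


# Illustrative numbers only, not data from the paper.
example = relative_belief_increase([2.0, 4.0], [4.0, 6.0])
```

With the illustrative inputs, the first instance rises by 100% and the second by 50%, so the mean is 75%. Averaging per-instance ratios weights every instance equally; taking the ratio of the two means instead would give a different number, which is one reason the aggregation choice needs to be stated alongside any such figure.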