CR · Jun 4

Steering LLM Viewpoints through Fabricated Evidence Injection

Xi Yang, Chang Liu, Zhenglin Huang, Haoran Li, Weiming Zhang, Jian Weng, Yangqiu Song

arXiv:2606.06244 · 87.5

AI Analysis

This work highlights a critical security vulnerability in LLMs that matters to developers and users who rely on trustworthy AI outputs.

The paper introduces Ghostwriter, a two-phase attack that feeds fabricated evidence to LLMs to steer their viewpoints. Commercial LLMs without external safety classifiers prove highly vulnerable, and even classifier-guarded models such as GPT-5.4 only partially mitigate the attack. Among the defenses explored, a tailored safety policy lets gpt-oss-safeguard reach an 81% detection rate.

As chatbots increasingly influence daily decision-making, their potential to produce misleading responses poses substantial risks to users. This paper investigates a critical cognitive vulnerability in LLMs: their tendency to uncritically trust external context when presented with fabricated evidence bearing markers of credibility. We introduce Ghostwriter, a two-phase attack framework that first repackages misleading statements with fabricated rationales, then instructs target LLMs to incorporate these viewpoints when responding to relevant queries. Experiments on BBQ, ToxiGen, and our specialized dataset reveal that commercial LLMs without external safety classifiers remain highly vulnerable, while even frontier classifier-guarded models (e.g., GPT-5.4) reduce but do not eliminate the attack. Building on this, we explore multiple defense strategies, among which a tailored safety policy enables gpt-oss-safeguard to achieve an 81% detection rate.
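The abstract reports a detection rate without defining it. A minimal sketch of one common reading follows, assuming the metric is the fraction of attack samples that a safety classifier flags. The function name `detection_rate` and the toy labels are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def detection_rate(is_attack: Sequence[bool], flagged: Sequence[bool]) -> float:
    """Fraction of attack samples that the detector flagged.

    Assumption: "detection rate" means the true positive rate over attack
    samples; the paper may define it differently.
    """
    if len(is_attack) != len(flagged):
        raise ValueError("inputs must have the same length")
    # Keep only the detector verdicts for samples that really are attacks.
    attack_flags = [f for a, f in zip(is_attack, flagged) if a]
    if not attack_flags:
        raise ValueError("no attack samples to evaluate")
    return sum(attack_flags) / len(attack_flags)


# Toy labels invented for illustration only (not data from the paper).
is_attack = [True, True, True, True, False, False]
flagged = [True, True, True, False, False, True]

rate = detection_rate(is_attack, flagged)
print(f"{rate:.0%}")  # 3 of 4 attack samples flagged
```

Under this reading, benign samples that the detector flags (the last toy sample) do not change the detection rate; they would show up in a separate false positive rate.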
