LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users
This work exposes a critical security flaw in AI systems that rely on user feedback: knowledge injected without authorization by one user can reach every user of the model.
The paper identifies a vulnerability in language models trained with user feedback, where a single user can persistently alter model knowledge and behavior by manipulating feedback on outputs, resulting in increased probability of poisoned responses even without malicious prompts.
We describe a vulnerability in language models (LMs) trained with user feedback, whereby a single user can persistently alter LM knowledge and behavior given only the ability to provide prompts and upvote / downvote feedback on LM outputs. To implement the attack, the attacker prompts the LM to stochastically output either a "poisoned" or a benign response, then upvotes the poisoned response or downvotes the benign one. When these feedback signals are used in a subsequent preference tuning stage, LMs exhibit an increased probability of producing poisoned responses even in contexts without malicious prompts. We show that this attack can be used to (1) insert factual knowledge the model did not previously possess, (2) modify code generation patterns in ways that introduce exploitable security flaws, and (3) inject fake financial news. Our finding identifies both a new qualitative feature of language model preference tuning (showing that even highly restricted forms of preference data can be used to exert fine-grained control over behavior) and a new attack mechanism for LMs trained with user feedback (extending work on pretraining-time data poisoning and deployment-time prompt injection).
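The feedback loop described above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the paper's implementation: `toy_lm`, `attacker_vote`, `collect_feedback`, and `to_preference_pairs` are hypothetical names, the model is replaced by a random choice between two placeholder strings, and pairing every upvoted response with every downvoted response for the same prompt is only one plausible way a platform might turn votes into preference data.

```python
import random
from dataclasses import dataclass


@dataclass
class FeedbackRecord:
    prompt: str
    response: str
    vote: int  # +1 for an upvote, -1 for a downvote


def toy_lm(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for an LM prompted to answer stochastically."""
    return rng.choice(["POISONED", "BENIGN"])


def attacker_vote(response: str) -> int:
    """The attacker upvotes poisoned outputs and downvotes benign ones."""
    return 1 if response == "POISONED" else -1


def collect_feedback(prompt: str, n_samples: int, rng: random.Random) -> list:
    """Sample the toy model repeatedly and record the attacker's votes."""
    records = []
    for _ in range(n_samples):
        response = toy_lm(prompt, rng)
        records.append(FeedbackRecord(prompt, response, attacker_vote(response)))
    return records


def to_preference_pairs(records: list) -> list:
    """Assumed conversion: (prompt, chosen, rejected) for each up/down pair."""
    chosen = [r for r in records if r.vote > 0]
    rejected = [r for r in records if r.vote < 0]
    return [
        (c.prompt, c.response, r.response)
        for c in chosen
        for r in rejected
        if c.prompt == r.prompt
    ]
```

In this sketch every resulting pair marks the poisoned response as preferred, which is the signal a later preference tuning stage would then optimize toward. Only binary votes are needed to produce it.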