Learning to Poison Large Language Models for Downstream Manipulation
This work addresses security risks for LLM users by exposing data poisoning vulnerabilities and proposing mitigations, though the contribution is incremental because it builds on existing attack methods.
The paper tackles the vulnerability of Large Language Models to data poisoning attacks during supervised fine-tuning. It proposes a gradient-guided backdoor trigger learning algorithm that achieves a high success rate in compromising model outputs on tasks such as sentiment analysis and question answering, and it introduces defense strategies that substantially reduce the performance decline caused by poisoning.
The advent of Large Language Models (LLMs) has marked significant achievements in language processing and reasoning capabilities. Despite these advancements, LLMs are vulnerable to data poisoning attacks, in which an adversary inserts backdoor triggers into training data to manipulate outputs. This work identifies additional security risks in LLMs by designing a new data poisoning attack tailored to exploit the supervised fine-tuning (SFT) process. We propose a novel gradient-guided backdoor trigger learning (GBTL) algorithm that identifies adversarial triggers efficiently, so that they evade detection by conventional defenses while maintaining content integrity. Through experimental validation across various language model tasks, including sentiment analysis, domain generation, and question answering, our poisoning strategy demonstrates a high success rate in compromising the outputs of various LLMs. We further propose two defense strategies against data poisoning attacks, in-context learning (ICL) and continuous learning (CL), which effectively rectify the behavior of LLMs and significantly reduce the decline in performance. Our work highlights the significant security risks present during SFT of LLMs and the necessity of safeguarding LLMs against data poisoning attacks.
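To make the idea of gradient-guided trigger search concrete, the following is a minimal sketch on a toy model, not the authors' GBTL implementation. It assumes a hypothetical bag-of-embeddings classifier (random embedding matrix `E`, linear weights `w`) in place of an LLM, and it uses a generic first-order token-swap search: the gradient of the attacker's loss with respect to the trigger embedding ranks candidate tokens, and the true loss is then evaluated on the top candidates. All names, sizes, and hyperparameters here are illustrative assumptions.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def append_trigger(inputs, trigger):
    # Append one trigger token to every input sequence in the batch.
    column = np.full((inputs.shape[0], 1), trigger)
    return np.concatenate([inputs, column], axis=1)

def batch_loss(E, w, inputs, trigger):
    # Attacker's loss: push every triggered input toward the target label 1.
    logits = E[append_trigger(inputs, trigger)].mean(axis=1) @ w
    return softplus(-logits).mean()

def trigger_grad(E, w, inputs, trigger):
    # Gradient of the batch loss with respect to the trigger's embedding.
    n_tokens = inputs.shape[1] + 1
    logits = E[append_trigger(inputs, trigger)].mean(axis=1) @ w
    dloss_dlogit = -np.exp(-softplus(logits))  # equals -sigmoid(-logit)
    return dloss_dlogit.mean() * w / n_tokens

def learn_trigger(E, w, inputs, init_trigger=0, top_k=5, steps=3):
    # First-order search: rank tokens by the predicted loss decrease,
    # then keep a candidate only if it lowers the true loss.
    trigger = init_trigger
    best = batch_loss(E, w, inputs, trigger)
    for _ in range(steps):
        grad = trigger_grad(E, w, inputs, trigger)
        scores = -(E @ grad)  # higher score = larger predicted decrease
        for cand in np.argsort(scores)[-top_k:]:
            loss = batch_loss(E, w, inputs, int(cand))
            if loss < best:
                trigger, best = int(cand), loss
    return trigger, best

# Toy setup (all values are assumptions for illustration).
rng = np.random.default_rng(0)
E = rng.normal(size=(50, 8))               # 50-token vocabulary, 8-dim embeddings
w = rng.normal(size=8)                     # linear classifier weights
inputs = rng.integers(0, 50, size=(16, 6)) # 16 sequences of 6 tokens

initial_loss = batch_loss(E, w, inputs, 0)
trigger, final_loss = learn_trigger(E, w, inputs)
print(f"trigger token id: {trigger}, loss: {initial_loss:.3f} -> {final_loss:.3f}")
```

In a real attack on an LLM, the gradient would come from backpropagation through the model rather than a closed-form expression, and the candidate set would typically be filtered so that the trigger stays inconspicuous. The sketch only illustrates why gradients make the search over a large vocabulary cheap: one gradient computation scores every token at once.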