CR AI CL LG SY · Sep 30, 2024

Mitigating Backdoor Threats to Large Language Models: Advancement and Challenges

Qin Liu, Wenjie Mo, Terry Tong, Jiashu Xu, Fei Wang, Chaowei Xiao, Muhao Chen

Harvard

arXiv:2409.19993v1 · 15.420 citations · h-index: 45

Originality: Synthesis-oriented

AI Analysis

This paper addresses cybersecurity vulnerabilities in LLMs, which matter for domains such as Web search and healthcare. Its contribution is incremental, however, because it primarily reviews existing research rather than proposing new methods.

The paper surveys backdoor threats to Large Language Models (LLMs), where adversaries inject malicious behaviors via manipulated training data, and covers recent advancements in defense and detection strategies to mitigate these risks.

The advancement of Large Language Models (LLMs) has significantly impacted various domains, including Web search, healthcare, and software development. However, as these models scale, they become more vulnerable to cybersecurity risks, particularly backdoor attacks. By exploiting the potent memorization capacity of LLMs, adversaries can easily inject backdoors into LLMs by manipulating a small portion of training data, leading to malicious behaviors in downstream applications whenever the hidden backdoor is activated by the pre-defined triggers. Moreover, emerging learning paradigms like instruction tuning and reinforcement learning from human feedback (RLHF) exacerbate these risks as they rely heavily on crowdsourced data and human feedback, which are not fully controlled. In this paper, we present a comprehensive survey of emerging backdoor threats to LLMs that appear during LLM development or inference, and cover recent advancement in both defense and detection strategies for mitigating backdoor threats to LLMs. We also outline key challenges in addressing these threats, highlighting areas for future research.
