CRAIFeb 21, 2023

BadGPT: Exploring Security Vulnerabilities of ChatGPT via Backdoor Attacks to InstructGPT

arXiv:2304.12298v1108 citationsh-index: 35
Originality Highly original
AI Analysis

This addresses security vulnerabilities in widely used AI models like ChatGPT, which is an incremental but important step for AI safety.

The authors tackled the security of ChatGPT by proposing BadGPT, the first backdoor attack against RL fine-tuning in language models, showing that an attacker can manipulate generated text on the IMDB dataset.

Recently, ChatGPT has gained significant attention in research due to its ability to interact with humans effectively. The core idea behind this model is reinforcement learning (RL) fine-tuning, a new paradigm that allows language models to align with human preferences, i.e., InstructGPT. In this study, we propose BadGPT, the first backdoor attack against RL fine-tuning in language models. By injecting a backdoor into the reward model, the language model can be compromised during the fine-tuning stage. Our initial experiments on movie reviews, i.e., IMDB, demonstrate that an attacker can manipulate the generated text through BadGPT.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes