LG · AI · Nov 12, 2025

Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment

arXiv:2511.09105v1 · 1 citation · h-index: 10
Originality: Incremental advance
AI Analysis

This work addresses vulnerabilities in RLHF/DPO pipelines for LLM deployment and provides tools to evaluate robustness against low-cost poisoning attacks. It is an incremental advance: it adds a theoretical treatment to a line of existing empirical studies.

The paper examines the theoretical foundations of data poisoning attacks on LLM alignment. It studies the minimum-cost label-flipping attack needed to steer an LLM's policy toward an attacker's target and derives lower and upper bounds on the attack cost. It also shows that existing attacks can be post-processed to significantly reduce the number of label flips. Empirically, the cost reductions are most pronounced when the reward model's feature dimension is small relative to the dataset size.

Large language models (LLMs) are increasingly deployed in real-world systems, making it critical to understand their vulnerabilities. While data poisoning attacks during RLHF/DPO alignment have been studied empirically, their theoretical foundations remain unclear. We investigate the minimum-cost poisoning attack required to steer an LLM's policy toward an attacker's target by flipping preference labels during RLHF/DPO, without altering the compared outputs. We formulate this as a convex optimization problem with linear constraints, deriving lower and upper bounds on the minimum attack cost. As a byproduct of this theoretical analysis, we show that any existing label-flipping attack can be post-processed via our proposed method to reduce the number of label flips required while preserving the intended poisoning effect. Empirical results demonstrate that this cost-minimization post-processing can significantly reduce poisoning costs over baselines, particularly when the reward model's feature dimension is small relative to the dataset size. These findings highlight fundamental vulnerabilities in RLHF/DPO pipelines and provide tools to evaluate their robustness against low-cost poisoning attacks.
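The post-processing idea can be illustrated with a toy sketch. It is not the paper's algorithm, and it rests on an assumption the abstract does not spell out: a Bradley-Terry reward model that is linear in a feature map, r(x) = θ·φ(x). Under that assumption the log-likelihood of a binary preference label y_i on a pair with feature difference d_i is y_i (θ·d_i) − log(1 + exp(θ·d_i)), so the labels enter only through the vector Σ_i y_i d_i. Two label vectors with the same sum give the same fitted reward model, and an attack can be swapped for a cheaper one with the same sum. The function names and data below are hypothetical, and the exhaustive search stands in for the convex program described in the abstract.

```python
from itertools import product


def sufficient_stat(diffs, labels):
    """Return sum_i labels[i] * diffs[i] as a tuple of length d."""
    dim = len(diffs[0])
    return tuple(sum(y * d[k] for y, d in zip(labels, diffs)) for k in range(dim))


def hamming(a, b):
    """Number of positions where two label vectors differ (the flip count)."""
    return sum(u != v for u, v in zip(a, b))


def min_flip_postprocess(diffs, clean, poisoned, tol=1e-9):
    """Brute-force search (toy sizes only) for the label vector closest to
    `clean` that has the same sufficient statistic as `poisoned`."""
    target = sufficient_stat(diffs, poisoned)
    best = tuple(poisoned)
    for cand in product((0, 1), repeat=len(clean)):
        stat = sufficient_stat(diffs, cand)
        same = all(abs(s - t) <= tol for s, t in zip(stat, target))
        if same and hamming(cand, clean) < hamming(best, clean):
            best = cand
    return best


# Hypothetical feature differences phi(a_i) - phi(b_i) for four comparisons.
diffs = [(1.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 1.0)]
clean = (1, 1, 1, 1)       # original preference labels
poisoned = (0, 0, 1, 1)    # an existing attack that flips samples 0 and 1

cheaper = min_flip_postprocess(diffs, clean, poisoned)
print(hamming(poisoned, clean), hamming(cheaper, clean))  # 2 1
```

Here flipping the single comparison with difference (2, 0) has the same effect on the sum as flipping the two comparisons with difference (1, 0), so the attack cost drops from two flips to one. The sum has only as many coordinates as the feature dimension, so a small feature dimension with a large dataset leaves many label vectors sharing the same sum. That is consistent with the regime in which the abstract reports the largest savings.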
