AI · Sep 9, 2025

Transferable Direct Prompt Injection via Activation-Guided MCMC Sampling

arXiv:2509.07617v1 · 2 citations · h-index: 11 · EMNLP
Originality: Incremental advance
AI Analysis

The work addresses prompt-injection vulnerabilities in LLMs, which matter to both users and developers, but it is an incremental advance because it builds on existing attack methods.

The paper tackles Direct Prompt Injection attacks on Large Language Models by proposing an activation-guided framework that combines an Energy-based Model with MCMC sampling, achieving a 49.6% attack success rate across five LLMs and a 34.6% improvement over human-crafted prompts.

Direct Prompt Injection (DPI) attacks pose a critical security threat to Large Language Models (LLMs) due to their low barrier of execution and high potential damage. To address the impracticality of existing white-box/gray-box methods and the poor transferability of black-box methods, we propose an activation-guided prompt injection attack framework. We first construct an Energy-based Model (EBM) using activations from a surrogate model to evaluate the quality of adversarial prompts. Guided by the trained EBM, we employ token-level Markov Chain Monte Carlo (MCMC) sampling to adaptively optimize adversarial prompts, thereby enabling gradient-free black-box attacks. Experimental results demonstrate superior cross-model transferability: the framework achieves a 49.6% attack success rate (ASR) across five mainstream LLMs, a 34.6% improvement over human-crafted prompts, and maintains 36.6% ASR on unseen task scenarios. Interpretability analysis reveals a correlation between activations and attack effectiveness, highlighting the critical role of semantic patterns in transferable vulnerability exploitation.
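As a rough illustration of what "gradient-free, token-level MCMC guided by an energy function" can look like, the sketch below runs a generic Metropolis-Hastings sampler over token sequences. It is not the paper's implementation. The abstract does not specify the proposal distribution, temperature, or EBM architecture, so the uniform single-token proposal, the `temperature` parameter, and every name here (`metropolis_token_search`, `toy_energy`, `TARGET`, `VOCAB`) are assumptions made for illustration. In particular, `toy_energy` is a harmless stand-in (Hamming distance to a fixed sequence) for a trained EBM that would score prompts from surrogate-model activations.

```python
import math
import random


def metropolis_token_search(energy, vocab, init_tokens, steps=2000,
                            temperature=1.0, seed=0):
    """Generic token-level Metropolis-Hastings that favours low-energy sequences.

    `energy` is any callable mapping a token list to a float; lower is better.
    Only energy evaluations are used, so no gradients are required.
    """
    rng = random.Random(seed)
    current = list(init_tokens)
    current_e = energy(current)
    best, best_e = list(current), current_e

    for _ in range(steps):
        # Symmetric proposal (assumed): pick one position uniformly and
        # resample its token uniformly from the vocabulary.
        pos = rng.randrange(len(current))
        proposal = list(current)
        proposal[pos] = rng.choice(vocab)
        proposal_e = energy(proposal)

        # Metropolis rule for a symmetric proposal:
        # accept with probability min(1, exp(-(E_new - E_old) / T)).
        delta = proposal_e - current_e
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_e = proposal, proposal_e
            if current_e < best_e:
                best, best_e = list(current), current_e

    return best, best_e


# Hypothetical stand-in for a trained EBM: Hamming distance to a fixed target.
TARGET = ["a", "b", "c", "d"]
VOCAB = ["a", "b", "c", "d", "e"]


def toy_energy(tokens):
    return sum(t != g for t, g in zip(tokens, TARGET))


best, best_e = metropolis_token_search(
    toy_energy, VOCAB, ["e"] * 4, steps=2000, temperature=0.5
)
print(best, best_e)
```

The sampler treats the energy function as a black box, which is what makes this style of search gradient-free. Swapping `toy_energy` for a learned scorer would change only the `energy` argument, not the sampling loop.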

