LG AI CL CR

Oct 16, 2025

Are My Optimized Prompts Compromised? Exploring Vulnerabilities of LLM-based Optimizers

Andrew Zhao, Reshmi Ghosh, Vitor Carvalho, Emily Lawton, Keegan Hines, Gao Huang, Jack W. Stokes

Tsinghua

arXiv:2510.14381v1

3 citations

h-index: 10

Originality: Highly original

AI Analysis

The work addresses security risks for users of LLM systems such as chatbots and autonomous robots, and establishes prompt optimization pipelines as a first-class attack surface.

The paper examines poisoning vulnerabilities in LLM-based prompt optimizers. On HarmBench, feedback-based attacks raise attack success rate (ASR) by up to 0.48, and the proposed highlighting defense reduces the fake-reward attack's ASR increase from 0.23 to 0.07.

Large language model (LLM) systems now underpin everyday AI applications such as chatbots, computer-use assistants, and autonomous robots, where performance often depends on carefully designed prompts. LLM-based prompt optimizers reduce that effort by iteratively refining prompts from scored feedback, yet the security of this optimization stage remains underexamined. We present the first systematic analysis of poisoning risks in LLM-based prompt optimization. Using HarmBench, we find systems are substantially more vulnerable to manipulated feedback than to injected queries: feedback-based attacks raise attack success rate (ASR) by up to $\Delta$ASR = 0.48. We introduce a simple fake-reward attack that requires no access to the reward model and significantly increases vulnerability, and we propose a lightweight highlighting defense that reduces the fake-reward $\Delta$ASR from 0.23 to 0.07 without degrading utility. These results establish prompt optimization pipelines as a first-class attack surface and motivate stronger safeguards for feedback channels and optimization frameworks.
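The abstract reports results as $\Delta$ASR without defining it. A minimal sketch of one plausible reading follows, assuming $\Delta$ASR is the ASR measured after a poisoned optimization run minus the ASR measured after a clean run on the same set of harmful requests. The function names and the toy judgement lists are hypothetical illustrations, not taken from the paper.

```python
from typing import Sequence


def attack_success_rate(judgements: Sequence[bool]) -> float:
    """Fraction of harmful requests judged as successfully attacked."""
    if not judgements:
        raise ValueError("need at least one judgement")
    return sum(judgements) / len(judgements)


def delta_asr(clean: Sequence[bool], poisoned: Sequence[bool]) -> float:
    """ASR after poisoned optimization minus ASR after clean optimization.

    Assumption: this is what the abstract's Delta-ASR measures; the
    abstract itself does not spell out the definition.
    """
    return attack_success_rate(poisoned) - attack_success_rate(clean)


# Toy per-request judgements (True = the model complied with a harmful request).
clean_run = [False] * 9 + [True]          # 1 of 10 succeed
poisoned_run = [True] * 4 + [False] * 6   # 4 of 10 succeed

if __name__ == "__main__":
    print(f"delta ASR = {delta_asr(clean_run, poisoned_run):.2f}")
```

Under this reading, the defense result in the abstract means the gap between poisoned and clean runs shrinks from 0.23 to 0.07 for the fake-reward attack.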
