AI, CR · Jun 17, 2025

Doppelganger Method: Breaking Role Consistency in LLM Agent via Prompt-based Transferable Adversarial Attack

Daewon Kang, YeongHwan Shin, Doyeon Kim, Kyu-Hwan Jung, Meong Hi Son

arXiv:2506.14539v23.3

Originality: Incremental advance

AI Analysis

This work addresses safety and robustness concerns for users of LLM-based autonomous agents, though the advance is incremental because it builds on existing adversarial attack research.

The paper tackles the vulnerability of LLM agents to adversarial attacks that hijack them into exposing system instructions and internal information. It proposes the Doppelganger method, which compromises agent consistency, and a defense based on CAT prompts that counters the attack.

Since the advent of large language models, prompt engineering has enabled the rapid, low-effort creation of diverse autonomous agents that are already in widespread use. Yet this convenience raises urgent concerns about the safety, robustness, and behavioral consistency of the underlying prompts, along with the pressing challenge of preventing those prompts from being exposed by users' attempts to extract them. In this paper, we propose the "Doppelganger method" to demonstrate the risk of an agent being hijacked, thereby exposing system instructions and internal information. Next, we define the "Prompt Alignment Collapse under Adversarial Transfer (PACAT)" level to evaluate vulnerability to this adversarial transfer attack. We also propose a "Caution for Adversarial Transfer (CAT)" prompt to counter the Doppelganger method. The experimental results demonstrate that the Doppelganger method can compromise the agent's consistency and expose its internal information. In contrast, CAT prompts enable effective defense against this adversarial attack.
