Well, that escalated quickly: The Single-Turn Crescendo Attack (STCA)
The STCA reveals vulnerabilities in current LLMs and underscores the need for stronger safeguards in responsible AI, but the contribution is incremental because it builds on an existing multi-turn attack method.
The paper tackles the problem of adversarial attacks on large language models by introducing the Single-Turn Crescendo Attack (STCA), which provokes harmful responses in a single interaction, bypassing typical moderation filters.
This paper introduces a new method for adversarial attacks on large language models (LLMs) called the Single-Turn Crescendo Attack (STCA). The STCA builds on the multi-turn crescendo attack introduced by Russinovich, Salem, and Eldan (2024), which gradually escalates the context to provoke harmful responses, and achieves similar outcomes in a single interaction. By condensing the escalation into a single, well-crafted prompt, the STCA bypasses the typical moderation filters that LLMs use to prevent inappropriate outputs. This technique reveals vulnerabilities in current LLMs and emphasizes the importance of stronger safeguards in responsible AI (RAI). The STCA is a method that has not been previously explored.