CL | Mar 3

TAO-Attack: Toward Advanced Optimization-Based Jailbreak Attacks for Large Language Models

arXiv:2603.03081v1 | 2 citations | h-index: 15
Originality: Incremental advance
AI Analysis

This paper addresses the vulnerability of LLMs to safety bypasses, offering an incremental improvement over existing optimization-based jailbreak methods.

The paper tackles the problem of jailbreak attacks on large language models by proposing TAO-Attack, an optimization-based method that uses a two-stage loss function and direction-priority token optimization to reduce refusals and pseudo-harmful outputs, achieving higher attack success rates, including 100% in some cases.

Large language models (LLMs) have achieved remarkable success across diverse applications but remain vulnerable to jailbreak attacks, where attackers craft prompts that bypass safety alignment and elicit unsafe responses. Among existing approaches, optimization-based attacks have shown strong effectiveness, yet current methods often suffer from frequent refusals, pseudo-harmful outputs, and inefficient token-level updates. In this work, we propose TAO-Attack, a new optimization-based jailbreak method. TAO-Attack employs a two-stage loss function: the first stage suppresses refusals to ensure the model continues harmful prefixes, while the second stage penalizes pseudo-harmful outputs and encourages the model toward more harmful completions. In addition, we design a direction-priority token optimization (DPTO) strategy that improves efficiency by aligning candidates with the gradient direction before considering update magnitude. Extensive experiments on multiple LLMs demonstrate that TAO-Attack consistently outperforms state-of-the-art methods, achieving higher attack success rates and even reaching 100% in certain scenarios.

