LGAIOct 20, 2025

Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth

arXiv:2510.18081v11 citationsh-index: 15Has Code
Originality Incremental advance
AI Analysis

This addresses a critical safety vulnerability in LLMs for users and developers, offering a defense against adversarial attacks without model retraining, though it is incremental as it builds on existing alignment mechanisms.

The paper tackles the problem of shallow safety alignment in Large Language Models (LLMs), where protection collapses during harmful continuations, and proposes Any-Depth Alignment (ADA) to unlock innate alignment for robust safety at any generation depth, achieving near-100% refusal rates against adversarial attacks and reducing attack success rates to below 3%.

Large Language Models (LLMs) exhibit strong but shallow alignment: they directly refuse harmful queries when a refusal is expected at the very start of an assistant turn, yet this protection collapses once a harmful continuation is underway (either through the adversarial attacks or via harmful assistant-prefill attacks). This raises a fundamental question: Can the innate shallow alignment in LLMs be unlocked to ensure safety at arbitrary generation depths? To achieve this goal, we propose Any-Depth Alignment (ADA), an effective inference-time defense with negligible overhead. ADA is built based on our observation that alignment is concentrated in the assistant header tokens through repeated use in shallow-refusal training, and these tokens possess the model's strong alignment priors. By reintroducing these tokens mid-stream, ADA induces the model to reassess harmfulness and recover refusals at any point in generation. Across diverse open-source model families (Llama, Gemma, Mistral, Qwen, DeepSeek, and gpt-oss), ADA achieves robust safety performance without requiring any changes to the base model's parameters. It secures a near-100% refusal rate against challenging adversarial prefill attacks ranging from dozens to thousands of tokens. Furthermore, ADA reduces the average success rate of prominent adversarial prompt attacks (such as GCG, AutoDAN, PAIR, and TAP) to below 3%. This is all accomplished while preserving utility on benign tasks with minimal over-refusal. ADA maintains this resilience even after the base model undergoes subsequent instruction tuning (benign or adversarial).

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes