LG Nov 15, 2025

Rethinking Deep Alignment Through The Lens Of Incomplete Learning

Thong Bach, Dung Nguyen, Thao Minh Le, Truyen Tran

arXiv:2511.12155v14.1 h-index: 9

Originality Highly original

AI Analysis

This work addresses a fundamental limitation in safety alignment methodologies for large language models and offers practical solutions to improve adversarial robustness.

The paper addresses the systematic vulnerability of large language models to adversarial attacks despite safety alignment. It shows that position-dependent gradient weakening leaves safety learning incomplete, producing undertrained regions later in the response, and it introduces a targeted completion method that reduces attack success rates by 48-98% while preserving general capabilities.

Large language models exhibit systematic vulnerabilities to adversarial attacks despite extensive safety alignment. We provide a mechanistic analysis revealing that position-dependent gradient weakening during autoregressive training creates signal decay, leading to incomplete safety learning in which safety training fails to fully transform model preferences in later response regions. We introduce base-favored tokens (vocabulary elements to which base models assign higher probability than aligned models) as computational indicators of incomplete safety learning, and we develop a targeted completion method that addresses undertrained regions through adaptive penalties and hybrid teacher distillation. Experimental evaluation across the Llama and Qwen model families demonstrates dramatic improvements in adversarial robustness, with 48-98% reductions in attack success rates while preserving general capabilities. These results establish both a mechanistic understanding of, and practical solutions for, fundamental limitations in safety alignment methodologies.
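
The definition of base-favored tokens can be illustrated with a small sketch. It is not the authors' implementation: the function name `base_favored_tokens`, the `margin` parameter, and the toy logits are all assumptions made for illustration, and the paper may condition the comparison on response position or apply a different threshold. The sketch compares the next-token distributions of a base and an aligned model at a single position and returns the vocabulary indices to which the base model assigns higher probability.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def base_favored_tokens(base_logits, aligned_logits, margin=0.0):
    """Return vocabulary indices where the base model's probability
    exceeds the aligned model's probability by more than `margin`.

    `margin` is a hypothetical tolerance, not a parameter from the paper.
    """
    p_base = softmax(base_logits)
    p_aligned = softmax(aligned_logits)
    return [
        i
        for i, (pb, pa) in enumerate(zip(p_base, p_aligned))
        if pb - pa > margin
    ]


# Toy next-token logits over a 4-token vocabulary (made-up numbers).
base = [2.0, 0.5, 0.1, -1.0]
aligned = [0.5, 2.0, 0.1, -1.0]

favored = base_favored_tokens(base, aligned, margin=0.01)
print(favored)
```

On these toy logits only token 0 is flagged, since it is the only token the base model prefers more strongly than the aligned model. With real models, the two logit vectors would come from running the base and aligned checkpoints on the same prefix.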
