CL, Oct 10, 2025

Decoupling Safety into Orthogonal Subspace: Cost-Efficient and Performance-Preserving Alignment for Large Language Models

arXiv:2510.09004v1, 1 citation, h-index: 26
Originality: Highly original
AI Analysis

This paper addresses the problem of balancing safety against general performance, offering AI developers a more efficient alternative to current, computationally expensive methods.

The paper tackles the challenge of enhancing safety alignment in large language models without degrading general performance, showing that LoRA-based refusal training enables performance-preserving safety alignment using only safety data, with LoRA serving as a cost-efficient and plug-and-play safety patch.

Safety alignment is essential for building trustworthy artificial intelligence, yet it remains challenging to enhance model safety without degrading general performance. Current approaches require computationally expensive searches for the optimal proportion of safety-critical and general-purpose data to balance safety and general performance, incurring high costs with limited gains. In this work, we show that LoRA-based Refusal-training enables performance-preserving safety alignment even when trained solely on safety data, demonstrating that LoRA serves as cost-efficient, performance-preserving, and plug-and-play safety patches. Beyond empirical findings, we provide both theoretical and experimental evidence that LoRA effectively decouples safety into a low-rank subspace largely orthogonal to the model's intrinsic transformation space, ensuring that safety enhancements do not interfere with inherent capabilities.
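As a rough illustration of the decoupling claim in the abstract, the toy sketch below applies a rank-1 LoRA-style update to a hand-picked 3x3 weight matrix. It is not the paper's method, data, or analysis. The matrices, the helper names (`matvec`, `outer_scaled`, `add`), and the scaling `alpha / r` are all assumptions chosen for illustration. The patch direction is deliberately constructed to be orthogonal to the inputs the base weight acts on, so the example only shows what orthogonality would imply; it does not show that trained LoRA weights have this property.

```python
# Toy illustration only: hand-picked weights and a rank-1 LoRA-style patch.
# Every name and value here is hypothetical, not taken from the paper.

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def outer_scaled(b, a, scale):
    """Rank-1 update: scale * b a^T."""
    return [[scale * bi * aj for aj in a] for bi in b]

def add(M, N):
    """Elementwise sum of two matrices of the same shape."""
    return [[m + n for m, n in zip(rm, rn)] for rm, rn in zip(M, N)]

# Base weight W only acts on the first two input coordinates.
W = [[2.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0]]

# Rank-1 factors: a reads the third input coordinate, b writes the third output.
b = [0.0, 0.0, 1.0]
a = [0.0, 0.0, 1.0]
alpha, r = 1.0, 1                      # assumed LoRA scaling alpha / r
delta = outer_scaled(b, a, alpha / r)  # the "safety patch"
W_patched = add(W, delta)              # plugging the patch in

x_general = [1.0, 2.0, 0.0]  # input in the subspace W uses
x_safety = [0.0, 0.0, 1.0]   # input along the direction the patch reads

print(matvec(W, x_general), matvec(W_patched, x_general))
print(matvec(W, x_safety), matvec(W_patched, x_safety))

# Unplugging the patch (subtracting delta) recovers the base weights.
W_restored = add(W_patched, outer_scaled(b, a, -alpha / r))
```

In this constructed case the patched model gives the same output as the base model on `x_general` and differs only on `x_safety`, and subtracting the update restores `W` exactly. That is the behaviour the abstract attributes to a safety update living in a low-rank subspace largely orthogonal to the model's intrinsic transformation space.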
