CL, Jan 22

Universal Refusal Circuits Across LLMs: Cross-Model Transfer via Trajectory Replay and Concept-Basis Reconstruction

arXiv:2601.16034v2, h-index: 1

Originality: Incremental advance

AI Analysis

This work addresses cross-model safety alignment for AI developers, offering a method to transfer refusal interventions without target-side supervision. It is an incremental advance, since it builds on existing concepts of semantic circuits.

The paper tackles the transfer of refusal-behavior interventions across different LLMs by hypothesizing a universal, low-dimensional semantic circuit. In its evaluation, the framework attenuated refusal in 8 model pairs while maintaining performance, which the authors present as evidence for semantic universality.

Refusal behavior in aligned LLMs is often viewed as model-specific, yet we hypothesize it stems from a universal, low-dimensional semantic circuit shared across models. To test this, we introduce Trajectory Replay via Concept-Basis Reconstruction, a framework that transfers refusal interventions from donor to target models, spanning diverse architectures (e.g., Dense to MoE) and training regimes, without using target-side refusal supervision. By aligning layers via concept fingerprints and reconstructing refusal directions using a shared "recipe" of concept atoms, we map the donor's ablation trajectory into the target's semantic space. To preserve capabilities, we introduce a weight-SVD stability guard that projects interventions away from high-variance weight subspaces to prevent collateral damage. Our evaluation across 8 model pairs confirms that these transferred recipes consistently attenuate refusal while maintaining performance, providing strong evidence for the semantic universality of safety alignment.
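
The abstract does not give implementation details, so the following is only a rough sketch of how the two ideas it names could look in NumPy, not the authors' method. It assumes the "recipe" is a least-squares coefficient vector over concept atoms, and that "high-variance weight subspaces" means the span of the top right singular vectors of a weight matrix. All function names, shapes, and the `top_k` parameter are hypothetical.

```python
import numpy as np


def fit_recipe(donor_atoms, donor_direction):
    """Express the donor direction as a combination of donor concept atoms.

    donor_atoms: (k, d_donor) array, one concept atom per row (assumed given).
    donor_direction: (d_donor,) array.
    Returns k coefficients; least squares is an assumption of this sketch.
    """
    coeffs, *_ = np.linalg.lstsq(donor_atoms.T, donor_direction, rcond=None)
    return coeffs


def reconstruct_direction(target_atoms, coeffs):
    """Apply the same coefficients to the target's atoms for the same concepts."""
    direction = coeffs @ target_atoms  # shape (d_target,)
    return direction / np.linalg.norm(direction)


def stability_guard(direction, weight, top_k):
    """Project the direction away from the top-k right singular vectors of weight.

    weight: (m, d_target) array. Assumes the direction is not entirely inside
    the protected subspace, otherwise the final normalization divides by zero.
    """
    _, _, vt = np.linalg.svd(weight, full_matrices=False)
    protected = vt[:top_k]  # orthonormal rows, shape (top_k, d_target)
    guarded = direction - protected.T @ (protected @ direction)
    return guarded / np.linalg.norm(guarded)


def ablate(hidden, direction):
    """Remove the component of each hidden state along a unit direction."""
    return hidden - np.outer(hidden @ direction, direction)


# Toy data only: random arrays standing in for real activations and weights.
rng = np.random.default_rng(0)
k, d_donor, d_target = 4, 16, 12
donor_atoms = rng.normal(size=(k, d_donor))
target_atoms = rng.normal(size=(k, d_target))
donor_direction = rng.normal(size=k) @ donor_atoms

recipe = fit_recipe(donor_atoms, donor_direction)
target_direction = reconstruct_direction(target_atoms, recipe)
weight = rng.normal(size=(20, d_target))
guarded_direction = stability_guard(target_direction, weight, top_k=3)
ablated = ablate(rng.normal(size=(5, d_target)), guarded_direction)
```

In this toy setup the donor and target have different hidden sizes, and only the coefficient vector `recipe` crosses between them. This mirrors the abstract's claim that no target-side refusal supervision is needed, under the assumption that matching concept atoms are available in both models.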
