Universal Refusal Circuits Across LLMs: Cross-Model Transfer via Trajectory Replay and Concept-Basis Reconstruction
This work addresses cross-model safety alignment for AI developers by offering a method to transfer refusal interventions without target-side supervision, though the contribution is incremental because it builds on existing concepts of semantic circuits.
The paper tackles the problem of transferring refusal-behavior interventions across different LLMs by hypothesizing a universal, low-dimensional semantic circuit. Its framework attenuated refusal in 8 model pairs while maintaining performance, which the authors present as evidence for semantic universality.
Refusal behavior in aligned LLMs is often viewed as model-specific, yet we hypothesize it stems from a universal, low-dimensional semantic circuit shared across models. To test this, we introduce Trajectory Replay via Concept-Basis Reconstruction, a framework that transfers refusal interventions from donor to target models, spanning diverse architectures (e.g., Dense to MoE) and training regimes, without using target-side refusal supervision. By aligning layers via concept fingerprints and reconstructing refusal directions using a shared "recipe" of concept atoms, we map the donor's ablation trajectory into the target's semantic space. To preserve capabilities, we introduce a weight-SVD stability guard that projects interventions away from high-variance weight subspaces to prevent collateral damage. Our evaluation across 8 model pairs confirms that these transferred recipes consistently attenuate refusal while maintaining performance, providing strong evidence for the semantic universality of safety alignment.
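The abstract gives no implementation detail, so the following is only an illustrative sketch of how a concept-basis "recipe" and a weight-SVD guard might fit together for a single layer. Everything in it is an assumption rather than the paper's method: the function names (`fit_recipe`, `reconstruct_direction`, `stability_guard`, `ablate`), the use of least squares to obtain the recipe, the reading of "high-variance weight subspaces" as the top right singular vectors of a weight matrix, and the random toy data. Layer alignment via concept fingerprints is not modeled here.

```python
import numpy as np

def fit_recipe(donor_atoms, donor_direction):
    """Assumed recipe: least-squares coefficients over the donor's concept atoms.

    donor_atoms has shape (k, d_donor); we solve donor_atoms.T @ coeffs ~ donor_direction.
    """
    coeffs, *_ = np.linalg.lstsq(donor_atoms.T, donor_direction, rcond=None)
    return coeffs

def reconstruct_direction(target_atoms, coeffs):
    """Apply the same coefficients to the target's atoms (shape (k, d_target))."""
    direction = target_atoms.T @ coeffs
    return direction / np.linalg.norm(direction)

def stability_guard(direction, weight, top_k):
    """Remove the component of `direction` lying in the top_k right singular
    subspace of `weight` (shape (d_out, d_target)), then renormalize.

    Treating the top singular directions as the subspace to protect is an
    assumption made for this sketch.
    """
    _, _, vt = np.linalg.svd(weight, full_matrices=False)
    protected = vt[:top_k]  # orthonormal rows, shape (top_k, d_target)
    guarded = direction - protected.T @ (protected @ direction)
    return guarded / np.linalg.norm(guarded)

def ablate(hidden, direction):
    """Project a unit direction out of each row of `hidden` (shape (n, d_target))."""
    return hidden - np.outer(hidden @ direction, direction)

# Toy data only: random matrices stand in for concept atoms, weights, and activations.
rng = np.random.default_rng(0)
k, d_donor, d_target = 4, 16, 24
donor_atoms = rng.normal(size=(k, d_donor))
target_atoms = rng.normal(size=(k, d_target))
donor_direction = donor_atoms.T @ rng.normal(size=k)  # lies in the donor atom span

coeffs = fit_recipe(donor_atoms, donor_direction)
direction = reconstruct_direction(target_atoms, coeffs)

weight = rng.normal(size=(32, d_target))
guarded = stability_guard(direction, weight, top_k=3)

hidden = rng.normal(size=(5, d_target))
ablated = ablate(hidden, guarded)
```

In this sketch no target-side refusal data is used: the target direction comes entirely from the donor's coefficients and the target's own concept atoms. The guard trades some fidelity to the reconstructed direction for orthogonality to the protected subspace, and `top_k` is a hypothetical knob controlling that trade-off.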