CL · Jan 22

Universal Refusal Circuits Across LLMs: Cross-Model Transfer via Trajectory Replay and Concept-Basis Reconstruction

arXiv:2601.16034v2 · h-index: 1
Originality: Incremental advance
AI Analysis

This work addresses cross-model safety alignment for AI developers, offering a method to transfer refusal interventions without target-side supervision. It is an incremental advance, building on existing concepts of semantic circuits.

The paper tackles the problem of transferring refusal interventions across different LLMs by hypothesizing a universal, low-dimensional semantic circuit. Its framework attenuated refusal in 8 model pairs while maintaining performance, which the authors present as evidence for semantic universality.

Refusal behavior in aligned LLMs is often viewed as model-specific, yet we hypothesize it stems from a universal, low-dimensional semantic circuit shared across models. To test this, we introduce Trajectory Replay via Concept-Basis Reconstruction, a framework that transfers refusal interventions from donor to target models, spanning diverse architectures (e.g., Dense to MoE) and training regimes, without using target-side refusal supervision. By aligning layers via concept fingerprints and reconstructing refusal directions using a shared "recipe" of concept atoms, we map the donor's ablation trajectory into the target's semantic space. To preserve capabilities, we introduce a weight-SVD stability guard that projects interventions away from high-variance weight subspaces to prevent collateral damage. Our evaluation across 8 model pairs confirms that these transferred recipes consistently attenuate refusal while maintaining performance, providing strong evidence for the semantic universality of safety alignment.
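
The abstract describes the pipeline only at a high level, so the following is a rough sketch of one way its pieces could fit together, not the authors' implementation. It assumes the "recipe" is a set of least-squares coefficients over concept-atom vectors, that "high-variance weight subspaces" means the top left singular vectors of a weight matrix, and that the intervention is a directional ablation of hidden states. All function names, shapes, and the random data are hypothetical.

```python
import numpy as np


def fit_recipe(donor_atoms, donor_direction):
    """Express the donor refusal direction as coefficients over donor concept atoms.

    donor_atoms: (k, d_donor) array, one concept-atom vector per row (assumed layout).
    Returns a length-k coefficient vector (the hypothetical "recipe").
    """
    coeffs, *_ = np.linalg.lstsq(donor_atoms.T, donor_direction, rcond=None)
    return coeffs


def reconstruct_direction(target_atoms, coeffs):
    """Apply the same recipe to the target model's atoms for the same concepts."""
    direction = target_atoms.T @ coeffs
    return direction / np.linalg.norm(direction)


def stability_guard(direction, weight, top_k):
    """Project the direction away from the top_k left singular vectors of weight.

    Treating the top singular subspace as the "high-variance" subspace is an
    assumption made for this sketch.
    """
    u, _, _ = np.linalg.svd(weight, full_matrices=False)
    protected = u[:, :top_k]  # orthonormal basis of the protected subspace
    guarded = direction - protected @ (protected.T @ direction)
    return guarded / np.linalg.norm(guarded)


def ablate(hidden, direction):
    """Remove the component along a unit direction from each hidden-state row."""
    return hidden - np.outer(hidden @ direction, direction)


# Toy data: 4 shared concepts, donor width 16, target width 24 (all made up).
rng = np.random.default_rng(0)
k, d_donor, d_target = 4, 16, 24
donor_atoms = rng.normal(size=(k, d_donor))
target_atoms = rng.normal(size=(k, d_target))
true_coeffs = np.array([0.5, -1.0, 0.25, 2.0])
donor_direction = donor_atoms.T @ true_coeffs

coeffs = fit_recipe(donor_atoms, donor_direction)
target_direction = reconstruct_direction(target_atoms, coeffs)

weight = rng.normal(size=(d_target, d_target))  # stand-in for a target weight matrix
guarded = stability_guard(target_direction, weight, top_k=3)

hidden = rng.normal(size=(5, d_target))  # stand-in for target hidden states
ablated = ablate(hidden, guarded)
```

No refusal-labelled data from the target model enters this sketch: the target side contributes only its concept atoms and a weight matrix, which mirrors the "without target-side refusal supervision" claim. Layer alignment via concept fingerprints is omitted.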
