CL · LG · Jun 2

Expert-Aware Refusal Steering

Anna C. Marbut, Daniel R. Olson, Travis J. Wheeler

arXiv:2606.04160 · 94.2 · Has Code

Predicted impact: top 15% in CL · last 90 days

Originality: Incremental advance

AI Analysis

For researchers working on safety alignment in MoE LLMs, this work provides expert-aware steering methods whose results suggest a substantial role for attention in refusal behavior, though the improvements over existing methods are incremental.

The paper extends refusal steering methods to three open-source Mixture-of-Experts (MoE) LLMs, showing that steering performance is not hindered by MoE routing complexity. It proposes two expert-aware methods that leverage refusal-specific expert routing patterns and expert-specific steering directions, and finds that refusal can be effectively suppressed by steering on the output of a single expert.

Safety alignment in instruction-tuned large language models (LLMs) depends on a model's ability to reliably refuse to respond to harmful or disallowed requests. Recent work has shown that a steering vector can be applied to a dense LLM during inference to effectively suppress refusal behavior, inducing response to harmful requests. We extend this refusal steering method to three open-source Mixture-of-Experts (MoE) LLMs and find that steering performance is uninhibited by the complex routing patterns inherent to the MoE architecture. We then propose two expert-aware refusal steering methods that leverage refusal-specific expert routing patterns and expert-specific steering directions to suppress normal refusal behavior. We find that refusal behavior can be effectively steered based on the output of a single expert. Our results show that refusal signals captured by steering methods differ from expert routing behavior, suggesting a substantial role for attention in MoE refusal behavior.
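
The abstract does not spell out how the steering vector is computed or applied. As a rough illustration only, the sketch below assumes one common formulation from the steering literature: take the difference between mean activations on harmful and harmless prompts as the direction, then remove an activation's component along that direction at inference time. The function names (`steering_direction`, `ablate`) and the toy three-dimensional activations are hypothetical and are not taken from the paper or its code.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def steering_direction(harmful_acts, harmless_acts):
    """Assumed formulation: normalized difference of mean activations."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    return unit([a - b for a, b in zip(mu_harmful, mu_harmless)])

def ablate(activation, direction):
    """Remove the component of `activation` along the unit vector `direction`."""
    coeff = sum(a * d for a, d in zip(activation, direction))
    return [a - coeff * d for a, d in zip(activation, direction)]

# Toy activations (hypothetical): the two prompt sets differ only on axis 0.
harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

direction = steering_direction(harmful, harmless)
steered = ablate([3.0, 1.0, 5.0], direction)
```

In this toy setup the direction lies along the first axis, so ablation zeroes that coordinate and leaves the others unchanged. Under the same assumption, an expert-aware variant would apply `ablate` to a single expert's output rather than to the full hidden state, but the paper's actual procedure may differ.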
