AI · CR · May 22

Safety-Oriented Routing Analysis of Mixtral MoE Under Benign and Harmful Prompts

arXiv:2605.242702.1
Predicted impact: top 99% in AI · last 90 days
Originality: Synthesis-oriented
AI Analysis

For AI safety researchers, this provides an empirical analysis of how router behavior in MoE models relates to safety, though findings are incremental and specific to Mixtral.

This paper analyzes routing behavior in Mixtral 8x7B-Instruct under benign and harmful prompts, finding that safety-relevant routing is subtle, depth-dependent, and distributed. Expert-suppression interventions reduced restricted responses from 24 to 14 (activation-based) and from 34 to 22 (gradient-based).
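To put the two intervention results on a common scale, the counts can be converted into absolute and relative reductions. The sketch below is only arithmetic on the reported numbers; the helper name `reduction` is hypothetical, and it assumes both conditions were scored over the same 100 prompts (the abstract states this explicitly only for the activation-based case).

```python
def reduction(before, after, total=100):
    """Return (absolute drop in percentage points, relative drop in percent).

    `total` is the number of prompts; 100 is assumed for both conditions.
    """
    absolute = 100 * (before - after) / total
    relative = 100 * (before - after) / before
    return absolute, relative


# Restricted-response counts as reported in the summary above.
activation_abs, activation_rel = reduction(24, 14)
gradient_abs, gradient_rel = reduction(34, 22)

print(f"activation-based: -{activation_abs:.0f} points, -{activation_rel:.1f}% relative")
print(f"gradient-based:   -{gradient_abs:.0f} points, -{gradient_rel:.1f}% relative")
```

Under that assumption the gradient-based suppression removes more restricted responses in absolute terms, while the activation-based suppression removes a larger share of its own baseline. The two baselines differ (24 versus 34), so the comparison should be read with that in mind.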

Sparse mixture-of-experts (MoE) language models activate only a small subset of parameters for each token, making router behavior a central part of model computation. This paper studies the routing behavior of Mixtral 8x7B-Instruct under benign and harmful prompts using two complementary signals: activation-based routing scores derived from expert selection frequencies and gradient-based scores derived from router-gate sensitivities. We analyze expert- and layer-level routing behavior and conduct expert-suppression interventions. The results show that activation-based expert usage is broad and long-tailed, whereas gradient-based importance is concentrated. At the expert level, the benign and harmful prompt groups remain close under both signals, with only modest separation. At the layer level, activation-based routing is most selective around layers 8-15, while gradient-based importance is concentrated in the final layers. Expert classification shows that most experts are shared across benign and harmful prompts, though a limited subset shows a clear group preference. Top-ranked expert sets show stronger benign-malicious overlap under gradient scores than under activation scores, suggesting concentration on a common late-layer expert set. In intervention experiments, suppressing the top five benign-dominant experts identified from activation scores reduces restricted responses from 24 to 14 over 100 prompts, while suppressing gradient-derived experts reduces them from 34 to 22 with fewer unintended reversals. Overall, safety-relevant routing in Mixtral is subtle, depth-dependent, and distributed rather than dominated by a fixed set of experts.
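The activation-based signal and the suppression intervention described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the toy router logits, the 4-expert layer, and the use of a frequency difference as the "benign-dominant" ranking are all assumptions made for illustration. The only architectural fact relied on is that Mixtral routes each token to the top 2 experts per layer (out of 8 in the real model). Suppression is modeled here by setting an expert's router logit to negative infinity before top-k selection.

```python
import math
from collections import Counter


def top_k_experts(logits, k=2):
    """Indices of the k largest router logits for one token."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def selection_frequency(token_logits, num_experts, k=2, suppressed=()):
    """Fraction of tokens routed to each expert at a single layer.

    Experts in `suppressed` get a logit of -inf, so top-k never selects them.
    """
    counts = Counter()
    for logits in token_logits:
        masked = [-math.inf if e in suppressed else x for e, x in enumerate(logits)]
        counts.update(top_k_experts(masked, k))
    n = len(token_logits)
    return [counts[e] / n for e in range(num_experts)]


def benign_dominant(benign_freq, harmful_freq, top_n=1):
    """Rank experts by (benign frequency - harmful frequency), largest first.

    Assumed scoring rule; the paper's exact ranking criterion is not given here.
    """
    diff = [b - h for b, h in zip(benign_freq, harmful_freq)]
    return sorted(range(len(diff)), key=lambda e: diff[e], reverse=True)[:top_n]


# Toy router logits (hypothetical): 3 tokens per group, 4 experts in one layer.
benign_logits = [[2.0, 1.0, 0.1, -1.0], [1.5, 0.2, 1.0, -0.5], [0.3, 2.0, 1.0, 0.0]]
harmful_logits = [[0.1, 1.0, 2.0, -1.0], [1.5, 0.2, 1.0, 0.5], [0.3, 0.1, 1.0, 2.0]]

benign_freq = selection_frequency(benign_logits, num_experts=4)
harmful_freq = selection_frequency(harmful_logits, num_experts=4)
targets = benign_dominant(benign_freq, harmful_freq, top_n=1)

# Re-route the benign tokens with the selected expert suppressed.
suppressed_freq = selection_frequency(benign_logits, num_experts=4, suppressed=targets)

print("benign-dominant experts:", targets)
print("benign frequencies after suppression:", suppressed_freq)
```

In a real experiment the logits would come from the model's router gates at every layer, and the suppressed expert's traffic would be redistributed to the next-ranked experts exactly as the masked top-k does here. A gradient-based score would replace the frequency difference with a sensitivity of the output to each router gate, which this sketch does not attempt.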
