SteerMoE
Steering MoE LLMs via Expert (De)Activation
Mixture-of-experts routing · first seen Sep 11, 2025
superseded — cited as a baseline and beaten by newer methods
3 papers critique it · 3 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites SteerMoE as a baseline.
“However, these approaches rely on observational analysis rather than proactive search: they depend on predefined unsafe/jailbreak datasets and are therefore constrained by the coverage of those sets. As a result, they typically reveal only modest shifts in harmful outputs while requiring prior data.”
— Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs

“It relies on a frequency-based analysis, assigning a Risk Difference (RD) score to each expert based on activation rate differences between prompt sets representing faithful and unfaithful responses.”
— MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks

“SteerMoE suppresses unsafe experts at inference time by modifying routing logits, but does not update expert parameters or repair unsafe representations.”
— RASA: Routing-Aware Safety Alignment for Mixture-of-Experts Models
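
The two mechanisms these quotes attribute to SteerMoE, frequency-based Risk Difference scoring and suppression of experts through the routing logits, can be illustrated with a minimal sketch. This is not the SteerMoE implementation: the function names, the sign convention (rate on the first prompt set minus rate on the second), the 0.5 threshold, the use of negative infinity to mask a logit, and the toy top-2 routing data are all assumptions made for illustration.

```python
import math
from collections import Counter

def activation_rates(routes, num_experts):
    """Fraction of tokens that routed to each expert.

    routes: one list of selected expert indices per token (hypothetical format).
    """
    counts = Counter(e for token in routes for e in token)
    total = len(routes)
    return [counts[e] / total if total else 0.0 for e in range(num_experts)]

def risk_difference(routes_a, routes_b, num_experts):
    """Per-expert activation rate on set A minus the rate on set B (assumed sign)."""
    rate_a = activation_rates(routes_a, num_experts)
    rate_b = activation_rates(routes_b, num_experts)
    return [a - b for a, b in zip(rate_a, rate_b)]

def suppress_experts(logits, blocked):
    """Mask blocked experts so top-k selection can never pick them (assumed mechanism)."""
    return [-math.inf if i in blocked else x for i, x in enumerate(logits)]

def top_k(logits, k):
    """Indices of the k largest routing logits."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

# Toy data: 4 experts, top-2 routing, 4 tokens per prompt set.
unsafe_routes = [[0, 1], [0, 2], [0, 1], [0, 3]]
safe_routes = [[1, 2], [2, 3], [1, 2], [0, 2]]

rd = risk_difference(unsafe_routes, safe_routes, num_experts=4)
blocked = {i for i, score in enumerate(rd) if score > 0.5}  # threshold is an assumption

router_logits = [2.0, 1.0, 0.5, 0.1]
print(rd, blocked)
print(top_k(router_logits, 2), top_k(suppress_experts(router_logits, blocked), 2))
```

Only the router's selection changes in this sketch; the expert weights are untouched, which is the limitation the RASA quote points at.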
Beaten on benchmarks
Head-to-head results where a newer method reports beating SteerMoE. Values are copied from the source paper's tables — verify against the cited paper.
- MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks
  MASCing beats SteerMoE · Success rate (%), MASCing vs SteerMoE
  - DeepSeek-MoE-16B-Chat: 84.9 vs 59.7
  - GPT-OSS-20B: 87.9 vs 61.3
  - Hunyuan-A13B-Instruct: 80.4 vs 51.4
  - Mixtral-8x7B-Instruct-v0.1: 77.1 vs 66.8
  - Phi-3.5-MoE-Instruct: 80.6 vs 59.0
  - Qwen1.5-MoE-A2.7B-Chat: 87.2 vs 57.9
  - Qwen3-30B-A3B-Instruct-2507: 89.2 vs 52.6
  - Average: 83.9 vs 58.4
- Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs
  F-SOUR beats SteerMoE · ASR, F-SOUR vs SteerMoE
  - JailbreakBench: 0.90 vs 0.50
  - AdvBench: 0.98 vs 0.55
- RASA: Routing-Aware Safety Alignment for Mixture-of-Experts Models
  RASA beats SteerMoE · Harmlessness, RASA vs SteerMoE
  - OLMoE, FlipAttack: 1.00 vs 0.50
  - OLMoE, DeepInception: 1.00 vs 0.13
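
As a quick consistency check on the MASCing rows, the short script below recomputes the reported average from the seven per-model success rates and lists the per-model margins. The numbers are copied from the list above; the variable names are illustrative only, and the script says nothing about how the source paper computed its own average.

```python
# Success rate (%) per model as (MASCing, SteerMoE), copied from the list above.
rows = {
    "DeepSeek-MoE-16B-Chat": (84.9, 59.7),
    "GPT-OSS-20B": (87.9, 61.3),
    "Hunyuan-A13B-Instruct": (80.4, 51.4),
    "Mixtral-8x7B-Instruct-v0.1": (77.1, 66.8),
    "Phi-3.5-MoE-Instruct": (80.6, 59.0),
    "Qwen1.5-MoE-A2.7B-Chat": (87.2, 57.9),
    "Qwen3-30B-A3B-Instruct-2507": (89.2, 52.6),
}

def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    values = list(values)
    return sum(values) / len(values)

# Unweighted averages over the seven models, rounded to one decimal place.
mascing_avg = round(mean(m for m, _ in rows.values()), 1)
steermoe_avg = round(mean(s for _, s in rows.values()), 1)

# Margin in percentage points for each model (MASCing minus SteerMoE).
margins = {model: round(m - s, 1) for model, (m, s) in rows.items()}
smallest = min(margins, key=margins.get)
largest = max(margins, key=margins.get)

print(mascing_avg, steermoe_avg)
print(smallest, margins[smallest])
print(largest, margins[largest])
```

The unweighted means match the Average row (83.9 vs 58.4), so that row appears to be a plain mean over the seven listed models.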
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- PADD: Path-Aligned Decompression Distillation for Non-Router Teacher to Guide MoE Student Learning · Jun 9, 2026
- May 30, 2026
- May 29, 2026
- May 1, 2026
- Apr 30, 2026
- Feb 9, 2026
- SocialNav-MoE: A Mixture-of-Experts Vision Language Model for Socially Compliant Navigation with Reinforcement Fine-Tuning · Dec 15, 2025
- OrdMoE: Preference Alignment via Hierarchical Expert Group Ranking in Multimodal Mixture-of-Experts LLMs · Nov 24, 2025
- Towards Stable and Effective Reinforcement Learning for Mixture-of-Experts (router-aware approach to optimize importance sampling weights) · Oct 27, 2025
- Mix- and MoE-DPO: A Variational Inference Approach to Direct Preference Optimization · Oct 9, 2025