Fiddler
Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
Mixture-of-experts routing · first seen Feb 10, 2024
heavily superseded — a standard baseline that newer methods routinely beat
7 papers critique it · 2 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites Fiddler as a baseline.
“Fiddler dynamically places experts across CPU and GPU, yet lacks precise scheduling for hot experts (those processing more tokens).”
— PreScope: Unleashing the Power of Prefetching for Resource-Constrained MoE Inference
“For example, Fiddler uses fixed mapping based on expert activation frequency for CPU-GPU scheduling, which fails to adapt to changing loads.”
— HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference
“Additionally, approaches such as Fiddler leverage CPU for additional compute power, but do not fully explore the characteristics of MoE models.”
— MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache
“Fiddler reduces PCIe traffic by executing some expert computation on the CPU, but its gains are contingent on CPU capability and diminish as per-expert token counts grow, where CPU execution becomes slow and weight transfers to GPU become preferable.”
— MELINOE: Fine-Tuning Enables Memory-Efficient Inference for Mixture-of-Experts Models
“While Fiddler and DAOP aim for DRAM-offloading-based inference, their CPU-based computation cannot be fully utilized due to memory bottlenecks.”
— FlashMoE: Reducing SSD I/O Bottlenecks via ML-Based Cache Replacement for Mixture-of-Experts Inference on Edge Devices
“we use Fiddler as a CPU computation baseline, where expert placements remain static during decoding”
— TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload
“Heterogeneous strategies like Fiddler offload certain computations to the CPU but encounter compute-bound bottlenecks during dequantization, leading to latency penalties that outweigh transmission savings”
— DyMoE: Dynamic Expert Orchestration with Mixed-Precision Quantization for Efficient MoE Inference on Edge
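The trade-off named in the MELINOE quote above (CPU execution wins for few tokens per expert, weight transfer to the GPU wins as the count grows) can be illustrated with a toy cost model. This is a minimal sketch, not Fiddler's implementation: the function names and every latency constant below are hypothetical placeholders chosen only to show the shape of the crossover.

```python
# Toy cost model for where to run one expert whose weights sit in CPU memory.
# All constants are hypothetical illustration values, not measurements.

def cpu_latency_ms(tokens: int, cpu_ms_per_token: float) -> float:
    """Run the expert on the CPU: no weight transfer, slower per-token compute."""
    return tokens * cpu_ms_per_token


def gpu_latency_ms(tokens: int, transfer_ms: float, gpu_ms_per_token: float) -> float:
    """Copy the expert weights over PCIe once, then compute on the GPU."""
    return transfer_ms + tokens * gpu_ms_per_token


def choose_device(
    tokens: int,
    cpu_ms_per_token: float = 2.0,   # assumed CPU cost per token
    transfer_ms: float = 10.0,       # assumed one-off weight transfer cost
    gpu_ms_per_token: float = 0.05,  # assumed GPU cost per token
) -> str:
    """Pick the cheaper option for this expert under the toy model."""
    cpu = cpu_latency_ms(tokens, cpu_ms_per_token)
    gpu = gpu_latency_ms(tokens, transfer_ms, gpu_ms_per_token)
    return "cpu" if cpu <= gpu else "gpu"


def break_even_tokens(
    cpu_ms_per_token: float = 2.0,
    transfer_ms: float = 10.0,
    gpu_ms_per_token: float = 0.05,
) -> float:
    """Token count at which both options cost the same under the toy model."""
    if cpu_ms_per_token <= gpu_ms_per_token:
        return float("inf")  # the CPU never loses in this model
    return transfer_ms / (cpu_ms_per_token - gpu_ms_per_token)


if __name__ == "__main__":
    for n in (1, 4, 8, 64):
        print(n, choose_device(n))
    print("break-even tokens:", break_even_tokens())
```

Under these assumed constants the CPU path is cheaper only for a handful of tokens per expert. Where the real crossover lies depends on the CPU, the PCIe link, and the expert size, which is the dependence on CPU capability that the quote points to.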
Beaten on benchmarks
Head-to-head results where a newer method reports beating Fiddler. Values are copied from the source paper's tables — verify against the cited paper.
- MELINOE: Fine-Tuning Enables Memory-Efficient Inference for Mixture-of-Experts Models
MELINOE (Fine-Tune: Dolly15K) beats Fiddler · throughput [Eval: Dolly15K, Phi-3.5-MoE]
14.34 vs 5.88
- MELINOE: Fine-Tuning Enables Memory-Efficient Inference for Mixture-of-Experts Models
MELINOE (Fine-Tune: Dolly15K) beats Fiddler · throughput [Eval: Dolly15K, Mixtral-8x7B]
9.35 vs 5.24
- MELINOE: Fine-Tuning Enables Memory-Efficient Inference for Mixture-of-Experts Models
MELINOE (Fine-Tune: GSM8K) beats Fiddler · throughput [Eval: GSM8K, Phi-3.5-MoE]
15.67 vs 7.26
- MELINOE: Fine-Tuning Enables Memory-Efficient Inference for Mixture-of-Experts Models
MELINOE (Fine-Tune: GSM8K) beats Fiddler · throughput [Eval: GSM8K, Mixtral-8x7B]
10.38 vs 4.11
- TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload
TIDE beats Fiddler · Throughput (token/s) [LLaDA2.0-mini Gen Length 256 GPU Expert Budget 64 GPU Memory Constraint 10GB]
2.11 vs 1.81
- TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload
TIDE beats Fiddler · Throughput (token/s) [LLaDA2.0-mini Gen Length 256 GPU Expert Budget 128 GPU Memory Constraint 18GB]
2.36 vs 1.74
- TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload
TIDE beats Fiddler · Throughput (token/s) [LLaDA2.0-mini Gen Length 1024 GPU Expert Budget 64 GPU Memory Constraint 10GB]
1.89 vs 1.79
- TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload
TIDE beats Fiddler · Throughput (token/s) [LLaDA2.0-mini Gen Length 1024 GPU Expert Budget 128 GPU Memory Constraint 18GB]
2.44 vs 1.80
- TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload
TIDE beats Fiddler · Throughput (token/s) [LLaDA2.0-flash Gen Length 256 GPU Expert Budget 32 GPU Memory Constraint 30GB]
1.25 vs 0.95
- TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload
TIDE beats Fiddler · Throughput (token/s) [LLaDA2.0-flash Gen Length 256 GPU Expert Budget 64 GPU Memory Constraint 55GB]
1.73 vs 1.14
- TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload
TIDE beats Fiddler · Throughput (token/s) [LLaDA2.0-flash Gen Length 1024 GPU Expert Budget 32 GPU Memory Constraint 30GB]
1.24 vs 0.89
- TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload
TIDE beats Fiddler · Throughput (token/s) [LLaDA2.0-flash Gen Length 1024 GPU Expert Budget 64 GPU Memory Constraint 55GB]
1.45 vs 1.05
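To compare these head-to-head entries on one scale, the throughput pairs can be turned into speedup ratios (newer method divided by Fiddler). The sketch below copies the values listed above. The short labels are made up for readability, and the ratios inherit whatever caveats apply to the source tables.

```python
# (newer method throughput, Fiddler throughput), copied from the entries above.
results = {
    "MELINOE / Dolly15K / Phi-3.5-MoE": (14.34, 5.88),
    "MELINOE / Dolly15K / Mixtral-8x7B": (9.35, 5.24),
    "MELINOE / GSM8K / Phi-3.5-MoE": (15.67, 7.26),
    "MELINOE / GSM8K / Mixtral-8x7B": (10.38, 4.11),
    "TIDE / LLaDA2.0-mini / len 256 / budget 64": (2.11, 1.81),
    "TIDE / LLaDA2.0-mini / len 256 / budget 128": (2.36, 1.74),
    "TIDE / LLaDA2.0-mini / len 1024 / budget 64": (1.89, 1.79),
    "TIDE / LLaDA2.0-mini / len 1024 / budget 128": (2.44, 1.80),
    "TIDE / LLaDA2.0-flash / len 256 / budget 32": (1.25, 0.95),
    "TIDE / LLaDA2.0-flash / len 256 / budget 64": (1.73, 1.14),
    "TIDE / LLaDA2.0-flash / len 1024 / budget 32": (1.24, 0.89),
    "TIDE / LLaDA2.0-flash / len 1024 / budget 64": (1.45, 1.05),
}


def speedup(newer: float, baseline: float) -> float:
    """Ratio of the newer method's throughput to the baseline's."""
    return newer / baseline


speedups = {name: speedup(new, base) for name, (new, base) in results.items()}

if __name__ == "__main__":
    # Print the largest reported gains first.
    for name, ratio in sorted(speedups.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{ratio:.2f}x  {name}")
```

Ratios are only comparable within one paper's setup: the MELINOE and TIDE rows use different models, hardware budgets, and possibly different throughput units.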
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- Jun 4, 2026
- May 19, 2026
- CoX-MoE · CoX-MoE: Coalesced Expert Execution for High-Throughput MoE Inference with AMX-Enabled CPU-GPU Co-Execution · May 18, 2026
- HodgeCover · HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts · May 13, 2026
- dynamic expert replication strategy · Fast MoE Inference via Predictive Prefetching and Expert Replication · May 12, 2026
- Apr 22, 2026
- Apr 12, 2026
- Alloc-MoE · Alloc-MoE: Budget-Aware Expert Activation Allocation for Efficient Mixture-of-Experts Inference · Apr 9, 2026
- Mar 19, 2026
- Mar 13, 2026
- Mar 12, 2026
- Mar 6, 2026