Is Fiddler superseded?

Fiddler (Mixture-of-experts routing): heavily superseded, a standard baseline that newer methods routinely beat. 7 papers critique it and 2 beat it on benchmarks, which ranks it #5 of 1,370 most-superseded methods. Sub-problem: cluster led by MC-SMoE. Newer alternatives in the same sub-problem include Less is MoE, TIDE, CoX-MoE, HodgeCover, and a dynamic expert replication strategy.

Heavily superseded · #5 of 1,370 most-superseded

Fiddler

Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models

Mixture-of-experts routing · first seen Feb 10, 2024

What papers say

Verbatim critique sentences, each from a paper that cites Fiddler as a baseline.

“Fiddler dynamically places experts across CPU and GPU, yet lacks precise scheduling for hot experts (those processing more tokens).”
— PreScope: Unleashing the Power of Prefetching for Resource-Constrained MoE Inference
“For example, Fiddler uses fixed mapping based on expert activation frequency for CPU-GPU scheduling, which fails to adapt to changing loads.”
— HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference
“Additionally, approaches such as Fiddler leverage CPU for additional compute power, but do not fully explore the characteristics of MoE models.”
— MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache
“Fiddler reduces PCIe traffic by executing some expert computation on the CPU, but its gains are contingent on CPU capability and diminish as per-expert token counts grow, where CPU execution becomes slow and weight transfers to GPU become preferable.”
— MELINOE: Fine-Tuning Enables Memory-Efficient Inference for Mixture-of-Experts Models
“While Fiddler and DAOP aim for DRAM-offloading-based inference, their CPU-based computation cannot be fully utilized due to memory bottlenecks.”
— FlashMoE: Reducing SSD I/O Bottlenecks via ML-Based Cache Replacement for Mixture-of-Experts Inference on Edge Devices
“we use Fiddler as a CPU computation baseline, where expert placements remain static during decoding”
— TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload
“Heterogeneous strategies like Fiddler offload certain computations to the CPU but encounter compute-bound bottlenecks during dequantization, leading to latency penalties that outweigh transmission savings”
— DyMoE: Dynamic Expert Orchestration with Mixed-Precision Quantization for Efficient MoE Inference on Edge
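
Several of the critiques above, MELINOE's in particular, describe one trade-off: running an expert on the CPU avoids a PCIe weight transfer, but CPU time grows with the number of tokens routed to that expert, so past some token count copying the weights to the GPU becomes cheaper. The sketch below is a toy cost model for that trade-off and is not taken from Fiddler or from any of the cited papers. The `CostModel` class, its field names, and every latency constant are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostModel:
    """Hypothetical per-expert latency model; every value is an assumption."""
    cpu_ms_per_token: float  # CPU time to run one expert on one token
    gpu_ms_per_token: float  # GPU time to run one expert on one token
    transfer_ms: float       # one-off PCIe cost to copy the expert's weights to the GPU


def cpu_latency_ms(model: CostModel, tokens: int) -> float:
    # CPU path: no weight transfer, but slower per-token compute.
    return model.cpu_ms_per_token * tokens


def gpu_latency_ms(model: CostModel, tokens: int) -> float:
    # GPU path: pay the weight transfer once, then faster per-token compute.
    return model.transfer_ms + model.gpu_ms_per_token * tokens


def choose_device(model: CostModel, tokens: int) -> str:
    # Pick whichever path has the lower estimated latency (ties go to the CPU).
    if cpu_latency_ms(model, tokens) <= gpu_latency_ms(model, tokens):
        return "cpu"
    return "gpu"


def break_even_tokens(model: CostModel) -> float:
    # Token count at which both paths cost the same under this linear model.
    per_token_gap = model.cpu_ms_per_token - model.gpu_ms_per_token
    if per_token_gap <= 0:
        return float("inf")  # the CPU is never slower per token, so it always wins
    return model.transfer_ms / per_token_gap


if __name__ == "__main__":
    model = CostModel(cpu_ms_per_token=2.0, gpu_ms_per_token=0.5, transfer_ms=30.0)
    print("break-even tokens:", break_even_tokens(model))
    for tokens in (1, 8, 64):
        print(tokens, "tokens ->", choose_device(model, tokens))
```

Under these made-up constants the break-even point is 20 tokens per expert. A real system would need measured latencies, and the critiques above suggest that a static placement misses exactly this kind of load-dependent switch.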

Beaten on benchmarks

Head-to-head results where a newer method reports beating Fiddler. Values are copied from each source paper's tables; verify them against the cited paper.

MELINOE (Fine-Tune: Dolly15K) beats Fiddler · throughput [Eval: Dolly15K, Phi-3.5-MoE]
14.34 vs 5.88
MELINOE: Fine-Tuning Enables Memory-Efficient Inference for Mixture-of-Experts Models
MELINOE (Fine-Tune: Dolly15K) beats Fiddler · throughput [Eval: Dolly15K, Mixtral-8x7B]
9.35 vs 5.24
MELINOE: Fine-Tuning Enables Memory-Efficient Inference for Mixture-of-Experts Models
MELINOE (Fine-Tune: GSM8K) beats Fiddler · throughput [Eval: GSM8K, Phi-3.5-MoE]
15.67 vs 7.26
MELINOE: Fine-Tuning Enables Memory-Efficient Inference for Mixture-of-Experts Models
MELINOE (Fine-Tune: GSM8K) beats Fiddler · throughput [Eval: GSM8K, Mixtral-8x7B]
10.38 vs 4.11
MELINOE: Fine-Tuning Enables Memory-Efficient Inference for Mixture-of-Experts Models
TIDE beats Fiddler · Throughput (token/s) [LLaDA2.0-mini Gen Length 256 GPU Expert Budget 64 GPU Memory Constraint 10GB]
2.11 vs 1.81
TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload
TIDE beats Fiddler · Throughput (token/s) [LLaDA2.0-mini Gen Length 256 GPU Expert Budget 128 GPU Memory Constraint 18GB]
2.36 vs 1.74
TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload
TIDE beats Fiddler · Throughput (token/s) [LLaDA2.0-mini Gen Length 1024 GPU Expert Budget 64 GPU Memory Constraint 10GB]
1.89 vs 1.79
TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload
TIDE beats Fiddler · Throughput (token/s) [LLaDA2.0-mini Gen Length 1024 GPU Expert Budget 128 GPU Memory Constraint 18GB]
2.44 vs 1.80
TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload
TIDE beats Fiddler · Throughput (token/s) [LLaDA2.0-flash Gen Length 256 GPU Expert Budget 32 GPU Memory Constraint 30GB]
1.25 vs 0.95
TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload
TIDE beats Fiddler · Throughput (token/s) [LLaDA2.0-flash Gen Length 256 GPU Expert Budget 64 GPU Memory Constraint 55GB]
1.73 vs 1.14
TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload
TIDE beats Fiddler · Throughput (token/s) [LLaDA2.0-flash Gen Length 1024 GPU Expert Budget 32 GPU Memory Constraint 30GB]
1.24 vs 0.89
TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload
TIDE beats Fiddler · Throughput (token/s) [LLaDA2.0-flash Gen Length 1024 GPU Expert Budget 64 GPU Memory Constraint 55GB]
1.45 vs 1.05
TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload
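
To make the gaps in the entries above easier to compare, the following sketch computes the throughput ratio (newer method divided by Fiddler) for each row and summarizes the range per method. The numbers are copied from the listing above; the short setting labels, the `speedup` and `summarize` helpers, and the tuple layout are illustrative assumptions rather than part of any cited paper.

```python
# (method, setting, newer-method throughput, Fiddler throughput), copied from the entries above.
RESULTS = [
    ("MELINOE", "Dolly15K, Phi-3.5-MoE", 14.34, 5.88),
    ("MELINOE", "Dolly15K, Mixtral-8x7B", 9.35, 5.24),
    ("MELINOE", "GSM8K, Phi-3.5-MoE", 15.67, 7.26),
    ("MELINOE", "GSM8K, Mixtral-8x7B", 10.38, 4.11),
    ("TIDE", "LLaDA2.0-mini, len 256, budget 64, 10GB", 2.11, 1.81),
    ("TIDE", "LLaDA2.0-mini, len 256, budget 128, 18GB", 2.36, 1.74),
    ("TIDE", "LLaDA2.0-mini, len 1024, budget 64, 10GB", 1.89, 1.79),
    ("TIDE", "LLaDA2.0-mini, len 1024, budget 128, 18GB", 2.44, 1.80),
    ("TIDE", "LLaDA2.0-flash, len 256, budget 32, 30GB", 1.25, 0.95),
    ("TIDE", "LLaDA2.0-flash, len 256, budget 64, 55GB", 1.73, 1.14),
    ("TIDE", "LLaDA2.0-flash, len 1024, budget 32, 30GB", 1.24, 0.89),
    ("TIDE", "LLaDA2.0-flash, len 1024, budget 64, 55GB", 1.45, 1.05),
]


def speedup(new: float, baseline: float) -> float:
    # Ratio of the newer method's throughput to Fiddler's in the same setting.
    return new / baseline


def summarize(results):
    # Map each method to the (min, max) speedup over Fiddler across its rows.
    by_method = {}
    for method, _setting, new, baseline in results:
        by_method.setdefault(method, []).append(speedup(new, baseline))
    return {method: (min(vals), max(vals)) for method, vals in by_method.items()}


if __name__ == "__main__":
    for method, setting, new, baseline in RESULTS:
        print(f"{method:8s} {setting:45s} {speedup(new, baseline):.2f}x")
    for method, (low, high) in summarize(RESULTS).items():
        print(f"{method}: {low:.2f}x to {high:.2f}x over Fiddler")
```

On these figures, MELINOE's reported throughput is roughly 1.78x to 2.53x Fiddler's, and TIDE's is roughly 1.06x to 1.52x. The two ranges come from different papers and evaluation setups, and the MELINOE rows involve fine-tuning, so the ratios are best read within each paper rather than across them.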

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.