SP · DC · Apr 28

SpecFed: Accelerating Federated LLM Inference with Speculative Decoding and Compressed Transmission

arXiv:2604.25777 · 11.3
Predicted impact top 7% in SP · last 90 days · Originality: Incremental advance
AI Analysis

This work addresses the critical communication bottleneck in federated LLM inference for edge computing, enabling practical deployment by reducing per-token transmission costs.

SpecFed accelerates federated LLM inference by integrating speculative decoding for parallel processing and a top-K compressed transmission scheme to reduce communication overhead, achieving high generation fidelity with significantly lower latency.

Federated inference enhances LLM performance in edge computing through weighted averaging of distributed model predictions. However, autoregressive LLM inference requires frequent full-model forward passes across workers, severely limiting decoding throughput. Distributed deployment aggravates this with a communication bottleneck: each worker must transmit a full token probability distribution for every draft token, and this traffic dominates end-to-end latency. To address these challenges, we introduce speculative decoding to enable parallel LLM processing and propose a top-K compressed transmission scheme with two server-side reconstruction strategies. We theoretically analyze the robustness of our method in terms of local reconstruction error, aggregation bias, and acceptance-rate bias, and derive corresponding bounds. Experiments demonstrate that our scheme achieves high generation fidelity while significantly reducing communication overhead.
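
The pipeline the abstract describes (top-K compression on each worker, reconstruction and weighted averaging on the server, then a speculative acceptance check) can be illustrated with a minimal Python sketch. This is not the paper's implementation. The abstract does not say what the two reconstruction strategies are, so `reconstruct_renormalize` and `reconstruct_uniform_tail` are assumed stand-ins, and `acceptance_prob` uses the standard speculative-sampling rule min(1, p/q), which may differ from the rule SpecFed uses. All function names are hypothetical.

```python
from typing import Dict, List


def top_k_compress(probs: List[float], k: int) -> Dict[int, float]:
    """Worker side: keep only the k most probable tokens as {token_id: prob}."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return {i: probs[i] for i in top}


def reconstruct_renormalize(sparse: Dict[int, float], vocab_size: int) -> List[float]:
    """Assumed strategy A: zero outside the top-K, rescale the kept mass to 1."""
    total = sum(sparse.values())
    dense = [0.0] * vocab_size
    for token, p in sparse.items():
        dense[token] = p / total
    return dense


def reconstruct_uniform_tail(sparse: Dict[int, float], vocab_size: int) -> List[float]:
    """Assumed strategy B: keep top-K values, spread the missing mass evenly
    over the tokens that were not transmitted."""
    residual = max(0.0, 1.0 - sum(sparse.values()))
    n_tail = vocab_size - len(sparse)
    fill = residual / n_tail if n_tail else 0.0
    dense = [fill] * vocab_size
    for token, p in sparse.items():
        dense[token] = p
    return dense


def aggregate(dists: List[List[float]], weights: List[float]) -> List[float]:
    """Server side: weighted average of the reconstructed worker distributions."""
    total_w = sum(weights)
    vocab_size = len(dists[0])
    return [
        sum(w * d[i] for w, d in zip(weights, dists)) / total_w
        for i in range(vocab_size)
    ]


def acceptance_prob(target: List[float], draft: List[float], token: int) -> float:
    """Standard speculative-sampling rule: accept a draft token with
    probability min(1, p_target / p_draft). Assumes draft[token] > 0,
    which holds whenever the token was sampled from the draft."""
    return min(1.0, target[token] / draft[token])
```

With a vocabulary of size V and top-K transmission, each worker sends K (token, probability) pairs per draft token instead of V probabilities. The two reconstructions trade off differently: renormalizing concentrates all mass on the transmitted tokens, while the uniform tail preserves the transmitted values and leaves nonzero probability elsewhere, which changes the aggregated distribution and therefore the acceptance probability.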

