Theoretically Optimal Attention/FFN Ratios in Disaggregated LLM Serving
This work addresses a specific resource-allocation bottleneck in LLM serving and represents an incremental improvement in system optimization.
The paper tackles the problem of sizing Attention and FFN resources in disaggregated LLM serving so as to avoid blocking and idle time. It derives closed-form rules for the optimal Attention/FFN ratio; the resulting ratio is within 10% of the simulation-optimal ratio and reduces idle time.
Attention-FFN disaggregation (AFD) is an emerging architecture for LLM decoding that separates state-heavy, KV-cache-dominated Attention computation from stateless, compute-intensive FFN computation, with the two sides connected by per-step communication. While AFD enables independent scaling of memory and compute resources, its performance is highly sensitive to the Attention/FFN provisioning ratio: mis-sizing induces step-level blocking and costly device idle time. We develop a tractable analytical framework for sizing AFD bundles in an $r$A-$1$F topology. The key difficulty is that Attention-side work is nonstationary (token context grows, and requests are continuously replenished with random lengths), while FFN work is stable given the aggregated batch. Using a probabilistic workload model, we derive closed-form rules for the optimal A/F ratio, that is, the ratio that maximizes average throughput per instance across the system. A trace-calibrated AFD simulator validates the theory: across workloads, the theoretical optimal A/F ratio is within 10% of the simulation-optimal ratio and consistently reduces idle time.
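To make the sizing trade-off concrete, the following is a minimal deterministic sketch of an $r$A-$1$F bundle. It is not the paper's probabilistic model or its closed-form rule. The cost functions, every constant, and the names `ToyCosts`, `side_times_us`, `throughput_per_instance`, `idle_fractions`, and `best_ratio` are assumptions made for illustration. The sketch assumes a steady-state pipelined step that lasts as long as the slower side, a fixed mean context length in place of growing and randomly replenished contexts, and an FFN cost with a fixed floor plus a linear term in the aggregated batch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyCosts:
    """Hypothetical per-step costs in microseconds; not taken from the paper."""
    batch_per_attention: int = 64        # sequences decoded per Attention instance
    mean_context_tokens: float = 2000.0  # assumed fixed mean KV-cache length
    attn_us_per_token: float = 0.01      # KV-cache read cost per cached token
    ffn_floor_us: float = 500.0          # FFN cost floor (e.g. weight loading)
    ffn_us_per_sequence: float = 4.0     # FFN cost per sequence in the merged batch
    comm_us: float = 0.0                 # per-step communication overhead


def side_times_us(r: int, c: ToyCosts) -> tuple[float, float]:
    """Return (attention_time, ffn_time) for one decode step with r Attention instances."""
    t_attn = c.attn_us_per_token * c.batch_per_attention * c.mean_context_tokens
    # The single FFN instance processes the batches of all r Attention instances.
    t_ffn = max(c.ffn_floor_us, c.ffn_us_per_sequence * r * c.batch_per_attention)
    return t_attn, t_ffn


def throughput_per_instance(r: int, c: ToyCosts) -> float:
    """Tokens per microsecond, averaged over the r + 1 instances in the bundle."""
    t_attn, t_ffn = side_times_us(r, c)
    step = max(t_attn, t_ffn) + c.comm_us  # the slower side sets the step time
    return (r * c.batch_per_attention) / ((r + 1) * step)


def idle_fractions(r: int, c: ToyCosts) -> tuple[float, float]:
    """Fraction of each step that the Attention side and the FFN side spend waiting."""
    t_attn, t_ffn = side_times_us(r, c)
    step = max(t_attn, t_ffn) + c.comm_us
    return 1.0 - t_attn / step, 1.0 - t_ffn / step


def best_ratio(c: ToyCosts, max_r: int = 32) -> int:
    """Brute-force scan for the r that maximizes throughput per instance."""
    return max(range(1, max_r + 1), key=lambda r: throughput_per_instance(r, c))
```

With these made-up defaults, `best_ratio(ToyCosts())` selects `r = 5`, the point at which the two sides take about the same time per step. Below it the FFN instance waits on Attention, and above it the Attention instances wait on the FFN, which is the blocking and idle-time behavior the abstract describes. The analytical framework in the paper replaces the fixed mean context with a probabilistic workload model.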