TaxBreak: Unmasking the Hidden Costs of LLM Inference Through Overhead Decomposition

Prabhu Vellaisamy, Shreesh Tripathi, Vignesh Natarajan, Surya Santhan Thenarasu, Shawn Blanton, John P. Shen

arXiv:2603.12465

AI Analysis

This work targets latency optimization in LLM inference for interactive systems. It provides a diagnostic tool that indicates whether optimization should focus on the software stack or on device-side work, though the contribution is incremental because it builds on existing overhead-analysis methods.

To identify the dominant overheads in LLM inference, the paper introduces TaxBreak, a trace-driven methodology that decomposes host-side orchestration overhead into three components: framework translation time, CUDA library translation time, and kernel launch-path time. For host-bound workloads such as mixture-of-experts (MoE) models, it shows that a faster CPU can reduce orchestration overhead by 10-29% and improve end-to-end latency by up to 14%.

Large Language Model (LLM) inference is widely used in interactive assistants and agentic systems. In latency-sensitive deployments, inference time can become dominated by host-side overheads. Existing approaches typically expose this cost only as an aggregate residual or a launch/queue metric, which is often insufficient to identify which execution layer should be optimized. This work presents TaxBreak, a trace-driven methodology for decomposing host-visible orchestration overhead into three components: framework translation time, CUDA library translation time, and kernel launch-path time. We validate TaxBreak on NVIDIA H100 and H200 systems and use it to derive our proposed Host-Device Balance Index (HDBI), a boundedness summary index that relates device-active execution to host-visible orchestration. Across representative dense and mixture-of-experts workloads in both prefill and decode, we show that aggregate latency, GPU inactivity, or boundedness ratios alone can obscure the dominant optimization target. TaxBreak instead distinguishes cases where optimization should reduce software-stack overhead from cases where the primary win comes from reducing device-side work. We further show that MoE models dispatch 8-11x more kernels per output token than dense models, and that for such host-bound workloads, CPU single-thread performance is a first-order parameter: a faster host CPU reduces orchestration overhead by 10-29% and improves end-to-end latency by up to 14%, even when paired with a slower-clocked GPU. These results position TaxBreak as a diagnostic tool for assessing whether optimization effort should target the software stack or the device-side workload execution.
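
The three-way decomposition described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the record fields, the function names, and in particular the `balance_index` ratio, which is a simple bounded stand-in and not the paper's HDBI definition (the abstract does not state the formula). It also assumes that per-kernel host and device times have already been extracted from a trace.

```python
from dataclasses import dataclass


@dataclass
class KernelRecord:
    """Times attributed to one kernel dispatch, in microseconds.

    Field names are hypothetical; they mirror the three host-side
    components named in the abstract plus device-active time.
    """
    framework_us: float  # framework translation time
    library_us: float    # CUDA library translation time
    launch_us: float     # kernel launch-path time
    device_us: float     # device-active execution time


def decompose(records):
    """Sum each component over all kernel records."""
    totals = {"framework": 0.0, "library": 0.0, "launch": 0.0, "device": 0.0}
    for r in records:
        totals["framework"] += r.framework_us
        totals["library"] += r.library_us
        totals["launch"] += r.launch_us
        totals["device"] += r.device_us
    return totals


def balance_index(totals):
    """Illustrative bounded ratio in [0, 1]: share of device-active time.

    Assumption: this is NOT the paper's HDBI, only a simple stand-in
    that relates device-active time to host-visible orchestration.
    """
    host = totals["framework"] + totals["library"] + totals["launch"]
    total = host + totals["device"]
    return totals["device"] / total if total > 0 else 0.0


def dominant_host_component(totals):
    """Return the host-side component with the largest total time."""
    host = {k: v for k, v in totals.items() if k != "device"}
    return max(host, key=host.get)


if __name__ == "__main__":
    # Made-up numbers, purely to exercise the functions.
    trace = [
        KernelRecord(4.0, 2.0, 2.0, 8.0),
        KernelRecord(6.0, 1.0, 1.0, 8.0),
    ]
    totals = decompose(trace)
    print(totals)
    print(balance_index(totals))
    print(dominant_host_component(totals))
```

In this sketch, a low index together with a dominant `framework` total would point toward software-stack optimization, while a high index would point toward reducing device-side work. This is the kind of distinction the abstract says aggregate latency alone can obscure.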
