LG · AI · Apr 16

The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference

arXiv:2604.15409 · 53.8 · h-index: 1

Predicted impact: top 45% in LG · last 90 days · Originality: Incremental advance

AI Analysis

Reveals a fundamental numerical instability in FP16 KV-cached inference: cache-ON and cache-OFF execution are not equivalent, which has implications for LLM practitioners concerned with reproducibility and accuracy in deployed systems.

KV caching in FP16 autoregressive inference causes deterministic token divergence from cache-free computation due to FP16 non-associativity. Token divergence is 100% across models and sampling strategies, and cache-ON yields higher accuracy in 8 of 9 conditions.

KV caching is a ubiquitous optimization in autoregressive transformer inference, long presumed to be numerically equivalent to cache-free computation. This assumption fails under standard FP16 precision: cache-ON and cache-OFF execution paths use different floating-point accumulation orderings which, because FP16 arithmetic is non-associative, produce a deterministic divergence in decoded token sequences. Across three open-weight models (LLaMA-2-7B, Mistral-7B-v0.3, Gemma-2-2B) evaluated on GSM8K, we observe a 100% token divergence rate across all sampling strategies, including greedy decoding, which rules out sampling randomness as a cause. Cache-ON also yields higher accuracy in 8 of 9 conditions, and this accuracy difference indicates that the direction of the divergence is systematic rather than random. Controlled FP32 falsification reduces divergence by eight orders of magnitude and drops the token flip rate to exactly 0.0%, confirming FP16 non-associativity as the sole causal driver. Layer-wise drift profiling reveals architecturally predictable propagation patterns: models using Grouped-Query Attention exhibit sharp divergence at the first layer, while Gemma's larger head dimension and sliding window attention produce uniform accumulation across all layers. Finally, activation patching of the entire residual stream fails to recover the cache-free trajectory, localizing the causal variable to the stateful KV cache. These findings establish that FP16 KV cache inference is fundamentally non-equivalent to recomputation and provide a mechanistic framework for understanding numerical instability in modern LLM inference systems.
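The non-associativity the abstract relies on can be seen in a few lines of NumPy. The sketch below is a toy illustration only, not the paper's experimental setup: it does not run a transformer or a KV cache, and the helper names `sequential_sum_fp16` and `chunked_sum_fp16` are hypothetical. It shows that summing the same FP16 values in two different orders can give different results, which is the kind of ordering difference the abstract attributes to the cache-ON and cache-OFF paths.

```python
import numpy as np

# Near 2048, adjacent FP16 values are 2 apart, so adding 1.0 to 2048.0
# lands exactly halfway between two representable values and is rounded.
a = np.float16(2048.0)
b = np.float16(1.0)
c = np.float16(1.0)

left = np.float16(np.float16(a + b) + c)   # (a + b) + c
right = np.float16(a + np.float16(b + c))  # a + (b + c)
print("(a + b) + c =", float(left))
print("a + (b + c) =", float(right))


def sequential_sum_fp16(values):
    """Accumulate left to right, rounding to FP16 after every addition."""
    acc = np.float16(0.0)
    for v in values:
        acc = np.float16(acc + np.float16(v))
    return acc


def chunked_sum_fp16(values, chunk):
    """Sum fixed-size chunks first, then sum the partial results (all in FP16)."""
    partials = [
        sequential_sum_fp16(values[i:i + chunk])
        for i in range(0, len(values), chunk)
    ]
    return sequential_sum_fp16(partials)


# Same numbers, two accumulation orders.
values = [2048.0, 1.0, 1.0, 1.0, 1.0]
print("sequential:", float(sequential_sum_fp16(values)))
print("chunked:   ", float(chunked_sum_fp16(values, 2)))
print("float32:   ", float(np.sum(np.array(values, dtype=np.float32))))
```

In this toy case the sequential order loses every small addend, while the chunked order combines the small values before adding them to the large one. The float32 sum is printed as a higher-precision reference, loosely mirroring the FP32 control described in the abstract.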
