LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
This work addresses repetition loops, a failure mode in long-context LLM generation, and offers a plug-in KV cache intervention that improves reliability and output quality.
The paper identifies a failure mode in long-context generation where decoding degenerates into repetition loops, driven by collapsed attention patterns and stabilized by KV cache reuse. It introduces LoopBench, a benchmark for studying these loops, and LoopGuard, a KV cache intervention that reduces loop incidence by over 90 percentage points while improving output diversity.
Through systematic experiments on long-context generation, we observe a damaging failure mode in which decoding can collapse into persistent repetition loops. We find that this degeneration is driven by collapsed attention patterns, where a subset of heads locks onto a narrow suffix of the history, and is further stabilized by inference-time KV cache reuse. Crucially, since many existing KV cache policies rely on attention-based importance, this collapse can produce spuriously high scores for repetitive tokens, causing cache management to inadvertently amplify repetition. To study this phenomenon in a controlled and reproducible manner, we introduce LoopBench, a benchmark with explicit loop-inducing conditions and loop-oriented metrics that quantify repetition severity and generation instability beyond downstream task scores. Building on these insights, we propose LoopGuard, a lightweight, plug-in KV cache guard that detects loop onset online and disrupts the feedback cycle by pruning repetitive tail spans under a fixed cache budget. Experiments on LoopBench show that LoopGuard reduces loop incidence by over 90 percentage points, while restoring output diversity and reducing token waste.
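As a rough illustration of the detect-then-prune idea described above, the following Python sketch flags a loop when the tail of the generated token sequence consists of the same block repeated several times, then lists the cache positions covered by the redundant copies as eviction candidates. It is a toy under stated assumptions, not LoopGuard itself: the function names, the thresholds (max_period, min_repeats, keep_repeats), and the use of token-level periodicity as the loop signal are all hypothetical, and the actual method may rely on attention statistics instead.

```python
from typing import List, Optional, Tuple


def find_tail_loop(
    tokens: List[int], max_period: int = 8, min_repeats: int = 3
) -> Optional[Tuple[int, int]]:
    """Return (period, repeats) for the shortest block that repeats at least
    `min_repeats` times at the end of `tokens`, or None if no loop is found.

    Hypothetical detector: token-level periodicity is an assumption here.
    """
    n = len(tokens)
    for period in range(1, max_period + 1):
        if n < period * min_repeats:
            break  # not enough history for this or any longer period
        block = tokens[n - period:]
        repeats = 1
        start = n - 2 * period
        # Walk backwards one block at a time while the block keeps matching.
        while start >= 0 and tokens[start:start + period] == block:
            repeats += 1
            start -= period
        if repeats >= min_repeats:
            return period, repeats
    return None


def positions_to_evict(
    num_tokens: int, period: int, repeats: int, keep_repeats: int = 1
) -> List[int]:
    """List cache positions covered by the repeated tail span, keeping only
    the most recent `keep_repeats` copies (hypothetical pruning rule)."""
    loop_start = num_tokens - period * repeats
    keep_start = num_tokens - period * keep_repeats
    return list(range(loop_start, keep_start))


if __name__ == "__main__":
    history = [5, 6, 7, 1, 2, 3, 1, 2, 3, 1, 2, 3]
    loop = find_tail_loop(history)
    if loop is not None:
        period, repeats = loop
        print("loop:", loop)
        print("evict:", positions_to_evict(len(history), period, repeats))
```

In this toy setting, evicting the listed positions frees slots within a fixed cache budget and removes the entries that a collapsed attention head would otherwise keep attending to. How the freed budget is reused, and how the detector behaves on real model outputs, is left open here.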