LoopQ: Quantization for Recursive Transformers
LoopQ is a post-training quantization framework for recursive (looped) Transformer language models, a setting that had not previously been studied systematically.
LoopQ addresses the fragility of looped language models under post-training quantization. With 4-bit weights and 4-bit activations (W4A4), it improves average downstream accuracy by 68.8% and reduces average perplexity by 87.7% compared with the strongest static PTQ baseline.
Looped language models (LoopLMs) improve parameter efficiency by recursively reusing Transformer blocks, enabling deeper computation under a fixed model size. However, this reuse makes LoopLMs more fragile under post-training quantization (PTQ). We present the first systematic study of quantization in LoopLMs and identify three challenges: distribution shift across roles, state reuse across loop transitions, and recursive error accumulation. To address these challenges, we propose LoopQ, a loop-aware PTQ framework that preserves a shared quantized backbone while introducing lightweight adaptations. LoopQ combines activation scaling, selective transformation, cross-loop state alignment, and trajectory-aware optimization to reduce distributional mismatch within loops and error accumulation across loops. Experiments across seven benchmarks show that, under W4A4 quantization, LoopQ improves average downstream accuracy by 68.8% and reduces average perplexity by 87.7% compared with the strongest static PTQ baseline.
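To make the recursive error accumulation challenge concrete, the following is a minimal sketch of a W4A4 "fake quantization" pass through a toy looped block. It is not the LoopQ method. It assumes a plain symmetric uniform quantizer, per-output-channel weight scales, dynamic per-token activation scales, and a hypothetical residual `tanh` block standing in for a Transformer block. The names `fake_quant` and `looped_forward` are illustrative only.

```python
import numpy as np

def fake_quant(x, bits=4, axis=None):
    """Symmetric uniform fake quantization: snap x to a signed `bits`-bit grid, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    max_abs = np.max(np.abs(x), axis=axis, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / qmax, 1.0)  # avoid division by zero
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def looped_forward(h, w, n_loops, quantize=False):
    """Apply the same toy block n_loops times and return the hidden state after each loop."""
    # The weights are shared across loops, so they are quantized once (assumption: per output channel).
    w_used = fake_quant(w, axis=0) if quantize else w
    states = []
    for _ in range(n_loops):
        # Activations are re-quantized on every loop (assumption: dynamic per-token scale).
        a = fake_quant(h, axis=-1) if quantize else h
        h = h + np.tanh(a @ w_used)                 # hypothetical residual block
        states.append(h)
    return states

rng = np.random.default_rng(0)
d = 64
w = rng.normal(scale=d ** -0.5, size=(d, d))
h0 = rng.normal(size=(8, d))

ref = looped_forward(h0, w, n_loops=4)
quant = looped_forward(h0, w, n_loops=4, quantize=True)

# Relative deviation of the quantized trajectory from the full-precision one, per loop.
rel_err = [float(np.linalg.norm(r - q) / np.linalg.norm(r)) for r, q in zip(ref, quant)]
print(rel_err)
```

Because each loop consumes the previous loop's already-perturbed state, the per-loop values in `rel_err` show how the quantized trajectory drifts from the full-precision one in this toy setting. The exact numbers depend on the random seed and the toy block, and they say nothing about the models evaluated in the paper.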