CV · Oct 21, 2025

SSD: Spatial-Semantic Head Decoupling for Efficient Autoregressive Image Generation

arXiv:2510.18716v1 · 1 citation · h-index: 1
Originality: Incremental advance
AI Analysis

This work targets the memory and compute cost of autoregressive image generation on resource-constrained hardware. It is an incremental advance that adapts KV cache compression techniques from language modeling to the image domain.

The paper tackles the high memory and computational demands of autoregressive image generation models by introducing a KV cache compression framework. The framework assigns each attention head to one of two types, according to whether it exhibits spatial locality or a semantic-sink pattern, and compresses the cache differently for each. The authors report a 5× reduction in memory usage and a 6.6× speedup in throughput with minimal quality loss.

Autoregressive image generation models like Janus-Pro produce high-quality images, but at the significant cost of high memory and ever-growing computational demands due to the large number of visual tokens. While KV cache compression has been extensively studied in language modeling, it remains largely unexplored for the image generation domain. In this work, we begin by identifying a distinct and prominent attention phenomenon, which we term spatial locality and emergent semantic sink. To leverage this key insight, we introduce a novel KV cache compression framework. Specifically, we compress the KV cache for all visual tokens by adaptively decoupling attention heads into two separate types: for spatial-locality heads, our method maintains a short recent token window; for semantic-sink heads, it strategically preserves a compact set of highly-attended tokens. Our extensive experiments demonstrate that the proposed method achieves a 5× reduction in memory usage and a notable 6.6× speedup in overall throughput with only minimal visual quality loss, thereby enabling highly efficient native autoregressive image generation on resource-constrained hardware.
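The two head types described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not state how heads are classified or how budgets are set, so the classification rule (share of attention mass on recent tokens), the `threshold` value, and the function names `classify_head` and `keep_indices` are all assumptions made for illustration.

```python
from typing import List, Sequence


def classify_head(attn_rows: Sequence[Sequence[float]],
                  window: int,
                  threshold: float = 0.5) -> str:
    """Label a head from sample attention rows (hypothetical rule).

    Each row holds one query's attention weights over the cached tokens.
    A head counts as "spatial" when at least `threshold` of its total
    attention mass lands on the last `window` tokens, otherwise "sink".
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    total = sum(sum(row) for row in attn_rows)
    recent = sum(sum(row[-window:]) for row in attn_rows)
    if total > 0 and recent / total >= threshold:
        return "spatial"
    return "sink"


def keep_indices(head_type: str,
                 attn_rows: Sequence[Sequence[float]],
                 num_tokens: int,
                 window: int,
                 budget: int) -> List[int]:
    """Return the cache positions this head would retain."""
    if head_type == "spatial":
        # Spatial-locality head: keep only a short window of recent tokens.
        return list(range(max(0, num_tokens - window), num_tokens))
    # Semantic-sink head: keep the `budget` tokens with the highest
    # accumulated attention across the sample rows.
    scores = [0.0] * num_tokens
    for row in attn_rows:
        for j, weight in enumerate(row):
            scores[j] += weight
    top = sorted(range(num_tokens), key=lambda j: scores[j], reverse=True)
    return sorted(top[:budget])


# Toy attention rows over 6 cached tokens (made-up numbers).
local_head = [[0.0, 0.0, 0.0, 0.1, 0.3, 0.6]]
sink_head = [[0.7, 0.0, 0.2, 0.0, 0.05, 0.05]]

local_type = classify_head(local_head, window=2)
sink_type = classify_head(sink_head, window=2)
local_keep = keep_indices(local_type, local_head, 6, window=2, budget=2)
sink_keep = keep_indices(sink_type, sink_head, 6, window=2, budget=2)
```

In this toy case each head retains 2 of 6 cached positions. The compression ratio in a real model would depend on the window and budget actually chosen, which the abstract does not specify.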

