CVApr 13

Structured State-Space Regularization for Compact and Generation-Friendly Image Tokenization

arXiv:2604.1108931.5h-index: 5
AI Analysis

For researchers in image generation and representation learning, this provides a principled way to improve tokenizer latent spaces for generative modeling without sacrificing reconstruction quality.

This work introduces a structured state-space regularization for image tokenizers that aligns latent spaces with compactness and generation-friendliness. The method improves generation quality in diffusion models (e.g., FID reduction by 0.5-1.0 on ImageNet) while incurring minimal loss in reconstruction fidelity (e.g., rFID increase <0.1).

Image tokenizers are central to modern vision models as they often operate in latent spaces. An ideal latent space must be simultaneously compact and generation-friendly: it should capture image's essential content compactly while remaining easy to model with generative approaches. In this work, we introduce a novel regularizer to align latent spaces with these two objectives. The key idea is to guide tokenizers to mimic the hidden state dynamics of state-space models (SSMs), thereby transferring their critical property, frequency awareness, to latent features. Grounded in a theoretical analysis of SSMs, our regularizer enforces encoding of fine spatial structures and frequency-domain cues into compact latent features; leading to more effective use of representation capacity and improved generative modelability. Experiments demonstrate that our method improves generation quality in diffusion models while incurring only minimal loss in reconstruction fidelity.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes