CV | AI | LG | Oct 5, 2025

MASC: Boosting Autoregressive Image Generation with a Manifold-Aligned Semantic Clustering

arXiv:2510.04220v1 | h-index: 2
Originality: Highly original
AI Analysis

The paper addresses a fundamental bottleneck in autoregressive image generation and offers researchers and practitioners a plug-and-play way to improve both training efficiency and generation quality.

The paper tackles the inefficiency of autoregressive image generation models by proposing MASC, a framework that structures the token vocabulary into a hierarchical semantic tree. The authors report that this accelerates training by up to 57% and reduces the FID of LlamaGen-XL from 2.87 to 2.58.

Autoregressive (AR) models have shown great promise in image generation, yet they face a fundamental inefficiency stemming from their core component: a vast, unstructured vocabulary of visual tokens. This conventional approach treats tokens as a flat vocabulary, disregarding the intrinsic structure of the token embedding space where proximity often correlates with semantic similarity. This oversight results in a highly complex prediction task, which hinders training efficiency and limits final generation quality. To resolve this, we propose Manifold-Aligned Semantic Clustering (MASC), a principled framework that constructs a hierarchical semantic tree directly from the codebook's intrinsic structure. MASC employs a novel geometry-aware distance metric and a density-driven agglomerative construction to model the underlying manifold of the token embeddings. By transforming the flat, high-dimensional prediction task into a structured, hierarchical one, MASC introduces a beneficial inductive bias that significantly simplifies the learning problem for the AR model. MASC is designed as a plug-and-play module, and our extensive experiments validate its effectiveness: it accelerates training by up to 57% and significantly improves generation quality, reducing the FID of LlamaGen-XL from 2.87 to 2.58. MASC elevates existing AR frameworks to be highly competitive with state-of-the-art methods, establishing that structuring the prediction space is as crucial as architectural innovation for scalable generative modeling.
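
To make the idea of a structured prediction space concrete, here is a minimal sketch of grouping a toy codebook into semantic clusters and then scoring a token coarse-to-fine. It is not the authors' method: the abstract does not specify the geometry-aware metric or the density-driven construction, so this sketch swaps in plain cosine distance and average-linkage agglomerative clustering. The names `cluster_codebook` and `token_probability`, the six-token 2-D codebook, and the two-level factorization p(token) = p(cluster) * p(token | cluster) are all illustrative assumptions.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def cluster_codebook(codebook, num_clusters):
    """Average-linkage agglomerative clustering (a stand-in for MASC's
    tree construction). Returns a cluster id for every token index."""
    clusters = [[i] for i in range(len(codebook))]

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        total = sum(cosine_distance(codebook[i], codebook[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > num_clusters:
        # Merge the closest pair; combinations() guarantees a < b.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    token_to_cluster = [0] * len(codebook)
    for cluster_id, members in enumerate(clusters):
        for token in members:
            token_to_cluster[token] = cluster_id
    return token_to_cluster


def token_probability(token, token_to_cluster, p_cluster, p_token_given_cluster):
    """Two-level factorization: pick the cluster first, then the token in it."""
    c = token_to_cluster[token]
    return p_cluster[c] * p_token_given_cluster[c][token]


# Toy codebook (hypothetical): tokens 0-2 point roughly along x, 3-5 along y.
codebook = [(1.0, 0.1), (0.9, 0.0), (1.0, 0.2), (0.1, 1.0), (0.0, 0.8), (0.2, 1.0)]
token_to_cluster = cluster_codebook(codebook, num_clusters=2)

# Hypothetical model outputs: a coarse distribution over clusters and a
# fine distribution over the tokens inside each cluster.
p_cluster = [0.75, 0.25]
p_token_given_cluster = [
    {0: 0.5, 1: 0.25, 2: 0.25},
    {3: 0.5, 4: 0.25, 5: 0.25},
]
total = sum(
    token_probability(t, token_to_cluster, p_cluster, p_token_given_cluster)
    for t in range(len(codebook))
)
print(token_to_cluster, total)
```

In this toy setup the model chooses among 2 clusters and then among 3 tokens, rather than among 6 tokens at once. With a real codebook of many thousands of entries, the same factorization shrinks each individual prediction step, which is one plausible reading of the inductive bias the abstract describes.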

