CV | Oct 16, 2025

Adapting Self-Supervised Representations as a Latent Space for Efficient Generation

arXiv:2510.14630v1 | 7 citations | h-index: 34
Originality: Incremental advance
AI Analysis

This work addresses the need for more efficient generative modeling in computer vision. It is an incremental advance because it builds on existing self-supervised learning methods.

The paper tackles efficient image generation by introducing Representation Tokenizer (RepTok), which represents an image with a single continuous latent token taken from a self-supervised vision transformer. This removes the spatial redundancy of 2D latent spaces and lowers training cost. RepTok achieves competitive results on class-conditional ImageNet generation and on zero-shot text-to-image synthesis on MS-COCO under limited training budgets.

We introduce Representation Tokenizer (RepTok), a generative modeling framework that represents an image using a single continuous latent token obtained from self-supervised vision transformers. Building on a pre-trained SSL encoder, we fine-tune only the semantic token embedding and pair it with a generative decoder trained jointly using a standard flow matching objective. This adaptation enriches the token with low-level, reconstruction-relevant details, enabling faithful image reconstruction. To preserve the favorable geometry of the original SSL space, we add a cosine-similarity loss that regularizes the adapted token, ensuring the latent space remains smooth and suitable for generation. Our single-token formulation resolves spatial redundancies of 2D latent spaces and significantly reduces training costs. Despite its simplicity and efficiency, RepTok achieves competitive results on class-conditional ImageNet generation and naturally extends to text-to-image synthesis, reaching competitive zero-shot performance on MS-COCO under extremely limited training budgets. Our findings highlight the potential of fine-tuned SSL representations as compact and effective latent spaces for efficient generative modeling.
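The two loss terms named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the common linear-interpolation form of flow matching (the abstract only says "standard flow matching objective"), a regularizer of the form 1 - cos(adapted, original), and a hypothetical weight `lam`. All function names are illustrative, and `pred_velocity` stands in for the output of a decoder that the abstract does not specify.

```python
import math


def cosine_regularizer(adapted, original):
    """Return 1 - cos(adapted, original): 0 when the adapted token keeps
    the direction of the original SSL token (assumed form of the loss)."""
    dot = sum(a * b for a, b in zip(adapted, original))
    norm_a = math.sqrt(sum(a * a for a in adapted))
    norm_b = math.sqrt(sum(b * b for b in original))
    return 1.0 - dot / (norm_a * norm_b)


def flow_matching_pair(noise, data, t):
    """Assumed linear path: x_t = (1 - t) * noise + t * data.
    The regression target for the velocity is then data - noise."""
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    velocity = [d - n for n, d in zip(noise, data)]
    return x_t, velocity


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def total_loss(pred_velocity, target_velocity, adapted, original, lam=1.0):
    """Flow matching term plus the weighted cosine regularizer.
    `lam` is a hypothetical weight, not a value from the paper."""
    fm = mse(pred_velocity, target_velocity)
    return fm + lam * cosine_regularizer(adapted, original)
```

In a real training loop the decoder would receive `x_t`, `t`, and the adapted token, and return `pred_velocity`. The regularizer only constrains the direction of the adapted token, which is one plausible reading of how the geometry of the SSL space is preserved.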

