LG | CV | Feb 19

Unified Latents (UL): How to train your latents

arXiv:2602.17270v1 | 9 citations | h-index: 35
Originality: Incremental advance
AI Analysis

This work addresses the challenge of efficient and high-quality latent representation learning for image and video generation, with incremental improvements in specific benchmarks.

The paper proposes Unified Latents (UL), a framework for learning latent representations that jointly regularizes the latents with a diffusion prior and decodes them with a diffusion model. It reports a competitive FID of 1.4 on ImageNet-512 and a new state-of-the-art FVD of 1.3 on Kinetics-600.

We present Unified Latents (UL), a framework for learning latent representations that are jointly regularized by a diffusion prior and decoded by a diffusion model. By linking the encoder's output noise to the prior's minimum noise level, we obtain a simple training objective that provides a tight upper bound on the latent bitrate. On ImageNet-512, our approach achieves a competitive FID of 1.4 with high reconstruction quality (PSNR), while requiring fewer training FLOPs than models trained on Stable Diffusion latents. On Kinetics-600, we set a new state-of-the-art FVD of 1.3.
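To make the phrase "linking the encoder's output noise to the prior's minimum noise level" concrete, here is a minimal sketch, not the paper's implementation. It assumes a variance-preserving noise schedule parameterized by log-SNR, and it adds Gaussian noise to a deterministic encoder output at the single noise level where the diffusion prior would start. The names `vp_coeffs`, `noisy_latent`, and `LOGSNR_MIN_NOISE`, as well as the value 5.0, are hypothetical and chosen only for illustration.

```python
import math
import random


def vp_coeffs(logsnr):
    """Variance-preserving coefficients for a given log-SNR.

    alpha^2 = sigmoid(logsnr) and sigma^2 = sigmoid(-logsnr),
    so alpha^2 + sigma^2 = 1 (an assumed schedule, not from the source).
    """
    alpha = math.sqrt(1.0 / (1.0 + math.exp(-logsnr)))
    sigma = math.sqrt(1.0 / (1.0 + math.exp(logsnr)))
    return alpha, sigma


def noisy_latent(z_clean, logsnr_min_noise, rng):
    """Noise a deterministic encoder output at the prior's minimum noise level.

    The encoder's output noise is not learned here: it is fixed to the
    sigma implied by the prior's least-noisy step.
    """
    alpha, sigma = vp_coeffs(logsnr_min_noise)
    return [alpha * z + sigma * rng.gauss(0.0, 1.0) for z in z_clean]


# Hypothetical log-SNR of the prior's least-noisy step.
LOGSNR_MIN_NOISE = 5.0

rng = random.Random(0)
z_clean = [0.5, -1.0, 2.0]  # stand-in for an encoder output
z0 = noisy_latent(z_clean, LOGSNR_MIN_NOISE, rng)
alpha0, sigma0 = vp_coeffs(LOGSNR_MIN_NOISE)
```

Under this assumption the encoder's noise level and the prior's starting noise level are the same quantity, which is the kind of coupling the abstract describes. How the resulting objective bounds the latent bitrate is not reproduced here.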
