CV · AI · Feb 2

PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss

arXiv:2602.02493v1 · 5 citations · h-index: 5 · Has Code
AI Analysis

This work addresses a key bottleneck in generative image synthesis: pixel-space diffusion has been harder to optimize than latent diffusion. By removing the VAE stage, it offers a simpler paradigm that could benefit researchers and practitioners in computer vision and AI.

Pixel diffusion models lag behind latent diffusion because high-dimensional pixel manifolds contain many perceptually irrelevant signals. The paper proposes PixelGen, a pixel diffusion framework with perceptual supervision that achieves an FID of 5.11 on ImageNet-256 without classifier-free guidance and a GenEval score of 0.79 in text-to-image generation.

Pixel diffusion generates images directly in pixel space in an end-to-end manner, avoiding the artifacts and bottlenecks introduced by VAEs in two-stage latent diffusion. However, it is challenging to optimize high-dimensional pixel manifolds that contain many perceptually irrelevant signals, leaving existing pixel diffusion methods lagging behind latent diffusion models. We propose PixelGen, a simple pixel diffusion framework with perceptual supervision. Instead of modeling the full image manifold, PixelGen introduces two complementary perceptual losses to guide the diffusion model towards learning a more meaningful perceptual manifold. An LPIPS loss facilitates learning better local patterns, while a DINO-based perceptual loss strengthens global semantics. With perceptual supervision, PixelGen surpasses strong latent diffusion baselines. It achieves an FID of 5.11 on ImageNet-256 without classifier-free guidance using only 80 training epochs, and demonstrates favorable scaling performance on large-scale text-to-image generation with a GenEval score of 0.79. PixelGen requires no VAEs, no latent representations, and no auxiliary stages, providing a simpler yet more powerful generative paradigm. Code is publicly available at https://github.com/Zehong-Ma/PixelGen.
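
To make the idea of two complementary perceptual losses concrete, here is a minimal sketch of how a local term and a global term could be added to a pixel-space objective. It is not the PixelGen implementation. Everything below is an assumption for illustration: the function names, the weights `w_local` and `w_global`, the plain pixel MSE term, the use of cosine distance for the global term, and the toy feature extractors, which merely stand in for the pretrained LPIPS and DINO networks named in the abstract.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]
FeatureFn = Callable[[Vector], List[float]]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def toy_local_features(image: Vector) -> List[float]:
    """Hypothetical stand-in for LPIPS features: neighbouring-pixel differences."""
    return [image[i + 1] - image[i] for i in range(len(image) - 1)]


def toy_global_features(image: Vector) -> List[float]:
    """Hypothetical stand-in for DINO features: whole-image statistics."""
    mean = sum(image) / len(image)
    return [mean, max(image) - min(image)]


def perceptual_diffusion_loss(
    pred_image: Vector,
    target_image: Vector,
    local_features: FeatureFn = toy_local_features,
    global_features: FeatureFn = toy_global_features,
    w_local: float = 1.0,   # assumed weight, not taken from the paper
    w_global: float = 1.0,  # assumed weight, not taken from the paper
) -> float:
    """Pixel term plus a local (LPIPS-style) and a global (DINO-style) term."""
    pixel_term = mse(pred_image, target_image)
    local_term = mse(local_features(pred_image), local_features(target_image))
    global_term = cosine_distance(
        global_features(pred_image), global_features(target_image)
    )
    return pixel_term + w_local * local_term + w_global * global_term


if __name__ == "__main__":
    target = [0.1, 0.5, 0.9, 0.3]     # flattened toy "image"
    predicted = [0.2, 0.4, 0.7, 0.3]  # model's estimate of the clean image
    print(perceptual_diffusion_loss(predicted, target))
```

In a real training loop the two feature functions would be frozen pretrained networks applied to the model's predicted clean image, and the weights would be tuned. Setting both weights to zero recovers a plain pixel-space objective, which makes it easy to ablate each term separately.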

