CV · AI · May 26

Cross-scale Aligned Supervision for Training GANs

arXiv:2605.26449
AI Analysis

For researchers in generative modeling: this work argues that scale-wise adversarial supervision in multi-stage GANs does not by itself produce a coherent coarse-to-fine hierarchy, and it reports strong one-step results on class-conditional ImageNet-256 after fixing that.

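The paper identifies a cross-scale trajectory misalignment problem in GANs with multi-stage synthesis: intermediate outputs are each pushed toward realism at their own resolution, but nothing ties them to the same generated sample. It proposes CAT, a Cross-scale Aligned Transformer, which keeps the discriminator scale-wise and adds a generator-side consistency regularization that aligns intermediate outputs with the final output. On class-conditional ImageNet-256, CAT-H/2 achieves an FID-50K of 1.56 with one-step inference after 60 training epochs, outperforming strong one-step GAN and diffusion/flow baselines.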

Modern GANs often introduce adversarial supervision on intermediate generator outputs and interpret the resulting multi-stage synthesis as coarse-to-fine hierarchical generation. In this work, we challenge this interpretation. We argue that standard scale-wise adversarial supervision does not construct a proper coarse-to-fine hierarchy: each intermediate image is independently pushed toward the real distribution at its own resolution, but this scale-wise realism does not ensure that outputs across stages represent the identical generated sample. Moreover, the scale-specific image produced at each stage is not used as an explicit refinement target for the subsequent stage. Therefore, its adversarial loss can improve a scale-specific output without constraining later stages to preserve the same sample trajectory, allowing them to move toward a different sample rather than refine the previous output. We refer to this problem as a cross-scale trajectory misalignment problem. To resolve it, we propose CAT, a Cross-scale Aligned Transformer for multi-scale adversarial generation. CAT keeps the discriminator scale-wise, so each intermediate output is evaluated at its own resolution, while adding a simple generator-side consistency regularization that aligns intermediate outputs with the final output. On class-conditional ImageNet-256, CAT-H/2 achieves an FID-50K of 1.56 with one-step inference after only 60 training epochs, outperforming strong one-step GAN and diffusion/flow baselines.
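The abstract does not state the exact form of the consistency regularization, so the following is only a minimal sketch of one plausible reading: the final output is downsampled to each intermediate resolution and compared with that stage's output. The function names (`avg_pool_to`, `cross_scale_consistency`), the use of block averaging for downsampling, and the mean squared error penalty are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def avg_pool_to(image, size):
    """Downsample a square (H, W, C) image to (size, size, C) by block averaging.

    Assumption: block averaging stands in for whatever resizing the paper uses.
    """
    h, w, c = image.shape
    if h != w or h % size != 0:
        raise ValueError("expected a square image whose side is a multiple of size")
    f = h // size
    # Split each spatial axis into (size, f) blocks and average within blocks.
    return image.reshape(size, f, size, f, c).mean(axis=(1, 3))


def cross_scale_consistency(intermediates, final):
    """Hypothetical consistency penalty, averaged over stages.

    Each intermediate output is compared (mean squared error) with the final
    output downsampled to the same resolution.
    """
    losses = []
    for x in intermediates:
        target = avg_pool_to(final, x.shape[0])
        losses.append(np.mean((x - target) ** 2))
    return float(np.mean(losses))


rng = np.random.default_rng(0)
final = rng.standard_normal((256, 256, 3))

# Intermediates that are coarse views of the same sample incur no penalty.
aligned = [avg_pool_to(final, s) for s in (64, 128)]
print(cross_scale_consistency(aligned, final))  # 0.0

# Intermediates drawn independently (a "different sample") are penalized.
unrelated = [rng.standard_normal((s, s, 3)) for s in (64, 128)]
print(cross_scale_consistency(unrelated, final))
```

In an actual training loop this term would be added to the generator loss alongside the scale-wise adversarial losses. Whether gradients flow through the downsampled final output or only through the intermediate outputs is a design choice the abstract does not settle.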
