CV, Jan 29

A Tilted Seesaw: Revisiting Autoencoder Trade-off for Controllable Diffusion

arXiv:2601.21633v1, h-index: 5
Originality: Incremental advance
AI Analysis

This paper addresses a critical evaluation gap for researchers and practitioners working on controllable diffusion and offers guidance for more reliable benchmarking and model selection, though the advance is incremental, refining existing methods.

The paper identifies a systematic bias in autoencoder evaluation for latent diffusion models, where generative metrics like gFID are prioritized over reconstruction fidelity, which risks condition drift and limits controllability in scalable diffusion tasks. It shows that reconstruction-oriented metrics better predict controllability, with ControlNet experiments confirming this alignment.

In latent diffusion models, the autoencoder (AE) is typically expected to balance two capabilities: faithful reconstruction and a generation-friendly latent space (e.g., low gFID). In recent ImageNet-scale AE studies, we observe a systematic bias toward generative metrics in handling this trade-off: reconstruction metrics are increasingly under-reported, and ablation-based AE selection often favors the best-gFID configuration even when reconstruction fidelity degrades. We theoretically analyze why this gFID-dominant preference can appear unproblematic for ImageNet generation, yet becomes risky when scaling to controllable diffusion: AEs can induce condition drift, which limits achievable condition alignment. Meanwhile, we find that reconstruction fidelity, especially instance-level measures, better indicates controllability. We empirically validate the impact of tilted autoencoder evaluation on controllability by studying several recent ImageNet AEs. Using a multi-dimensional condition-drift evaluation protocol reflecting controllable generation tasks, we find that gFID is only weakly predictive of condition preservation, whereas reconstruction-oriented metrics are substantially more aligned. ControlNet experiments further confirm that controllability tracks condition preservation rather than gFID. Overall, our results expose a gap between ImageNet-centric AE evaluation and the requirements of scalable controllable diffusion, offering practical guidance for more reliable benchmarking and model selection.
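The idea of condition drift can be illustrated with a small sketch. Everything in it is an assumption made for illustration and is not taken from the paper: `toy_autoencode` is a hypothetical stand-in for an encode/decode round trip (average pooling followed by upsampling), `edge_condition` is a hypothetical condition extractor (a gradient-magnitude edge map), and `condition_drift` uses a simple mean absolute difference, which may differ from the paper's multi-dimensional protocol. PSNR is included as one example of an instance-level reconstruction measure.

```python
import numpy as np

def toy_autoencode(x, factor=4):
    """Hypothetical stand-in for decode(encode(x)), NOT a real AE:
    average-pool by `factor`, then nearest-neighbour upsample."""
    h, w = x.shape
    pooled = x.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    return np.repeat(np.repeat(pooled, factor, axis=0), factor, axis=1)

def edge_condition(x):
    """Hypothetical condition extractor: gradient-magnitude edge map."""
    gy, gx = np.gradient(x)  # derivatives along rows, then columns
    return np.hypot(gx, gy)

def psnr(x, x_hat, peak=1.0):
    """Instance-level reconstruction fidelity in dB (higher is better)."""
    mse = np.mean((x - x_hat) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(peak ** 2 / mse))

def condition_drift(x, x_hat, extract=edge_condition):
    """Assumed drift measure: mean absolute difference between the condition
    extracted from the original and from its reconstruction (lower is better)."""
    return float(np.mean(np.abs(extract(x) - extract(x_hat))))

# Synthetic image in [0, 1]; a placeholder input, not real data.
rng = np.random.default_rng(0)
image = rng.random((32, 32))

lossy = toy_autoencode(image, factor=4)
identity_drift = condition_drift(image, image)  # perfect reconstruction
lossy_drift = condition_drift(image, lossy)     # lossy reconstruction
lossy_psnr = psnr(image, lossy)
```

In this toy setup a perfect reconstruction has zero drift, while the lossy round trip changes the extracted edge map. Note that both quantities are computed per image from the reconstruction alone, whereas gFID is a distribution-level statistic over generated samples and does not inspect any individual reconstruction.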
