A Tilted Seesaw: Revisiting Autoencoder Trade-off for Controllable Diffusion
This work addresses a critical evaluation gap for researchers and practitioners in controllable diffusion and offers guidance for more reliable benchmarking and model selection, though the contribution is incremental, refining existing methods.
The paper identifies a systematic bias in autoencoder evaluation for latent diffusion models: generative metrics such as gFID are prioritized over reconstruction fidelity, which risks condition drift and limits controllability when scaling to controllable diffusion. It shows that reconstruction-oriented metrics predict controllability better than gFID does, and ControlNet experiments confirm this alignment.
In latent diffusion models, the autoencoder (AE) is typically expected to balance two capabilities: faithful reconstruction and a generation-friendly latent space (e.g., low gFID). In recent ImageNet-scale AE studies, we observe a systematic bias toward generative metrics in handling this trade-off: reconstruction metrics are increasingly under-reported, and ablation-based AE selection often favors the best-gFID configuration even when reconstruction fidelity degrades. We theoretically analyze why this gFID-dominant preference can appear unproblematic for ImageNet generation, yet becomes risky when scaling to controllable diffusion: AEs can induce condition drift, which limits achievable condition alignment. Meanwhile, we find that reconstruction fidelity, especially instance-level measures, better indicates controllability. We empirically validate the impact of tilted autoencoder evaluation on controllability by studying several recent ImageNet AEs. Using a multi-dimensional condition-drift evaluation protocol reflecting controllable generation tasks, we find that gFID is only weakly predictive of condition preservation, whereas reconstruction-oriented metrics are substantially more aligned. ControlNet experiments further confirm that controllability tracks condition preservation rather than gFID. Overall, our results expose a gap between ImageNet-centric AE evaluation and the requirements of scalable controllable diffusion, offering practical guidance for more reliable benchmarking and model selection.
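To make the notion of condition drift concrete, the following is a minimal sketch and not the paper's evaluation protocol. It assumes drift can be approximated as the gap between a condition extracted from an image and the same condition extracted from its autoencoder reconstruction. The names `edge_map`, `condition_drift`, `identity_ae`, and `lossy_ae` are hypothetical; the gradient-magnitude edge extractor and the pool-and-upsample "autoencoder" are toy stand-ins for a real condition extractor and a real AE.

```python
import numpy as np


def edge_map(img):
    """Toy condition extractor (assumption): gradient magnitude of a 2D image."""
    grad_rows, grad_cols = np.gradient(img.astype(np.float64))
    return np.hypot(grad_rows, grad_cols)


def condition_drift(img, autoencode, extract=edge_map):
    """Mean absolute gap between the condition of an image and of its reconstruction."""
    recon = autoencode(img)
    return float(np.mean(np.abs(extract(img) - extract(recon))))


def identity_ae(img):
    """Perfect reconstruction: the condition is preserved exactly."""
    return img


def lossy_ae(img, factor=4):
    """Stand-in lossy AE (assumption): average-pool by `factor`, then upsample
    by repeating each pooled value. Image sides must be divisible by `factor`."""
    h, w = img.shape
    pooled = img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    return np.repeat(np.repeat(pooled, factor, axis=0), factor, axis=1)


# Synthetic test image; a real study would use held-out natural images.
rng = np.random.default_rng(0)
image = rng.random((32, 32))

drift_identity = condition_drift(image, identity_ae)
drift_lossy = condition_drift(image, lossy_ae)
print(f"identity AE drift: {drift_identity:.4f}")
print(f"lossy AE drift:    {drift_lossy:.4f}")
```

Under these assumptions, a perfect reconstruction yields zero drift, while the lossy stand-in alters the extracted edges and yields positive drift. Swapping in other extractors (for example depth or segmentation) would give the kind of multi-dimensional view the abstract describes, though the paper's exact metrics are not specified here.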