Laminating Representation Autoencoders for Efficient Diffusion
This work targets the computational cost of diffusion models that generate images from self-supervised learning (SSL) features, reporting up to 8x fewer FLOPs per forward pass. The contribution is incremental: it builds on existing SSL encoders and diffusion methods.
The paper tackles the redundancy in the dense patch features used by diffusion models by introducing FlatDINO, a variational autoencoder that compresses DINOv2 patch features into a 1D sequence of 32 continuous tokens, an 8x reduction in sequence length and a 48x compression in total dimensionality. On ImageNet 256x256, a DiT-XL model trained on these latents achieves a gFID of 1.80 with classifier-free guidance, using 8x fewer FLOPs per forward pass and up to 4.5x fewer FLOPs per training step than diffusion on uncompressed DINOv2 features.
Recent work has shown that diffusion models can generate high-quality images by operating directly on SSL patch features rather than pixel-space latents. However, the dense patch grids from encoders like DINOv2 contain significant redundancy, making diffusion needlessly expensive. We introduce FlatDINO, a variational autoencoder that compresses this representation into a one-dimensional sequence of just 32 continuous tokens, an 8x reduction in sequence length and a 48x compression in total dimensionality. On ImageNet 256x256, a DiT-XL trained on FlatDINO latents achieves a gFID of 1.80 with classifier-free guidance while requiring 8x fewer FLOPs per forward pass and up to 4.5x fewer FLOPs per training step compared to diffusion on uncompressed DINOv2 features. These are preliminary results and this work is in progress.
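The compression figures in the abstract can be cross-checked with simple arithmetic. The sketch below takes only the stated numbers (32 latent tokens, 8x sequence reduction, 48x total compression) and derives the implied source sequence length and per-token channel reduction. The patch feature width of 768 is an assumption (the width of a DINOv2 ViT-B encoder), not something stated in the text, and the function name `compression_summary` is hypothetical.

```python
def compression_summary(latent_tokens=32, seq_reduction=8,
                        total_compression=48, assumed_patch_dim=768):
    """Derive the shapes implied by the stated compression ratios.

    assumed_patch_dim is an ASSUMPTION (DINOv2 ViT-B width); the
    abstract does not state which encoder size is used.
    """
    # 32 latent tokens at an 8x sequence reduction imply 256 source patches.
    source_tokens = latent_tokens * seq_reduction

    # Total compression = sequence reduction * per-token channel reduction.
    if total_compression % seq_reduction != 0:
        raise ValueError("ratios do not factor into an integer channel reduction")
    channel_reduction = total_compression // seq_reduction

    # Under the assumed patch width, the implied latent width per token.
    if assumed_patch_dim % channel_reduction != 0:
        raise ValueError("assumed width is not divisible by the channel reduction")
    latent_dim = assumed_patch_dim // channel_reduction

    return {
        "source_tokens": source_tokens,
        "channel_reduction": channel_reduction,
        "latent_dim": latent_dim,
        "source_size": source_tokens * assumed_patch_dim,
        "latent_size": latent_tokens * latent_dim,
    }


summary = compression_summary()
for name, value in summary.items():
    print(f"{name}: {value}")
```

Under these assumptions the stated ratios are mutually consistent: the 48x total compression factors into the 8x sequence reduction and a 6x reduction in per-token width. The latent width derived here depends entirely on the assumed encoder size and should not be read as a reported value.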