BRICS: Bi-level feature Representation of Image CollectionS
This work addresses the need for compact, efficient image representations for generative modeling. It is incremental in that it builds on existing autoencoder and diffusion-model frameworks.
The authors introduce BRICS, a bi-level feature representation for image collections that combines key codes with feature grids. It achieves reconstruction quality comparable to Vector Quantization (VQ) with a decoder that uses 50% fewer GFLOPs, and state-of-the-art image synthesis on the FFHQ and LSUN-Church datasets, including a CLIP-FID 29% lower than LDM.
We present BRICS, a bi-level feature representation for image collections, which consists of a key code space on top of a feature grid space. Specifically, our representation is learned by an autoencoder that encodes images into continuous key codes, which are then used to retrieve features from groups of multi-resolution feature grids. The key codes and feature grids are trained jointly and continuously with well-defined gradient flows, leading to high usage rates of the feature grids and improved generative modeling compared to discrete Vector Quantization (VQ). Unlike existing continuous representations such as KL-regularized latent codes, our key codes are strictly bounded in scale and variance. Overall, feature encoding by BRICS is compact, efficient to train, and enables generative modeling over key codes with a diffusion model. Experimental results show that our method achieves reconstruction results comparable to VQ while having a smaller and more efficient decoder network (50% fewer GFLOPs). By applying a diffusion model over our key code space, we achieve state-of-the-art image synthesis performance on the FFHQ and LSUN-Church datasets, with CLIP-FID 29% lower than LDM, 32% lower than StyleGAN2, and 44% lower than Projected GAN.
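To make the bi-level idea concrete, the following is a minimal NumPy sketch of how bounded continuous key codes might retrieve features from groups of multi-resolution feature grids. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the grid dimensionality, the bounding function, or the interpolation scheme, so the 1D grids, the sigmoid bound, the linear interpolation, and all names (`bounded_key_codes`, `lookup_1d`, `retrieve_features`) are hypothetical.

```python
import numpy as np

def bounded_key_codes(logits):
    """Squash raw encoder outputs into (0, 1).
    Assumption: a sigmoid stands in for whatever bounding the paper uses."""
    return 1.0 / (1.0 + np.exp(-logits))

def lookup_1d(grid, keys):
    """Linearly interpolate a 1D feature grid of shape (R, C) at keys in [0, 1].
    Returns an array of shape (len(keys), C)."""
    res = grid.shape[0]
    pos = keys * (res - 1)                                # continuous grid coordinate
    lo = np.clip(np.floor(pos).astype(int), 0, res - 2)   # left neighbor index
    w = (pos - lo)[..., None]                             # interpolation weight in [0, 1]
    return (1.0 - w) * grid[lo] + w * grid[lo + 1]

def retrieve_features(key_codes, grid_groups):
    """key_codes: (N, G), one scalar key per group (assumption).
    grid_groups: list of G lists, each holding grids at several resolutions.
    Returns concatenated features of shape (N, G * num_resolutions * C)."""
    feats = []
    for g, grids in enumerate(grid_groups):
        for grid in grids:
            feats.append(lookup_1d(grid, key_codes[:, g]))
    return np.concatenate(feats, axis=-1)

# Toy setup: 4 groups, 3 resolutions per group, 8 channels per grid.
rng = np.random.default_rng(0)
num_groups, channels, resolutions = 4, 8, (4, 16, 64)
grid_groups = [
    [rng.normal(size=(r, channels)) for r in resolutions]
    for _ in range(num_groups)
]

logits = rng.normal(size=(2, num_groups))   # stand-in for encoder outputs of 2 images
keys = bounded_key_codes(logits)
features = retrieve_features(keys, grid_groups)
print(features.shape)  # (2, 96)
```

In this sketch the retrieved feature is a weighted sum of grid entries whose weights depend smoothly on the key, so gradients can reach both the keys and the grids. This is one plausible reading of the "well-defined gradient flows" claim, in contrast to the non-differentiable nearest-neighbor lookup used in VQ. A decoder would consume `features`, and a diffusion model would be trained over `keys`.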