CV | LG | Apr 2

Investigating Permutation-Invariant Discrete Representation Learning for Spatially Aligned Images

arXiv:2604.01843 | 13.3 | h-index: 8
AI Analysis

This work addresses the need for more interpretable and flexible discrete representations in image generation, though its exploration of trade-offs such as the separability of the latent codes is incremental.

The paper addresses the problem of position-dependent discrete image representations by proposing a permutation-invariant vector-quantized autoencoder (PI-VQ) whose codes carry no positional information. This enables direct interpolation between images and yields competitive synthesis metrics on CelebA, CelebA-HQ and FFHQ.

Vector quantization approaches (VQ-VAE, VQ-GAN) learn discrete neural representations of images, but these representations are inherently position-dependent: codes are spatially arranged and contextually entangled, requiring autoregressive or diffusion-based priors to model their dependencies at sample time. In this work, we ask whether positional information is necessary for discrete representations of spatially aligned data. We propose the permutation-invariant vector-quantized autoencoder (PI-VQ), in which latent codes are constrained to carry no positional information. We find that this constraint encourages codes to capture global, semantic features, and enables direct interpolation between images without a learned prior. To address the reduced information capacity of permutation-invariant representations, we introduce matching quantization, a vector quantization algorithm based on optimal bipartite matching that increases effective bottleneck capacity by $3.5\times$ relative to naive nearest-neighbour quantization. The compositional structure of the learned codes further enables interpolation-based sampling, allowing synthesis of novel images in a single forward pass. We evaluate PI-VQ on CelebA, CelebA-HQ and FFHQ, obtaining competitive precision, density and coverage metrics for images synthesised with our approach. We discuss the trade-offs inherent to position-free representations, including separability and interpretability of the latent codes, pointing to numerous directions for future work.
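The abstract does not spell out how matching quantization works, so the following is only a minimal sketch of one plausible reading. It assumes the latent vectors of an image form an unordered set, that each vector must receive a distinct codebook entry, and that the assignment minimises the total squared Euclidean distance. The function names (`nearest_neighbour_quantize`, `matching_quantize`), the toy codebook and the choice of cost are illustrative assumptions, not the authors' implementation. The optimal bipartite matching is solved with SciPy's `linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def pairwise_sq_dists(z, codebook):
    """Squared Euclidean distance between every latent and every code: shape (N, K)."""
    diff = z[:, None, :] - codebook[None, :, :]
    return (diff ** 2).sum(axis=-1)


def nearest_neighbour_quantize(z, codebook):
    """Naive VQ: each latent independently picks its closest code (duplicates allowed)."""
    return pairwise_sq_dists(z, codebook).argmin(axis=1)


def matching_quantize(z, codebook):
    """Hypothetical matching quantization: assign each latent a distinct code
    by solving a minimum-cost bipartite matching (assumes N <= K)."""
    if len(z) > len(codebook):
        raise ValueError("need at least as many codes as latent vectors")
    cost = pairwise_sq_dists(z, codebook)
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one assignment
    indices = np.empty(len(z), dtype=int)
    indices[rows] = cols
    return indices


# Toy example: three latents that all sit closest to code 0.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
z = np.array([[0.1, 0.0], [0.2, 0.1], [0.0, 0.3]])

nn_idx = nearest_neighbour_quantize(z, codebook)
match_idx = matching_quantize(z, codebook)
print("nearest neighbour:", nn_idx)   # [0 0 0]
print("matching:         ", match_idx)  # [0 1 2]
```

In this toy case nearest-neighbour quantization collapses all three latents onto the same code, so the unordered set of codes carries little information. The matching variant spreads them over three distinct codes, which is one way a set-valued bottleneck could gain effective capacity. The $3.5\times$ figure reported in the abstract comes from the paper and is not something this sketch reproduces.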

