CV · AI · Jan 26, 2025

GaussianToken: An Effective Image Tokenizer with 2D Gaussian Splatting

arXiv:2501.15619v1 · 9 citations · h-index: 22 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the problem of image tokenization for multi-modal tasks, offering an incremental improvement over existing vector quantization methods.

The paper tackles the limited representational ability of discrete image tokenizers by proposing GaussianToken, which uses 2D Gaussian splatting to enhance representation, achieving competitive reconstruction performance on datasets like CIFAR, Mini-ImageNet, and ImageNet-1K.

Effective image tokenization is crucial for both multi-modal understanding and generation tasks due to the need for alignment with discrete text data. To this end, existing approaches utilize vector quantization (VQ) to project pixels onto a discrete codebook and reconstruct images from the discrete representation. However, compared with the continuous latent space, the limited discrete codebook space significantly restricts the representational ability of these image tokenizers. In this paper, we propose GaussianToken, an effective image tokenizer with 2D Gaussian splatting, as a solution. We first represent the encoded samples as multiple flexible featured 2D Gaussians characterized by positions, rotation angles, scaling factors, and feature coefficients. We adopt the standard quantization for the Gaussian features and then concatenate the quantization results with the other intrinsic Gaussian parameters before the corresponding splatting operation and the subsequent decoding module. In general, GaussianToken integrates the local influence of 2D Gaussian distribution into the discrete space and thus enhances the representation capability of the image tokenizer. Competitive reconstruction performance on CIFAR, Mini-ImageNet, and ImageNet-1K demonstrates the effectiveness of our framework. Our code is available at: https://github.com/ChrisDong-THU/GaussianToken.
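
The pipeline described in the abstract (quantize the Gaussian features, concatenate them with the remaining Gaussian parameters, then splat) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `quantize` and `splat`, the nearest-neighbour codebook lookup, the normalized [0, 1] coordinate grid, the simple additive blending, and all tensor sizes are assumptions chosen for illustration. The abstract does not specify how the splatting step consumes the concatenated vector, so the sketch simply uses the geometry to weight the quantized features.

```python
import numpy as np


def quantize(features, codebook):
    """Nearest-neighbour VQ (assumed scheme): snap each feature to its closest code."""
    # Squared Euclidean distance between every feature (N, C) and every code (K, C).
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = d2.argmin(axis=1)
    return codebook[indices], indices


def splat(positions, angles, scales, features, height, width):
    """Additively render N featured 2D Gaussians onto a (height, width, C) map."""
    ys, xs = np.meshgrid(
        np.linspace(0.0, 1.0, height), np.linspace(0.0, 1.0, width), indexing="ij"
    )
    grid = np.stack([xs, ys], axis=-1)  # (H, W, 2) pixel centres in [0, 1]
    out = np.zeros((height, width, features.shape[1]))
    for mu, theta, s, f in zip(positions, angles, scales, features):
        c, sn = np.cos(theta), np.sin(theta)
        rot = np.array([[c, -sn], [sn, c]])
        # Covariance built from rotation and per-axis scale: R diag(s^2) R^T.
        cov = rot @ np.diag(s ** 2) @ rot.T
        diff = grid - mu
        # Squared Mahalanobis distance of every pixel to the Gaussian centre.
        m = np.einsum("hwi,ij,hwj->hw", diff, np.linalg.inv(cov), diff)
        out += np.exp(-0.5 * m)[..., None] * f
    return out


# Hypothetical sizes: 8 Gaussians, 4 feature channels, 16 codebook entries.
rng = np.random.default_rng(0)
n, c, k = 8, 4, 16
positions = rng.uniform(0.0, 1.0, (n, 2))
angles = rng.uniform(0.0, np.pi, n)
scales = rng.uniform(0.05, 0.2, (n, 2))
features = rng.normal(size=(n, c))
codebook = rng.normal(size=(k, c))

quantized, indices = quantize(features, codebook)
# Discrete features concatenated with the continuous Gaussian parameters.
tokens = np.concatenate([quantized, positions, angles[:, None], scales], axis=1)
feature_map = splat(positions, angles, scales, quantized, 16, 16)
```

In this sketch only the feature coefficients pass through the codebook, while positions, rotation angles, and scales stay continuous. That split is what lets each token keep a local, spatially varying influence on the feature map passed to the decoder.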

Code Implementations: 1 repo