CVSep 1, 2025

GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation

arXiv:2509.01109v23 citationsh-index: 8Has Code
Originality Highly original
AI Analysis

This work addresses the limitation of conventional grid-based tokenization for image processing, offering a more adaptive approach that could benefit computer vision tasks like image generation and reconstruction.

The paper tackles the problem of inflexible image tokenization in representation and generation by proposing GPSToken, a framework that uses parametric 2D Gaussians for non-uniform tokenization, achieving state-of-the-art performance with rFID and FID scores of 0.65 and 1.50 using 128 tokens.

Effective and efficient tokenization plays an important role in image representation and generation. Conventional methods, constrained by uniform 2D/1D grid tokenization, are inflexible to represent regions with varying shapes and textures and at different locations, limiting their efficacy of feature representation. In this work, we propose $\textbf{GPSToken}$, a novel $\textbf{G}$aussian $\textbf{P}$arameterized $\textbf{S}$patially-adaptive $\textbf{Token}$ization framework, to achieve non-uniform image tokenization by leveraging parametric 2D Gaussians to dynamically model the shape, position, and textures of different image regions. We first employ an entropy-driven algorithm to partition the image into texture-homogeneous regions of variable sizes. Then, we parameterize each region as a 2D Gaussian (mean for position, covariance for shape) coupled with texture features. A specialized transformer is trained to optimize the Gaussian parameters, enabling continuous adaptation of position/shape and content-aware feature extraction. During decoding, Gaussian parameterized tokens are reconstructed into 2D feature maps through a differentiable splatting-based renderer, bridging our adaptive tokenization with standard decoders for end-to-end training. GPSToken disentangles spatial layout (Gaussian parameters) from texture features to enable efficient two-stage generation: structural layout synthesis using lightweight networks, followed by structure-conditioned texture generation. Experiments demonstrate the state-of-the-art performance of GPSToken, which achieves rFID and FID scores of 0.65 and 1.50 on image reconstruction and generation tasks using 128 tokens, respectively. Codes and models of GPSToken can be found at $\href{https://github.com/xtudbxk/GPSToken}{https://github.com/xtudbxk/GPSToken}$.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes