CV · Sep 9, 2024

SGC-VQGAN: Towards Complex Scene Representation via Semantic Guided Clustering Codebook

Chenjing Ding, Chiyu Wang, Boshi Liu, Xi Guo, Weixuan Tang, Wei Wu

arXiv:2409.06105v13.71 citations · h-index: 5

Originality: Highly original

AI Analysis

This work addresses a bottleneck in self-supervised learning for computer vision by improving token semantics and codebook utilization, offering a simple, parameter-free method with potential for broad application in downstream tasks.

The paper tackles the lack of semantics and the inefficient codebook utilization of vector quantization tokenizers for image representation. It introduces SGC-VQGAN, which combines semantic online clustering to enhance token semantics with a pyramid feature learning pipeline, and reports state-of-the-art performance in reconstruction quality and downstream tasks.

Vector quantization (VQ) is a method for deterministically learning features through discrete codebook representations. Recent works have utilized visual tokenizers to discretize visual regions for self-supervised representation learning. However, a notable limitation of these tokenizers is a lack of semantics, as they are derived solely from the pretext task of reconstructing raw image pixels in an auto-encoder paradigm. Additionally, imbalanced codebook distribution and codebook collapse lead to inefficient codebook utilization, which can adversely impact performance. To address these challenges, we introduce SGC-VQGAN, which uses a Semantic Online Clustering method to enhance token semantics through Consistent Semantic Learning. Utilizing inference results from a segmentation model, our approach constructs a temporospatially consistent semantic codebook, addressing the issues of codebook collapse and imbalanced token semantics. Our proposed Pyramid Feature Learning pipeline integrates multi-level features to capture both image details and semantics simultaneously. As a result, SGC-VQGAN achieves SOTA performance in both reconstruction quality and various downstream tasks. Because it requires no additional parameter learning, it can be applied directly in downstream tasks, presenting significant potential.
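To make the codebook utilization issue concrete, here is a minimal sketch of plain vector quantization: each feature is mapped to its nearest codebook entry, and usage is then summarized by the fraction of entries used and the perplexity of the usage distribution. This is generic VQ, not the SGC-VQGAN method. The function names `quantize` and `codebook_usage` and the toy data are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry.

    Distance is squared Euclidean; ties resolve to the lowest index.
    """
    indices = []
    for f in features:
        dists = [sum((a - b) ** 2 for a, b in zip(f, c)) for c in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def codebook_usage(indices, codebook_size):
    """Return (fraction of entries used, perplexity of the usage distribution).

    A perplexity far below codebook_size indicates imbalanced usage;
    a very small used fraction is a symptom of codebook collapse.
    """
    counts = Counter(indices)
    total = len(indices)
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return len(counts) / codebook_size, math.exp(entropy)


# Toy data (illustrative only): four 2-D codes, but all features
# lie near the first two, so half of the codebook is never selected.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [9.0, 9.0]]
features = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.1], [1.1, 0.8]]

indices = quantize(features, codebook)
used, perplexity = codebook_usage(indices, len(codebook))
print(indices, used, perplexity)
```

In this toy case only two of the four entries are ever selected, which is the kind of under-utilization the abstract describes. A tokenizer with balanced usage would have a perplexity close to the codebook size.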
