SGC-VQGAN: Towards Complex Scene Representation via Semantic Guided Clustering Codebook
This work addresses a bottleneck in self-supervised learning for computer vision by improving token semantics and codebook utilization, offering a simple, parameter-free method with potential for broad application in downstream tasks.
The paper tackles the lack of semantics and the inefficient codebook utilization of vector quantization tokenizers for image representation by introducing SGC-VQGAN, which combines semantic online clustering, to enhance token semantics, with a pyramid feature learning pipeline, and which reports state-of-the-art performance in reconstruction quality and downstream tasks.
Vector quantization (VQ) is a method for deterministically learning features through discrete codebook representations. Recent works have utilized visual tokenizers to discretize visual regions for self-supervised representation learning. However, a notable limitation of these tokenizers is a lack of semantics, as they are derived solely from the pretext task of reconstructing raw image pixels in an auto-encoder paradigm. Additionally, imbalanced codebook distribution and codebook collapse lead to inefficient codebook utilization, which can adversely impact performance. To address these challenges, we introduce SGC-VQGAN, which uses a Semantic Online Clustering method to enhance token semantics through Consistent Semantic Learning. Using inference results from a segmentation model, our approach constructs a temporospatially consistent semantic codebook, addressing the issues of codebook collapse and imbalanced token semantics. Our proposed Pyramid Feature Learning pipeline integrates multi-level features to capture image details and semantics simultaneously. As a result, SGC-VQGAN achieves state-of-the-art (SOTA) performance in both reconstruction quality and various downstream tasks. Because it is simple and requires no additional parameter learning, it can be applied directly to downstream tasks, which gives it significant potential.
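
To make the notions of a discrete codebook and codebook utilization concrete, the following is a minimal sketch of generic nearest-neighbor vector quantization, not the SGC-VQGAN method itself. The function names `quantize` and `codebook_usage`, the array sizes, and the use of perplexity as a utilization measure are assumptions chosen for illustration.

```python
import numpy as np


def quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry (squared L2 distance)."""
    # Pairwise squared distances, shape (num_features, num_codes).
    diffs = features[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    indices = np.argmin(dists, axis=1)
    return indices, codebook[indices]


def codebook_usage(indices, num_codes):
    """Return (fraction of codes used, perplexity of the code distribution)."""
    counts = np.bincount(indices, minlength=num_codes)
    probs = counts / counts.sum()
    nonzero = probs[probs > 0]
    # Perplexity is 1 when every feature maps to the same code (collapse)
    # and equals num_codes when all codes are used equally often.
    perplexity = float(np.exp(-np.sum(nonzero * np.log(nonzero))))
    return float(np.mean(counts > 0)), perplexity


# Hypothetical toy data: 32 feature vectors of dimension 4, 8 codebook entries.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))
features = rng.normal(size=(32, 4))

indices, quantized = quantize(features, codebook)
used_fraction, perplexity = codebook_usage(indices, num_codes=8)
```

In this sketch, a collapsed codebook shows up as a small used fraction and a perplexity close to 1, while a balanced codebook has a used fraction of 1 and a perplexity close to the number of codes.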