LG CV | Feb 4, 2025

BRIDLE: Generalized Self-supervised Learning with Quantization

Hoang M. Nguyen, Satya N. Shukla, Qiang Zhang, Hanchao Yu, Sreya D. Roy, Taipeng Tian, Lingjiong Zhu, Yuchen Liu

arXiv:2502.02118v17.11 citations | h-index: 12 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses inefficiencies in self-supervised learning by improving representation quality across multiple domains, which is relevant to both researchers and practitioners. The advance is incremental because it builds on existing bidirectional training and quantization methods.

The paper tackles the limitations of single-codebook quantization in self-supervised learning by introducing BRIDLE, a framework that combines residual quantization with bidirectional training to improve representation quality across audio, image, and video modalities. It reports state-of-the-art results on audio classification benchmarks and competitive performance on image and video tasks.

Self-supervised learning has been a powerful approach for learning meaningful representations from unlabeled data across various domains, reducing the reliance on large labeled datasets. Inspired by BERT's success in capturing deep bidirectional contexts in natural language processing, similar frameworks have been adapted to other modalities such as audio, with models like BEATs extending the bidirectional training paradigm to audio signals using vector quantization (VQ). However, these frameworks face challenges, notably their dependence on a single codebook for quantization, which may not capture the complex, multifaceted nature of signals. In addition, inefficiencies in codebook utilization lead to underutilized code vectors. To address these limitations, we introduce BRIDLE (Bidirectional Residual Quantization Interleaved Discrete Learning Encoder), a self-supervised encoder pretraining framework that incorporates residual quantization (RQ) into the bidirectional training process, and is generalized for pretraining with audio, image, and video. Using multiple hierarchical codebooks, RQ enables fine-grained discretization in the latent space, enhancing representation quality. BRIDLE involves an interleaved training procedure between the encoder and tokenizer. We evaluate BRIDLE on audio understanding tasks using classification benchmarks, achieving state-of-the-art results, and demonstrate competitive performance on image classification and video classification tasks, showing consistent improvements over traditional VQ methods in downstream performance.
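
The residual quantization idea described in the abstract can be illustrated with a minimal NumPy sketch. This is a generic illustration of RQ, not BRIDLE's tokenizer: the function names `residual_quantize` and `codebook_usage` are hypothetical, the codebooks here are random rather than learned, and the sizes (3 levels, 16 codes, 8 dimensions) are arbitrary assumptions. Each level quantizes whatever the previous levels left unexplained, and the usage helper measures the fraction of code vectors that are actually selected, which is the utilization issue the abstract mentions.

```python
import numpy as np


def residual_quantize(x, codebooks):
    """Quantize each row of x with a stack of codebooks, coarse to fine.

    x: (n, d) array of latent vectors.
    codebooks: list of (k, d) arrays, one per quantization level.
    Returns (codes, x_hat): codes is (n, levels) ints, x_hat is (n, d).
    """
    residual = np.asarray(x, dtype=np.float64).copy()
    x_hat = np.zeros_like(residual)
    codes = []
    for cb in codebooks:
        # Squared Euclidean distance from every residual to every code vector.
        d2 = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = d2.argmin(axis=1)
        chosen = cb[idx]
        x_hat += chosen      # accumulate the reconstruction
        residual -= chosen   # the next level quantizes what is left over
        codes.append(idx)
    return np.stack(codes, axis=1), x_hat


def codebook_usage(codes_level, k):
    """Fraction of the k code vectors selected at least once."""
    return np.unique(codes_level).size / k


# Illustrative data only: random latents and random (untrained) codebooks.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 8))
codebooks = [rng.normal(scale=0.5 ** level, size=(16, 8)) for level in range(3)]

codes, x_hat = residual_quantize(x, codebooks)
usage = [codebook_usage(codes[:, i], 16) for i in range(len(codebooks))]
```

With a single codebook, each latent vector is represented by one index; with the stack above, it is represented by one index per level, and the reconstruction is the sum of the selected code vectors. In a trained system the codebooks would be learned jointly with or alternately to the encoder, which this sketch does not attempt.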
