CV, IT, LG, IV · Jun 11, 2024

Image and Video Tokenization with Binary Spherical Quantization

arXiv:2406.07548v1 · 87 citations
AI Analysis

This work addresses the need for efficient, scalable compression and synthesis of visual data, offering incremental improvements in tokenization for machine learning applications.

The paper tackles efficient image and video tokenization by proposing a transformer-based tokenizer with Binary Spherical Quantization (BSQ). The tokenizer compresses visual data by up to 100× with minimal distortion and achieves state-of-the-art reconstruction quality at 2.4× the throughput of the best prior methods.

We propose a new transformer-based image and video tokenizer with Binary Spherical Quantization (BSQ). BSQ projects the high-dimensional visual embedding to a lower-dimensional hypersphere and then applies binary quantization. BSQ is (1) parameter-efficient without an explicit codebook, (2) scalable to arbitrary token dimensions, and (3) compact: compressing visual data by up to 100× with minimal distortion. Our tokenizer uses a transformer encoder and decoder with simple block-wise causal masking to support variable-length videos as input. The resulting BSQ-ViT achieves state-of-the-art visual reconstruction quality on image and video reconstruction benchmarks with 2.4× the throughput of the best prior methods. Furthermore, by learning an autoregressive prior for adaptive arithmetic coding, BSQ-ViT achieves video compression results comparable to state-of-the-art video compression standards. BSQ-ViT also enables masked language models to achieve image synthesis quality competitive with GAN- and diffusion-based methods.
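
The two mechanisms named in the abstract, quantization on a hypersphere and block-wise causal masking, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `proj` matrix (standing in for a learned linear projection), the bit-packing scheme, and the rule that a token may attend to its own frame and all earlier frames are assumptions made for illustration only.

```python
import numpy as np


def bsq_quantize(z, proj):
    """Sketch of binary quantization on a unit hypersphere.

    z:    (n, d) array of high-dimensional embeddings.
    proj: (d, L) projection matrix (hypothetical stand-in for a learned layer).
    Assumes L <= 62 so the packed token ids fit in int64, and that no
    projected vector is exactly zero.
    """
    L = proj.shape[1]
    v = z @ proj                                        # project to L dimensions
    u = v / np.linalg.norm(v, axis=-1, keepdims=True)   # place on the unit hypersphere
    bits = (u > 0).astype(np.int64)                     # one bit per dimension
    u_hat = (2 * bits - 1) / np.sqrt(L)                 # entries are +-1/sqrt(L), so unit norm
    token_ids = bits @ (1 << np.arange(L))              # pack the L bits into one integer
    return u_hat, bits, token_ids


def blockwise_causal_mask(num_frames, tokens_per_frame):
    """Boolean attention mask: entry (i, j) is True when token i may attend to token j.

    Assumed rule: a token sees every token in its own frame and in earlier frames.
    """
    frame_idx = np.repeat(np.arange(num_frames), tokens_per_frame)
    return frame_idx[:, None] >= frame_idx[None, :]
```

Because each of the L dimensions contributes a single bit, the set of possible codes has size 2^L without any stored codebook, which is one way to read the abstract's "parameter-efficient without an explicit codebook". The mask depends only on frame indices, so the same construction applies to any number of frames.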

Code Implementations: 2 repos