InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression
This work addresses the bottleneck of fixed-rate tokenization in video processing and offers a more efficient approach to video representation, though its contribution is an incremental improvement over existing methods.
The paper tackles inefficient video tokenization by introducing InfoTok, an adaptive tokenizer grounded in information theory, which saves 20% of tokens without performance loss and achieves a 2.3x compression rate while outperforming prior heuristic adaptive methods.
Accurate and efficient discrete video tokenization is essential for processing long video sequences. Yet the inherent complexity and variable information density of videos present a significant bottleneck for current tokenizers, which rigidly compress all content at a fixed rate, leading to redundancy in simple content or information loss in complex content. Drawing inspiration from Shannon's information theory, this paper introduces InfoTok, a principled framework for adaptive video tokenization. We rigorously prove that existing data-agnostic training methods are suboptimal in representation length, and present a novel evidence lower bound (ELBO)-based algorithm that approaches theoretical optimality. Leveraging this framework, we develop a transformer-based adaptive compressor that enables adaptive tokenization. Empirical results demonstrate state-of-the-art compression performance: InfoTok saves 20% of tokens without affecting performance, and achieves a 2.3x compression rate while still outperforming prior heuristic adaptive approaches. By allocating tokens according to informational richness, InfoTok enables a more compressed yet accurate tokenization for video representation, offering valuable insights for future research.
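The idea of allocating tokens according to informational richness can be illustrated with a toy sketch. This is not InfoTok's ELBO-based algorithm or its transformer compressor, neither of which is specified in the abstract. It assumes a much simpler stand-in: each clip is a list of quantized symbols, its information content is estimated by empirical Shannon entropy, and a fixed token budget is split in proportion to that estimate. The names `empirical_entropy_bits`, `allocate_tokens`, and `min_tokens` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def empirical_entropy_bits(symbols):
    """Shannon entropy, in bits per symbol, of the empirical distribution."""
    counts = Counter(symbols)
    n = len(symbols)
    # H = sum_p p * log2(1/p), with p = c / n for each distinct symbol
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def allocate_tokens(info_bits, total_budget, min_tokens=1):
    """Split total_budget tokens across clips in proportion to info_bits.

    Every clip receives at least min_tokens (an assumption of this sketch);
    the remainder is shared proportionally, with largest-remainder rounding
    so that the allocations sum exactly to total_budget.
    """
    n = len(info_bits)
    spare = total_budget - min_tokens * n
    if spare < 0:
        raise ValueError("budget too small for the per-clip minimum")
    total_info = sum(info_bits)
    if total_info == 0:
        shares = [spare / n] * n  # no signal: fall back to a uniform split
    else:
        shares = [spare * b / total_info for b in info_bits]
    alloc = [min_tokens + math.floor(s) for s in shares]
    leftover = total_budget - sum(alloc)
    # Hand the remaining tokens to the clips with the largest fractional parts
    order = sorted(range(n), key=lambda i: shares[i] - math.floor(shares[i]),
                   reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


# Toy "clips" made of quantized values, from static to busy
static_clip = [0] * 64            # one symbol only: 0 bits per symbol
simple_clip = [0, 1] * 32         # two equally likely symbols: 1 bit per symbol
busy_clip = list(range(16)) * 4   # sixteen equally likely symbols: 4 bits per symbol

clips = (static_clip, simple_clip, busy_clip)
info = [empirical_entropy_bits(c) * len(c) for c in clips]
budget = allocate_tokens(info, total_budget=48, min_tokens=2)
print(budget)  # [2, 10, 36]
```

A fixed-rate tokenizer would give each of the three clips 16 tokens. Under the assumptions of this sketch, the static clip falls to the minimum of 2 tokens while the busy clip receives 36, which mirrors the redundancy-versus-information-loss trade-off the abstract describes.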