CV · AI · LG · Nov 22, 2024

Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction

arXiv:2411.14762v4 · 4 citations · h-index: 18 · CVPR
Originality: Incremental advance
AI Analysis

This work addresses the high cost of training video tokenizers on long videos, enabling more memory-efficient training for long video processing. The advance is incremental, since it builds on existing techniques from 3D generative models.

The paper introduces CoordTok, a video tokenizer that uses coordinate-based patch reconstruction to reduce token counts. For a 128-frame video at 128x128 resolution, CoordTok needs 1280 tokens, where baselines need 6144 or 8192 tokens to reach similar reconstruction quality.
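To make the size of that reduction concrete, the short sketch below computes the ratio between the baseline and CoordTok token counts quoted above. The function name `reduction_factor` is only an illustrative helper, not something from the paper.

```python
# Token counts quoted in the summary (128 frames at 128x128 resolution).
coordtok_tokens = 1280
baseline_tokens = [6144, 8192]


def reduction_factor(baseline: int, ours: int) -> float:
    """Return how many times fewer tokens `ours` uses than `baseline`."""
    return baseline / ours


# Hypothetical helper applied to the two baseline counts.
factors = [reduction_factor(b, coordtok_tokens) for b in baseline_tokens]
print(factors)  # [4.8, 6.4]
```

In other words, the quoted figures correspond to roughly 4.8x and 6.4x fewer tokens at similar reconstruction quality.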

Efficient tokenization of videos remains a challenge in training vision models that can process long videos. One promising direction is to develop a tokenizer that can encode long video clips, as it would enable the tokenizer to leverage the temporal coherence of videos better for tokenization. However, training existing tokenizers on long videos often incurs a huge training cost as they are trained to reconstruct all the frames at once. In this paper, we introduce CoordTok, a video tokenizer that learns a mapping from coordinate-based representations to the corresponding patches of input videos, inspired by recent advances in 3D generative models. In particular, CoordTok encodes a video into factorized triplane representations and reconstructs patches that correspond to randomly sampled $(x,y,t)$ coordinates. This allows for training large tokenizer models directly on long videos without requiring excessive training resources. Our experiments show that CoordTok can drastically reduce the number of tokens for encoding long video clips. For instance, CoordTok can encode a 128-frame video with 128$\times$128 resolution into 1280 tokens, while baselines need 6144 or 8192 tokens to achieve similar reconstruction quality. We further show that this efficient video tokenization enables memory-efficient training of a diffusion transformer that can generate 128 frames at once.
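The coordinate-based lookup described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the plane sizes (16x16, 16x32, 16x32), the channel width, bilinear interpolation, and concatenation of the three plane features are all assumptions chosen for the example (the sizes are picked only so that the total matches the 1280 tokens quoted above). The learned encoder and patch decoder are omitted entirely; random values stand in for encoder output.

```python
import math
import random


def make_plane(rows, cols, channels, rng):
    """Random stand-in for one encoder-produced feature plane (rows x cols x channels)."""
    return [[[rng.gauss(0.0, 1.0) for _ in range(channels)]
             for _ in range(cols)] for _ in range(rows)]


def bilinear(plane, u, v):
    """Bilinearly sample a plane at normalized coordinates (u, v), each in [0, 1]."""
    rows, cols = len(plane), len(plane[0])
    r, c = u * (rows - 1), v * (cols - 1)
    r0, c0 = int(math.floor(r)), int(math.floor(c))
    r1, c1 = min(r0 + 1, rows - 1), min(c0 + 1, cols - 1)
    fr, fc = r - r0, c - c0
    return [
        (1 - fr) * ((1 - fc) * p00 + fc * p01) + fr * ((1 - fc) * p10 + fc * p11)
        for p00, p01, p10, p11 in zip(
            plane[r0][c0], plane[r0][c1], plane[r1][c0], plane[r1][c1])
    ]


def query_triplane(z_xy, z_xt, z_yt, x, y, t):
    """Concatenate the features sampled from the three planes at one (x, y, t)."""
    # `+` on Python lists is concatenation, so the result has 3 * channels entries.
    return bilinear(z_xy, x, y) + bilinear(z_xt, x, t) + bilinear(z_yt, y, t)


rng = random.Random(0)
channels = 8  # assumed channel width

# Assumed plane sizes; each plane cell counts as one token in this sketch.
z_xy = make_plane(16, 16, channels, rng)
z_xt = make_plane(16, 32, channels, rng)
z_yt = make_plane(16, 32, channels, rng)
num_tokens = 16 * 16 + 16 * 32 + 16 * 32

# Randomly sampled (x, y, t) coordinates: only these locations are queried,
# so a training step would never need to decode every frame at once.
coords = [(rng.random(), rng.random(), rng.random()) for _ in range(4)]
features = [query_triplane(z_xy, z_xt, z_yt, x, y, t) for x, y, t in coords]

print(num_tokens, len(features), len(features[0]))  # 1280 4 24
```

In a full tokenizer, each sampled feature vector would be passed to a learned decoder that predicts the corresponding patch, and the reconstruction loss would be computed only on those sampled patches. That is what keeps the training cost independent of reconstructing the whole clip.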

