CVAIDec 11, 2024

SweetTok: Semantic-Aware Spatial-Temporal Tokenizer for Compact Video Discretization

arXiv:2412.10443v37 citationsh-index: 4
Originality Highly original
AI Analysis

This addresses the need for efficient video tokenization in AI applications, offering a novel method with significant performance gains, though it is incremental in advancing existing tokenization techniques.

The paper tackled the problem of compact video discretization by introducing SweetTok, a semantic-aware spatial-temporal tokenizer that decouples spatial and temporal queries, achieving a 42.8% improvement in video reconstruction on UCF-101 and a 15.1% boost in downstream video generation.

This paper presents the \textbf{S}emantic-a\textbf{W}ar\textbf{E} spatial-t\textbf{E}mporal \textbf{T}okenizer (SweetTok), a novel video tokenizer to overcome the limitations in current video tokenization methods for compacted yet effective discretization. Unlike previous approaches that process flattened local visual patches via direct discretization or adaptive query tokenization, SweetTok proposes a decoupling framework, compressing visual inputs through distinct spatial and temporal queries via \textbf{D}ecoupled \textbf{Q}uery \textbf{A}uto\textbf{E}ncoder (DQAE). This design allows SweetTok to efficiently compress video token count while achieving superior fidelity by capturing essential information across spatial and temporal dimensions. Furthermore, we design a \textbf{M}otion-enhanced \textbf{L}anguage \textbf{C}odebook (MLC) tailored for spatial and temporal compression to address the differences in semantic representation between appearance and motion information. SweetTok significantly improves video reconstruction results by \textbf{42.8\%} w.r.t rFVD on UCF-101 dataset. With a better token compression strategy, it also boosts downstream video generation results by \textbf{15.1\%} w.r.t gFVD. Additionally, the compressed decoupled tokens are imbued with semantic information, enabling few-shot recognition capabilities powered by LLMs in downstream applications.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes