CV · LG · Dec 18, 2024

TRecViT: A Recurrent Video Transformer

DeepMind
arXiv:2412.14294v1 · 4 citations · h-index: 75 · Has Code · Trans. Mach. Learn. Res.
Originality: Highly original
AI Analysis

This work targets efficient video processing for computer vision applications. It offers a causal model with substantial computational savings, though the contribution is incremental in that it builds on existing transformer and recurrent methods.

The paper tackles video modeling by proposing a novel block with time-space-channel factorization: gated linear recurrent units mix information over time, self-attention mixes over space, and MLPs mix over channels. The resulting TRecViT architecture outperforms or matches ViViT-L on large-scale datasets with significantly fewer parameters, less memory, and fewer FLOPs.

We propose a novel block for video modelling. It relies on a time-space-channel factorisation with dedicated blocks for each dimension: gated linear recurrent units (LRUs) perform information mixing over time, self-attention layers perform mixing over space, and MLPs over channels. The resulting architecture, TRecViT, performs well on sparse and dense tasks, trained in supervised or self-supervised regimes. Notably, our model is causal and outperforms or is on par with a pure attention model, ViViT-L, on large-scale video datasets (SSv2, Kinetics400), while having $3\times$ fewer parameters, a $12\times$ smaller memory footprint, and a $5\times$ lower FLOPs count. Code and checkpoints will be made available online at https://github.com/google-deepmind/trecvit.
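The time-space-channel factorisation described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the gate parameterisation (`w_a`, `b_a`), the identity query/key/value projections, the ReLU MLP, the residual connections, and the omission of normalisation are all simplifying assumptions made for illustration, and every function name is hypothetical. The sketch only shows which axis each component mixes, and why recurrence over time makes the block causal.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def gated_linear_recurrence(xs, w_a, b_a):
    """Mix over TIME for one spatial token; xs is a list of T vectors of C floats.

    Assumed gate form (illustrative, not the paper's exact parameterisation):
    a_t = sigmoid(w_a * x_t + b_a), h_t = a_t * h_{t-1} + (1 - a_t) * x_t.
    """
    h = [0.0] * len(xs[0])
    out = []
    for x in xs:  # strictly left to right, so step t never sees later frames
        a = [sigmoid(w * xc + b) for w, xc, b in zip(w_a, x, b_a)]
        h = [ai * hi + (1.0 - ai) * xc for ai, hi, xc in zip(a, h, x)]
        out.append(h)
    return out


def spatial_self_attention(tokens):
    """Mix over SPACE within one frame; single head, identity Q/K/V (assumption)."""
    c = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(c) for k in tokens]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[ch] for e, v in zip(exps, tokens)) for ch in range(c)])
    return out


def channel_mlp(x, w1, w2):
    """Mix over CHANNELS for one token; two linear layers with ReLU (assumption)."""
    hidden = [max(0.0, sum(xi * w1[i][j] for i, xi in enumerate(x))) for j in range(len(w1[0]))]
    return [sum(hj * w2[j][k] for j, hj in enumerate(hidden)) for k in range(len(w2[0]))]


def factorised_block(video, w_a, b_a, w1, w2):
    """video[t][n] is a vector of C floats for frame t and spatial token n."""
    T, N = len(video), len(video[0])
    mixed = [[None] * N for _ in range(T)]
    for n in range(N):  # 1) time: one recurrence per spatial token
        seq = gated_linear_recurrence([video[t][n] for t in range(T)], w_a, b_a)
        for t in range(T):
            mixed[t][n] = add(video[t][n], seq[t])  # residual (assumption)
    for t in range(T):  # 2) space: attention inside each frame only
        attended = spatial_self_attention(mixed[t])
        mixed[t] = [add(x, a) for x, a in zip(mixed[t], attended)]
    # 3) channels: the same MLP applied to every token independently
    return [[add(x, channel_mlp(x, w1, w2)) for x in frame] for frame in mixed]


def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
T, N, C, H = 4, 3, 2, 4  # frames, tokens per frame, channels, MLP hidden width
video = [rand_matrix(N, C) for _ in range(T)]
w_a, b_a = [1.0] * C, [0.0] * C
w1, w2 = rand_matrix(C, H), rand_matrix(H, C)
out = factorised_block(video, w_a, b_a, w1, w2)

# Causality check: perturb only the last frame and recompute.
edited = [[list(tok) for tok in frame] for frame in video]
edited[-1][0][0] += 1.0
out_edited = factorised_block(edited, w_a, b_a, w1, w2)
print(out[:-1] == out_edited[:-1])
```

Because attention is confined to a single frame and the recurrence only carries state forward, perturbing the final frame leaves the outputs for all earlier frames unchanged, which is the sense in which such a block is causal. The recurrence also keeps a fixed-size state per token rather than attending over all frames, which is one plausible intuition for the reported memory savings.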

Code Implementations: 1 repo
