CV | LG | Dec 8, 2021

Exploring Temporal Granularity in Self-Supervised Video Representation Learning

arXiv:2112.04480v1 | 7 citations
Originality: Highly original
AI Analysis

This work addresses self-supervised video representation learning, offering a flexible framework that shows how much temporal granularity different video tasks require.

The paper studies how temporal granularity affects learned video representations using a self-supervised framework called TeG, which achieves state-of-the-art results on 8 video benchmarks and outperforms supervised pre-training in most cases.

This work presents a self-supervised learning framework named TeG to explore Temporal Granularity in learning video representations. In TeG, we sample a long clip from a video and a short clip that lies inside the long clip. We then extract their dense temporal embeddings. The training objective consists of two parts: a fine-grained temporal learning objective to maximize the similarity between corresponding temporal embeddings in the short clip and the long clip, and a persistent temporal learning objective to pull together global embeddings of the two clips. Our study reveals the impact of temporal granularity with three major findings. 1) Different video tasks may require features of different temporal granularities. 2) Intriguingly, some tasks that are widely considered to require temporal awareness can actually be well addressed by temporally persistent features. 3) The flexibility of TeG gives rise to state-of-the-art results on 8 video benchmarks, outperforming supervised pre-training in most cases.
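The two-part objective described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`sample_clips`, `teg_style_loss`), the use of a plain cosine-similarity loss, the mean-pooled global embedding, the `weight` parameter, and the assumption of one embedding per frame are all hypothetical choices made for illustration. The abstract only states that corresponding temporal embeddings are made similar and that the global embeddings of the two clips are pulled together.

```python
import numpy as np


def sample_clips(num_frames, long_len, short_len, rng):
    """Pick a long clip, then a short clip that lies fully inside it.

    Returns (long_start, short_start, offset), where offset is the position
    of the short clip relative to the start of the long clip.
    """
    long_start = int(rng.integers(0, num_frames - long_len + 1))
    offset = int(rng.integers(0, long_len - short_len + 1))
    return long_start, long_start + offset, offset


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def teg_style_loss(long_emb, short_emb, offset, weight=1.0):
    """Illustrative two-part loss (assumed form, not the paper's exact loss).

    long_emb:  (T_long, D) dense temporal embeddings of the long clip
    short_emb: (T_short, D) dense temporal embeddings of the short clip
    offset:    index in long_emb where the short clip begins
               (assumes one embedding per frame)
    """
    t_short = short_emb.shape[0]
    aligned = long_emb[offset:offset + t_short]

    # Fine-grained term: 1 - mean cosine similarity between corresponding
    # timesteps of the short clip and the overlapping part of the long clip.
    cos_per_step = np.sum(l2_normalize(aligned) * l2_normalize(short_emb), axis=-1)
    fine = 1.0 - float(np.mean(cos_per_step))

    # Persistent term: 1 - cosine similarity between global embeddings,
    # here assumed to be the temporal mean of each clip's embeddings.
    g_long = l2_normalize(long_emb.mean(axis=0))
    g_short = l2_normalize(short_emb.mean(axis=0))
    persistent = 1.0 - float(g_long @ g_short)

    return fine + weight * persistent, fine, persistent


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    long_start, short_start, offset = sample_clips(300, 32, 8, rng)
    long_emb = rng.normal(size=(32, 16))   # stand-in for encoder outputs
    short_emb = rng.normal(size=(8, 16))
    total, fine, persistent = teg_style_loss(long_emb, short_emb, offset)
    print(long_start, short_start, total, fine, persistent)
```

In this sketch, setting `weight` to a large value emphasizes temporally persistent features, while a small value emphasizes fine-grained alignment, which is one simple way to picture the flexibility the abstract refers to.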

