CV · Dec 2, 2024

SEAL: Semantic Attention Learning for Long Video Representation

arXiv:2412.01798v3 · 9 citations · h-index: 7 · CVPR
Originality: Highly original
AI Analysis

This work targets the high computational cost and temporal redundancy of long video processing, and proposes a representation that applies across a range of downstream tasks.

The paper tackles long video understanding by introducing SEAL, a unified representation that decomposes videos into semantic entities (scenes, objects, and actions) and uses attention learning to reduce redundancy. It reports significant improvements over state-of-the-art methods in video question answering and temporal grounding across multiple benchmarks.

Long video understanding presents challenges due to the inherent high computational complexity and redundant temporal information. An effective representation for long videos must efficiently process such redundancy while preserving essential content for downstream tasks. This paper introduces SEmantic Attention Learning (SEAL), a novel unified representation for long videos. To reduce computational complexity, long videos are decomposed into three distinct types of semantic entities: scenes, objects, and actions, allowing models to operate on a compact set of entities rather than a large number of frames or pixels. To further address redundancy, we propose an attention learning module that balances token relevance with diversity, formulated as a subset selection optimization problem. Our representation is versatile and applicable across various long video understanding tasks. Extensive experiments demonstrate that SEAL significantly outperforms state-of-the-art methods in video question answering and temporal grounding tasks across diverse benchmarks, including LVBench, MovieChat-1K, and Ego4D.
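To make the idea of balancing token relevance with diversity concrete, here is a minimal sketch of a greedy subset selection over token embeddings. It is not SEAL's actual formulation, which the abstract does not spell out; it swaps in a generic maximal-marginal-relevance style heuristic. The names `select_tokens`, `cosine`, and `trade_off`, and the use of cosine similarity for both relevance and redundancy, are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_tokens(tokens, query, k, trade_off=0.5):
    """Greedily pick up to k token indices, trading relevance against redundancy.

    Hypothetical scoring rule (an assumption, not taken from the paper):
        score(i) = trade_off * sim(token_i, query)
                   - (1 - trade_off) * max_j sim(token_i, selected_j)
    """
    relevance = [cosine(t, query) for t in tokens]
    selected = []
    remaining = list(range(len(tokens)))
    while remaining and len(selected) < k:
        best_index, best_score = None, -math.inf
        for i in remaining:
            # Redundancy is the similarity to the closest already-selected token.
            redundancy = max(
                (cosine(tokens[i], tokens[j]) for j in selected), default=0.0
            )
            score = trade_off * relevance[i] - (1.0 - trade_off) * redundancy
            if score > best_score:
                best_index, best_score = i, score
        selected.append(best_index)
        remaining.remove(best_index)
    return selected


# Toy example: tokens 0 and 1 are duplicates, token 2 is different but less relevant.
tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
query = [1.0, 0.2]
balanced = select_tokens(tokens, query, k=2, trade_off=0.5)
relevance_only = select_tokens(tokens, query, k=2, trade_off=1.0)
```

In this toy setup, the balanced setting skips the duplicate token in favor of the dissimilar one, while `trade_off=1.0` ranks purely by relevance and keeps both duplicates. A greedy pass like this is only one common way to approximate a subset selection objective; the paper's module may use a different objective and solver.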

