CV | AI | LG | Sep 11, 2025

Video Understanding by Design: How Datasets Shape Architectures and Insights

arXiv:2509.09151v1 | h-index: 16
Originality: Synthesis-oriented
AI Analysis

The survey offers researchers in video understanding a novel dataset-driven perspective, though its contribution is incremental because it synthesizes existing work into a new framework.

This survey examines how video datasets shape model architectures. It analyzes how dataset characteristics such as motion complexity and temporal span impose inductive biases, and it provides a framework to guide future design for general-purpose video understanding.

Video understanding has advanced rapidly, fueled by increasingly complex datasets and powerful architectures. Yet existing surveys largely classify models by task or family, overlooking the structural pressures through which datasets guide architectural evolution. This survey is the first to adopt a dataset-driven perspective, showing how motion complexity, temporal span, hierarchical composition, and multimodal richness impose inductive biases that models should encode. We reinterpret milestones, from two-stream and 3D CNNs to sequential, transformer, and multimodal foundation models, as concrete responses to these dataset-driven pressures. Building on this synthesis, we offer practical guidance for aligning model design with dataset invariances while balancing scalability and task demands. By unifying datasets, inductive biases, and architectures into a coherent framework, this survey provides both a comprehensive retrospective and a prescriptive roadmap for advancing general-purpose video understanding.
