CV, AI, LG | Feb 26

Summer-22B: A Systematic Approach to Dataset Engineering and Training at Scale for Video Foundation Model

arXiv:2603.00173v1 | h-index: 3
Originality: Incremental advance
AI Analysis

This paper provides practical insights and lessons learned for researchers and engineers developing large-scale video foundation models, particularly regarding the challenges of dataset engineering.

This paper details the development of Summer-22B, a video foundation model trained on approximately 50 million clips. The authors describe their systematic approach to dataset engineering, multi-stage filtering, and training at scale, highlighting that dataset engineering consumed the majority of effort.

We describe our experience training Summer-22B, a video foundation model developed from scratch. This report documents the engineering challenges, design decisions, and lessons learned while scaling from raw footage collection to a functional model trained on approximately 50 million clips. We outline our approach combining metadata-driven dataset curation, multi-stage filtering, $μ$P parameterization, and hypersphere-constrained optimization. We developed the Lavender Data system for dataset management and adopted inference-aware architectural choices. We share observations on what worked in our setting: dataset engineering consumed the majority of effort, architectural variants showed smaller differences than we expected, and $μ$P hyperparameter transfer appeared effective even under geometric constraints. We hope this account proves useful to others undertaking similar projects.
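As a rough illustration of what "multi-stage filtering" over clip metadata can look like, here is a minimal sketch. It is not the authors' pipeline or the Lavender Data API: the `Clip` fields, the stage names, and every threshold are hypothetical, chosen only to show stages applied in order with survivor counts tracked per stage.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Clip:
    # Hypothetical metadata fields; the report does not list its schema.
    clip_id: str
    duration_s: float
    height: int
    aesthetic_score: float


# A stage is a name plus a predicate that returns True to keep a clip.
Stage = tuple[str, Callable[[Clip], bool]]


def run_stages(clips: Iterable[Clip], stages: list[Stage]):
    """Apply the stages in order; record how many clips survive each one."""
    kept = list(clips)
    survivors: dict[str, int] = {}
    for name, keep in stages:
        kept = [c for c in kept if keep(c)]
        survivors[name] = len(kept)
    return kept, survivors


# Illustrative thresholds only (assumptions, not values from the paper).
STAGES: list[Stage] = [
    ("duration", lambda c: 2.0 <= c.duration_s <= 60.0),
    ("resolution", lambda c: c.height >= 480),
    ("aesthetic", lambda c: c.aesthetic_score >= 0.5),
]

clips = [
    Clip("a", 10.0, 720, 0.8),
    Clip("b", 1.0, 1080, 0.9),   # too short
    Clip("c", 30.0, 360, 0.7),   # resolution too low
    Clip("d", 12.0, 720, 0.2),   # low aesthetic score
]

kept, survivors = run_stages(clips, STAGES)
print([c.clip_id for c in kept])
print(survivors)
```

Ordering cheap metadata checks before expensive model-based scores is a common design choice in such pipelines, since each stage only sees the clips that survived the previous one.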

