Summer-22B: A Systematic Approach to Dataset Engineering and Training at Scale for Video Foundation Model
This paper provides practical insights and lessons learned for researchers and engineers developing large-scale video foundation models, particularly regarding the challenges of dataset engineering.
This paper details the development of Summer-22B, a video foundation model trained on approximately 50 million clips. The authors describe their systematic approach to dataset engineering, multi-stage filtering, and training at scale, highlighting that dataset engineering consumed the majority of effort.
We describe our experience training Summer-22B, a video foundation model developed from scratch. This report documents the engineering challenges, design decisions, and lessons learned while scaling from raw footage collection to a functional model trained on approximately 50 million clips. We outline our approach, which combines metadata-driven dataset curation, multi-stage filtering, maximal update parameterization ($\mu$P), and hypersphere-constrained optimization. We developed the Lavender Data system for dataset management and adopted inference-aware architectural choices. We share observations on what worked in our setting: dataset engineering consumed the majority of effort, architectural variants showed smaller differences than we expected, and $\mu$P hyperparameter transfer appeared effective even under geometric constraints. We hope this account proves useful to others undertaking similar projects.
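To make the two optimization ideas named above concrete, the following is a minimal sketch and not the Summer-22B implementation. It assumes the commonly used $\mu$P rule that, under Adam-style optimizers, the learning rate of hidden (matrix-like) weights scales inversely with width, and it assumes one simple form of hypersphere constraint: renormalizing each weight row to unit L2 norm after a plain gradient step. All function names and numeric values are hypothetical.

```python
import math


def mup_hidden_lr(base_lr: float, base_width: int, width: int) -> float:
    """Transfer a learning rate tuned at base_width to a wider model.

    Assumption: hidden-weight learning rate scales as 1/width (muP, Adam-style).
    """
    return base_lr * base_width / width


def project_rows_to_unit_sphere(weight):
    """Renormalize each row to unit L2 norm (one simple hypersphere constraint)."""
    projected = []
    for row in weight:
        norm = math.sqrt(sum(x * x for x in row))
        # Leave all-zero rows untouched to avoid dividing by zero.
        projected.append([x / norm for x in row] if norm > 0 else list(row))
    return projected


def constrained_step(weight, grad, lr):
    """Plain gradient step, then projection back onto the unit sphere."""
    stepped = [
        [w - lr * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weight, grad)
    ]
    return project_rows_to_unit_sphere(stepped)


# Hypothetical values: a rate tuned at width 256, transferred to width 4096.
lr = mup_hidden_lr(base_lr=1e-2, base_width=256, width=4096)
weights = constrained_step(
    weight=[[3.0, 4.0], [1.0, 0.0]],
    grad=[[0.1, -0.2], [0.0, 0.5]],
    lr=lr,
)
row_norms = [math.sqrt(sum(x * x for x in row)) for row in weights]
```

In this sketch the learning rate is tuned once on a small proxy model and rescaled by width, while the projection keeps every row at unit norm regardless of the step taken; whether the actual system uses this exact constraint or scaling rule is not stated in the abstract.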