Training a Large Video Model on a Single Machine in a Day
This work makes large video model training more accessible to academic researchers by making it more efficient. The contribution is incremental: it optimizes existing bottlenecks in the training pipeline rather than introducing a new paradigm.
Large video models typically require clusters of many GPUs running for several days. The paper develops an efficient pipeline that trains a state-of-the-art video model on a single machine with eight consumer-grade GPUs in a day and, for comparable architectures, achieves higher accuracies with 1/8 of the computation of prior work.
Videos are big, complex to pre-process, and slow to train on. State-of-the-art large-scale video models are trained on clusters of 32 or more GPUs for several days. As a consequence, academia largely ceded the training of large video models to industry. In this paper, we show how to still train a state-of-the-art video model on a single machine with eight consumer-grade GPUs in a day. We identify three bottlenecks, IO, CPU, and GPU computation, and optimize each. The result is a highly efficient video training pipeline. For comparable architectures, our pipeline achieves higher accuracies with $\frac{1}{8}$ of the computation compared to prior work. Code is available at https://github.com/zhaoyue-zephyrus/AVION.
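The three bottlenecks named above (IO, CPU, and GPU computation) can be located by timing each stage of a training step separately. The following is a minimal sketch of that idea, not the pipeline from the paper: `StageTimer`, `profile_steps`, and the three callables `read_bytes`, `decode`, and `train_step` are hypothetical names introduced here for illustration. It also assumes each stage runs synchronously; real GPU kernels are launched asynchronously, so a real measurement would need to synchronize the device before stopping the clock.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage (hypothetical helper)."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        # Time the body of the `with` block and add it to the stage's total.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def shares(self):
        # Fraction of the total measured time spent in each stage.
        total = sum(self.totals.values())
        if total <= 0:
            return {}
        return {name: t / total for name, t in self.totals.items()}

    def bottleneck(self):
        # Name of the stage with the largest accumulated time, or None.
        if not self.totals:
            return None
        return max(self.totals, key=self.totals.get)


def profile_steps(read_bytes, decode, train_step, num_steps):
    """Run `num_steps` training steps, timing IO, CPU, and GPU stages.

    All three callables are placeholders supplied by the caller:
    read_bytes() -> raw data, decode(raw) -> batch, train_step(batch) -> None.
    """
    timer = StageTimer()
    for _ in range(num_steps):
        with timer.stage("io"):
            raw = read_bytes()
        with timer.stage("cpu"):
            batch = decode(raw)
        with timer.stage("gpu"):
            train_step(batch)
    return timer
```

Calling `profile_steps(...)` with real stage functions and then inspecting `timer.shares()` shows what fraction of each step goes to reading, decoding, and model computation, which indicates where optimization effort is likely to pay off first.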