An Empirical Study of Autoregressive Pre-training from Videos
This work addresses the challenge of scalable visual representation learning for AI systems. Its contribution is incremental: it applies existing autoregressive methods from language modeling to video data.
The authors tackle the problem of learning visual representations through autoregressive pre-training on videos and report competitive performance across image recognition, video classification, object tracking, and robotics benchmarks.
We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in similar scaling curves to those seen in language models, albeit with a different rate. More details at https://brjathu.github.io/toto/
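To make the training objective concrete, the following is a minimal sketch of next-token prediction over a video treated as one flat token sequence. It is an illustration, not the Toto implementation. The toy vocabulary size, the frame-by-frame flattening order, and the names `flatten_video_tokens`, `next_token_pairs`, `autoregressive_loss`, and `uniform_model` are all assumptions introduced here. A real system would use a learned visual tokenizer and a transformer in place of the uniform stand-in model.

```python
import math

VOCAB_SIZE = 8  # hypothetical toy vocabulary; a real tokenizer's is far larger


def flatten_video_tokens(frames):
    """Concatenate per-frame token lists into one sequence (frame order assumed)."""
    return [tok for frame in frames for tok in frame]


def next_token_pairs(sequence):
    """Build (prefix, target) pairs: each token is predicted from all earlier ones."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def autoregressive_loss(sequence, predict_logits):
    """Mean negative log-likelihood of each token given its prefix."""
    pairs = next_token_pairs(sequence)
    total = 0.0
    for context, target in pairs:
        log_probs = log_softmax(predict_logits(context))
        total -= log_probs[target]
    return total / len(pairs)


def uniform_model(context):
    """Stand-in for a transformer: equal logits for every token in the vocabulary."""
    return [0.0] * VOCAB_SIZE


# Two toy frames of four tokens each (made-up values, for illustration only).
frames = [[1, 4, 2, 7], [1, 4, 3, 7]]
sequence = flatten_video_tokens(frames)
loss = autoregressive_loss(sequence, uniform_model)
print(f"tokens: {len(sequence)}, mean next-token loss: {loss:.4f}")
```

Because the stand-in model assigns equal probability to every token, its loss equals the natural log of the vocabulary size, which gives a baseline that any trained model should fall below.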