SViTT-Ego: A Sparse Video-Text Transformer for Egocentric Video
This work addresses the memory constraints that researchers and practitioners face when training egocentric foundation models. It is incremental in that it builds on existing transformer architectures and sparsification techniques.
The authors tackle the memory footprint of pretraining egocentric vision-language transformers by introducing SViTT-Ego, a sparse video-text transformer that integrates edge and node sparsification and is pretrained with the EgoNCE objective. The model achieves a +2.8% gain in EgoMCQ (intra-video) accuracy over LAVILA large.
Pretraining egocentric vision-language models has become essential to improving downstream egocentric video-text tasks. These egocentric foundation models commonly use the transformer architecture, and their memory footprint during pretraining can be substantial. We therefore pretrain SViTT-Ego, the first sparse egocentric video-text transformer model integrating edge and node sparsification. We pretrain on the EgoClip dataset and incorporate the egocentric-friendly objective EgoNCE instead of the frequently used InfoNCE. Most notably, SViTT-Ego obtains a +2.8% gain on EgoMCQ (intra-video) accuracy compared to LAVILA large, using no data augmentation beyond standard image augmentations, while remaining pretrainable on memory-limited devices.
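For readers unfamiliar with the contrastive objectives named above, the following is a minimal sketch of the standard InfoNCE loss (video-to-text direction only) over a batch similarity matrix. The function name `info_nce`, the `temperature` default, and the optional `positives` argument are illustrative assumptions and are not taken from this work. The `positives` argument only shows how a contrastive loss can admit more than one positive per sample; it should not be read as the definition of EgoNCE, which is not spelled out here.

```python
import math


def info_nce(sim, temperature=0.05, positives=None):
    """Video-to-text contrastive loss over a square similarity matrix.

    sim[i][j] is the similarity between video i and text j.
    positives[i] is the set of text indices counted as positives for
    video i (hypothetical extension); by default only the paired text i
    is a positive, which gives plain InfoNCE.
    """
    n = len(sim)
    if positives is None:
        positives = [{i} for i in range(n)]
    total = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        denom = sum(exps)                       # all candidates in the batch
        numer = sum(exps[j] for j in positives[i])  # positives only
        total += -math.log(numer / denom)
    return total / n


# Uninformative similarities give a loss of log(batch size).
uniform = [[0.0, 0.0], [0.0, 0.0]]
# Well-aligned pairs drive the loss toward zero.
aligned = [[1.0, 0.0], [0.0, 1.0]]
print(info_nce(uniform), info_nce(aligned))
```

With the default `positives`, each video is contrasted against every other text in the batch, so the loss falls as the diagonal of `sim` comes to dominate each row.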