CV | Mar 18, 2025

Make Your Training Flexible: Towards Deployment-Efficient Video Models

arXiv:2503.14237v1 | 7 citations | h-index: 27 | Has Code
Originality: Highly original
AI Analysis

The paper addresses the need for deployment-efficient video models that adapt to varying computational budgets in real-world applications, offering a new method for a known bottleneck.

The paper tackles the sub-optimal accuracy-computation trade-offs that fixed token sampling imposes on video models. It proposes Flux, an augmentation tool that combines a flexible sampling grid with token selection to boost model robustness at nearly no additional cost. The resulting FluxViT achieves new state-of-the-art results and, with only 1/4 of the tokens, matches previous state-of-the-art performance, yielding nearly 90% savings.

Popular video training methods mainly operate on a fixed number of tokens sampled from a predetermined spatiotemporal grid, resulting in sub-optimal accuracy-computation trade-offs due to inherent video redundancy. They also lack adaptability to varying computational budgets for downstream tasks, hindering applications of the most competitive model in real-world scenes. We thus propose a new test setting, Token Optimization, for maximized input information across budgets, which optimizes the size-limited set of input tokens through token selection from more suitably sampled videos. To this end, we propose a novel augmentation tool termed Flux. By making the sampling grid flexible and leveraging token selection, it is easily adopted in most popular video training frameworks, boosting model robustness with nearly no additional cost. We integrate Flux in large-scale video pre-training, and the resulting FluxViT establishes new state-of-the-art results across extensive tasks at standard costs. Notably, with 1/4 tokens only, it can still match the performance of previous state-of-the-art models with Token Optimization, yielding nearly 90% savings. All models and data are available at https://github.com/OpenGVLab/FluxViT.
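
The idea of a flexible sampling grid combined with token selection under a fixed budget can be illustrated with a minimal sketch. This is not the FluxViT implementation: the function names (`select_tokens`, `flexible_token_sample`), the candidate frame counts and grid sizes, and the scoring function are all hypothetical, and the abstract does not say how tokens are scored. The sketch only shows the general shape of the technique: draw a grid of varying size, score every token, and keep at most `budget` of them.

```python
import random


def select_tokens(scores, budget):
    """Return indices of the `budget` highest-scoring tokens, in original order."""
    if budget >= len(scores):
        return list(range(len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def flexible_token_sample(budget, frame_choices, side_choices, score_fn, rng):
    """Draw a random spatiotemporal grid, then keep at most `budget` tokens.

    frame_choices / side_choices: candidate frame counts and patch-grid sides
    (assumed values, not taken from the paper).
    score_fn: hypothetical per-token importance score over (t, y, x) coordinates.
    """
    num_frames = rng.choice(frame_choices)
    side = rng.choice(side_choices)
    # Enumerate every patch token in the sampled grid as a (t, y, x) coordinate.
    coords = [
        (t, y, x)
        for t in range(num_frames)
        for y in range(side)
        for x in range(side)
    ]
    scores = [score_fn(c) for c in coords]
    keep = select_tokens(scores, budget)
    return [coords[i] for i in keep]


if __name__ == "__main__":
    rng = random.Random(0)
    # Placeholder score: random importance. A real system would use a learned
    # or content-based criterion; this choice is purely illustrative.
    tokens = flexible_token_sample(
        budget=32,
        frame_choices=(4, 8),
        side_choices=(4, 6),
        score_fn=lambda c: rng.random(),
        rng=rng,
    )
    print(len(tokens))
```

Because the budget is enforced after the grid is drawn, the same model input size can be fed from sparser or denser samplings of the video, which is the property the abstract attributes to the flexible grid.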

Code Implementations: 1 repo
