Training-Free Action Recognition and Goal Inference with Dynamic Frame Selection
This work addresses open-vocabulary video understanding for researchers and practitioners by offering a training-free approach. The contribution is incremental: it builds on existing frozen models and adds a novel frame selection module.
The paper tackles video goal inference and action recognition without training by introducing VidTFS, a framework that combines frozen vision and language models with a dynamic frame selection module. It reports improved performance over existing multimodal models on datasets such as CrossTask and COIN.
We introduce VidTFS, a Training-free, open-vocabulary video goal and action inference framework that combines a frozen vision foundation model (VFM) and a large language model (LLM) with a novel dynamic Frame Selection module. Our experiments demonstrate that the proposed frame selection module significantly improves the performance of the framework. We validate VidTFS on four widely used video datasets, CrossTask, COIN, UCF101, and ActivityNet, covering goal inference and action recognition tasks under open-vocabulary settings without any training or fine-tuning. The results show that VidTFS outperforms pretrained and instruction-tuned multimodal language models that directly stack an LLM and a VFM for downstream video inference tasks. The adaptability of VidTFS suggests potential for generalizing to new training-free video inference tasks.
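As a rough illustration of the kind of pipeline described above, the following Python sketch selects a few frames by score, describes them, and ranks candidate labels. It is not the VidTFS implementation: the abstract does not say how frames are scored or selected, so the top-k rule and the names `select_frames`, `infer_label`, `score_frame`, `describe_frame`, and `rank_labels` are all hypothetical. The three callables merely stand in for frozen models.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")


def select_frames(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring frames, in temporal order.

    Assumption: a simple top-k rule. The actual selection criterion used by
    VidTFS is not given in the abstract.
    """
    if k <= 0:
        return []
    # Rank frame indices by descending score (ties keep their original order).
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore temporal order so downstream text follows the video timeline.
    return sorted(ranked[:k])


def infer_label(
    frames: Sequence[Frame],
    candidate_labels: Sequence[str],
    score_frame: Callable[[Frame], float],      # hypothetical: frozen scorer
    describe_frame: Callable[[Frame], str],     # hypothetical: frozen VFM
    rank_labels: Callable[[List[str], Sequence[str]], str],  # hypothetical: frozen LLM
    k: int = 4,
) -> str:
    """Pick one label for a video without training any component."""
    scores = [score_frame(frame) for frame in frames]
    keep = select_frames(scores, k)
    # Only the selected frames are described and passed to the language model.
    descriptions = [describe_frame(frames[i]) for i in keep]
    return rank_labels(descriptions, candidate_labels)
```

Passing the models in as callables keeps every component frozen and swappable, which is the property the training-free setting relies on. Only the selection step decides which frames reach the language model.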