CV · May 27

Rethinking Video-Language Model from the Language Input Perspective

arXiv:2605.27920 · 31.916 citations · h-index: 28
Predicted impact: top 19% in CV · last 90 days
Originality: Incremental advance
AI Analysis

For practitioners deploying VLMs in real-world applications, this work offers a practical solution to handle diverse text inputs without manual template engineering.

This paper addresses the limitation of Video-Language Models (VLMs) that assume predefined text templates, which are impractical and degrade performance. The proposed plug-and-play framework generates positive/negative texts and uses attribute-based reasoning with a self-weighted loss, improving SOTA VLMs by up to 3.2% on standard benchmarks.

Driven by the wave of large language models, Video-Language Models (VLMs) have become a significant yet challenging technology for bridging the gap between videos and texts. Although previous VLM works have made significant progress, almost all of them implicitly assume that all texts are predefined by a specific template. In real-world applications, such a strict assumption is impossible to satisfy because 1) predefining all the texts is extremely time-consuming and labor-intensive, and 2) these predefined text inputs are too restrictive and user-unfriendly, which limits their applications. We observe that, given a video input, texts with similar semantics but different templates lead to varying performance. To this end, we propose a novel plug-and-play framework for various VLM-based methods to fully bridge videos and texts. Specifically, we first generate positive and negative texts from the original ones to target specific text components. Then, we propose an attribute-based text reasoning strategy to mine the fine-grained textual semantics of the generated texts. Finally, we use videos as guidance to conduct cross-modal bridging by designing a self-weighted loss. Extensive experiments show that the proposed method can serve as a plug-and-play module that effectively improves the performance of state-of-the-art VLMs.

