CV | AI | CL | Apr 13, 2023

Verbs in Action: Improving verb understanding in video-language models

arXiv:2304.06708v1 | 96 citations | h-index: 188
Originality: Highly original
AI Analysis

The paper addresses a critical bottleneck for video-language models in applications that require action and temporal understanding, and it proposes a solution rather than only highlighting the issue.

The paper tackles the limited verb understanding of CLIP-based video-language models, which restricts their performance in real-world video applications. It proposes a Verb-Focused Contrastive (VFC) framework that achieves state-of-the-art zero-shot results on three downstream tasks.

Understanding verbs is crucial to modelling how people and objects interact with each other and the environment through space and time. Recently, state-of-the-art video-language models based on CLIP have been shown to have limited verb understanding and to rely extensively on nouns, restricting their performance in real-world video applications that require action and temporal understanding. In this work, we improve verb understanding for CLIP-based video-language models by proposing a new Verb-Focused Contrastive (VFC) framework. This consists of two main components: (1) leveraging pretrained large language models (LLMs) to create hard negatives for cross-modal contrastive learning, together with a calibration strategy to balance the occurrence of concepts in positive and negative pairs; and (2) enforcing a fine-grained, verb phrase alignment loss. Our method achieves state-of-the-art results for zero-shot performance on three downstream tasks that focus on verb understanding: video-text matching, video question-answering and video classification. To the best of our knowledge, this is the first work which proposes a method to alleviate the verb understanding problem, and does not simply highlight it.
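To make the first component more concrete, the following is a minimal sketch of a video-to-text contrastive (InfoNCE-style) loss in which each video sees extra hard-negative captions in addition to the usual in-batch negatives. It is not the authors' implementation: the function name `video_to_text_loss`, the temperature of 0.07, and the idea that each hard negative is a caption with an altered verb are assumptions made for illustration. The calibration strategy and the verb phrase alignment loss are not shown.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def video_to_text_loss(video_emb, text_emb, hard_neg_emb, temperature=0.07):
    """Mean video-to-text InfoNCE loss with extra hard-negative captions.

    video_emb:    B video embeddings (lists of floats)
    text_emb:     B caption embeddings; text_emb[i] is the positive for video i
    hard_neg_emb: B lists of hard-negative caption embeddings for video i
                  (assumed here to come from captions with a changed verb)
    temperature:  softmax temperature (0.07 is an assumed, CLIP-like value)
    """
    videos = [normalize(v) for v in video_emb]
    texts = [normalize(t) for t in text_emb]
    total = 0.0
    for i, v in enumerate(videos):
        # In-batch captions: index i is the positive, the rest are negatives.
        logits = [dot(v, t) / temperature for t in texts]
        # Hard negatives only ever enlarge the denominator.
        logits += [dot(v, normalize(h)) / temperature for h in hard_neg_emb[i]]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(videos)
```

Passing an empty list of hard negatives for every video recovers the plain in-batch loss. Adding a hard negative that lies close to the positive caption in embedding space raises the loss, which is the pressure that would push a model to separate captions differing only in the verb.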

Code Implementations: 1 repo