CV · AI · CL · Mar 17, 2025

Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding

arXiv:2503.13377v3 · 99 citations · h-index: 7
Originality: Incremental advance
AI Analysis

This paper addresses the problem of improving video understanding in AI systems. The advance is incremental because it builds on existing Large Vision-Language Model (LVLM) methods.

The paper tackles the limited generalization of Large Vision-Language Models in Temporal Video Grounding by proposing a reinforcement-learning post-training framework, which achieves state-of-the-art performance across multiple datasets with only 2.5K training samples.

Temporal Video Grounding (TVG), the task of locating specific video segments based on language queries, is a core challenge in long-form video understanding. While recent Large Vision-Language Models (LVLMs) have shown early promise in tackling TVG through supervised fine-tuning (SFT), their abilities to generalize remain limited. To address this, we propose a novel post-training framework that enhances the generalization capabilities of LVLMs via reinforcement learning (RL). Specifically, our contributions span three key directions: (1) Time-R1: we introduce a reasoning-guided post-training framework via RL with verifiable reward to enhance the capabilities of LVLMs on the TVG task. (2) TimeRFT: we explore data-efficient post-training strategies on our curated RL-friendly dataset, which trains the model to progressively comprehend difficult samples, leading to better generalization. (3) TVGBench: we carefully construct a small yet comprehensive benchmark for LVLM evaluation, assessing 11 types of queries and featuring balanced distributions across both videos and queries. Extensive experiments demonstrate that Time-R1 achieves state-of-the-art performance across multiple downstream datasets using only 2.5K training data, while improving its general video understanding capabilities.
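To make "verifiable reward" concrete for TVG: a predicted segment can be scored automatically against the annotated segment, with no learned reward model. The sketch below uses temporal intersection-over-union (IoU), a standard TVG metric, as that score. This is an illustrative assumption: the abstract does not specify the reward Time-R1 actually uses, and the names `Segment` and `temporal_iou` are hypothetical.

```python
from typing import Tuple

# Hypothetical type for illustration: (start_sec, end_sec) of a video segment.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two time intervals, in [0, 1].

    Assumed stand-in for a verifiable TVG reward; not the paper's exact formula.
    """
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Empty or reversed intervals earn no reward.
    if pred_end <= pred_start or gt_end <= gt_start:
        return 0.0
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union


# A prediction of 2s-8s against a ground truth of 4s-10s overlaps for 4s
# out of a combined 8s span.
print(temporal_iou((2.0, 8.0), (4.0, 10.0)))  # 0.5
```

Because the score is computed directly from the annotation, it can be checked for every sampled response during RL, which is what makes a reward of this kind verifiable.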

Code Implementations: 1 repo
