CV, AI, LG | Nov 14, 2025

VIDEOP2R: Video Understanding from Perception to Reasoning

arXiv:2511.11113v1 | 5 citations | h-index: 16
Originality: Incremental advance
AI Analysis

This work addresses video understanding for AI applications, offering a novel method to improve reasoning in video language models, though it builds incrementally on existing reinforcement fine-tuning approaches.

The paper tackles the challenge of extending reinforcement fine-tuning to large video language models by proposing VideoP2R, a process-aware framework that models perception and reasoning separately, achieving state-of-the-art performance on six out of seven video reasoning benchmarks.

Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) and reinforcement learning (RL), has shown promising results in improving the reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning. In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six out of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO and demonstrate that the model's perception output is information-sufficient for downstream reasoning.
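
The abstract does not spell out how PA-GRPO combines its two reward signals, so the following is only a minimal sketch of the general idea of "separate rewards for perception and reasoning". It assumes the usual GRPO-style normalization (reward minus group mean, divided by group standard deviation) applied independently to each process. The function names, the `eps` constant, and the example reward values are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization within one group of sampled responses.

    Each reward is centered on the group mean and scaled by the group
    standard deviation. eps (an assumed constant) avoids division by zero
    when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def process_aware_advantages(perception_rewards, reasoning_rewards):
    """Hypothetical sketch: normalize the two reward streams separately.

    Assumption: each sampled response carries one perception reward and one
    reasoning reward, and each stream is normalized on its own so that a
    response can be credited for good perception even if its reasoning fails.
    """
    if len(perception_rewards) != len(reasoning_rewards):
        raise ValueError("both reward lists must cover the same group of responses")
    return {
        "perception": group_relative_advantages(perception_rewards),
        "reasoning": group_relative_advantages(reasoning_rewards),
    }


if __name__ == "__main__":
    # Illustrative group of four sampled responses to one video question.
    # Response 1 (index 1) perceives the video correctly but reasons wrongly.
    adv = process_aware_advantages(
        perception_rewards=[1.0, 1.0, 0.0, 0.0],
        reasoning_rewards=[1.0, 0.0, 0.0, 0.0],
    )
    for name, values in adv.items():
        print(name, [round(v, 3) for v in values])
```

In this sketch the second response receives a positive perception advantage and a negative reasoning advantage, which a single merged reward could not express. How the paper maps each advantage onto tokens of the output is not stated in the abstract and is left out here.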

