CV · Nov 17, 2025

ViSS-R1: Self-Supervised Reinforcement Video Reasoning

Bo Fang, Yuxin Song, Qiangqiang Wu, Haoyuan Sun, Wenhao Wu, Antoni B. Chan

arXiv:2511.13054v1 · 10.24 citations · h-index: 8 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses underutilized visual information and shortcut learning in video reasoning for MLLMs. It is an incremental advance: a novel method for a known bottleneck.

The paper tackles complex video reasoning in Multimodal Large Language Models (MLLMs). It introduces a self-supervised reinforcement learning algorithm (Pretext-GRPO) and the ViSS-R1 framework, which integrates pretext tasks into the R1 pipeline to strengthen visual-centric understanding. The authors report improved performance on six video reasoning and understanding benchmarks.

Complex video reasoning remains a significant challenge for Multimodal Large Language Models (MLLMs), as current R1-based methodologies often prioritize text-centric reasoning derived from text-based and image-based developments. In video tasks, such strategies frequently underutilize rich visual information, leading to potential shortcut learning and increased susceptibility to hallucination. To foster a more robust, visual-centric video understanding, we start by introducing a novel self-supervised reinforcement learning GRPO algorithm (Pretext-GRPO) within the standard R1 pipeline, in which positive rewards are assigned for correctly solving pretext tasks on transformed visual inputs, forcing the model to process the visual information non-trivially. Building on the effectiveness of Pretext-GRPO, we further propose the ViSS-R1 framework, which streamlines and integrates pretext-task-based self-supervised learning directly into the MLLM's R1 post-training paradigm. Instead of relying solely on sparse visual cues, our framework compels models to reason about transformed visual input by simultaneously processing both pretext questions (concerning transformations) and true user queries. This requires identifying the applied transformation and reconstructing the original video to formulate accurate final answers. Comprehensive evaluations on six widely used video reasoning and understanding benchmarks demonstrate the effectiveness and superiority of our Pretext-GRPO and ViSS-R1 for complex video reasoning. Our code and models will be publicly available.
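The reward idea described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the transformation set (`identity`, `reverse`, `shuffle`), the function names, and the binary 0/1 reward are hypothetical, and the advantage computation is the commonly described GRPO group normalization (reward minus group mean, divided by group standard deviation), which may differ in detail from what the paper uses.

```python
import random
from statistics import mean, pstdev

# Hypothetical pretext transformations; the abstract does not name the ones used.
TRANSFORMS = ("identity", "reverse", "shuffle")


def apply_transform(frames, name, rng):
    """Return a transformed copy of a frame sequence (frames can be any list)."""
    if name == "identity":
        return list(frames)
    if name == "reverse":
        return list(reversed(frames))
    if name == "shuffle":
        shuffled = list(frames)
        rng.shuffle(shuffled)  # may by chance reproduce the original order
        return shuffled
    raise ValueError(f"unknown transform: {name}")


def pretext_reward(predicted, applied):
    """Assumed binary reward: 1.0 if the model names the applied transform."""
    return 1.0 if predicted == applied else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group of responses (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    rng = random.Random(0)
    frames = list(range(8))  # stand-in for decoded video frames
    applied = rng.choice(TRANSFORMS)
    transformed = apply_transform(frames, applied, rng)

    # Stand-in for a group of sampled model answers to the pretext question.
    predictions = [rng.choice(TRANSFORMS) for _ in range(4)]
    rewards = [pretext_reward(p, applied) for p in predictions]
    print(applied, predictions, group_relative_advantages(rewards))
```

In this sketch, a response only earns a positive advantage when it identifies the transformation better than the rest of its group, which is one way to read the abstract's claim that the model must actually inspect the visual input. When all responses in a group receive the same reward, every advantage is zero and the group contributes no learning signal.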
