CV · May 24, 2025

CoT-RVS: Zero-Shot Chain-of-Thought Reasoning Segmentation for Videos

arXiv:2505.18561v3 · 6 citations · h-index: 11
Originality: Highly original
AI Analysis

This paper addresses the problem of segmenting objects in videos from complex text queries, and it represents a novel approach rather than an incremental improvement.

The paper tackles reasoning video object segmentation with complex text queries by proposing CoT-RVS, a training-free framework that uses zero-shot Chain-of-Thought reasoning in Multimodal Large Language Models (MLLMs) to combine temporal and semantic analysis. The authors report that it significantly outperforms previous methods.

Reasoning Video Object Segmentation is a challenging task that aims to generate a mask sequence from an input video given a complex and implicit text query. While existing works finetune Multimodal Large Language Models (MLLMs) for the task, they still fail on video inputs given complex, temporally sensitive queries, indicating a lack of temporal and spatial integration in complex scenarios. In this paper, we propose CoT-RVS, a novel framework that employs the zero-shot Chain-of-Thought (CoT) capability of MLLMs to address these challenges through temporal-semantic reasoning: CoT-RVS analyzes the visible objects within a given frame that possibly match the language query (semantic), and chooses, for each object, a corresponding keyframe among all frames in which that object can be observed effortlessly (temporal). Notably, the CoT-RVS framework is training-free and compatible with closed-source MLLMs, and it can be applied to Reasoning Video Instance Segmentation. Because the framework is training-free, it can also be extended to process online video streams, where the CoT is used at test time to update the object of interest when a better target emerges and becomes visible. We conduct extensive experiments on video object segmentation with explicit and implicit queries. The results show that CoT-RVS significantly outperforms previous works in both cases, qualitatively and quantitatively.
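The temporal-semantic reasoning described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the prompts, the MLLM interface, or the segmentation and tracking modules, so everything named here is hypothetical. In particular, `list_candidates` and `visibility` are assumed stand-ins for MLLM queries, and the numeric visibility score is a simplification of what the paper describes as Chain-of-Thought reasoning. The sketch only shows the control flow: find objects that may match the query (semantic), then pick one keyframe per object where it is easiest to observe (temporal).

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Sequence, Tuple


@dataclass
class Candidate:
    """One object that may match the query, with its chosen keyframe."""
    name: str
    keyframe: int  # index of the frame where the object is easiest to see


def select_keyframes(
    frames: Sequence[Any],
    query: str,
    list_candidates: Callable[[Any, str], List[str]],  # hypothetical MLLM call
    visibility: Callable[[Any, str], float],           # hypothetical MLLM call
) -> List[Candidate]:
    """Pick one keyframe per candidate object (illustrative only)."""
    best: Dict[str, Tuple[float, int]] = {}
    for idx, frame in enumerate(frames):
        # Semantic step: which visible objects could match the query?
        for name in list_candidates(frame, query):
            # Temporal step: keep the frame where this object is clearest.
            # Ties keep the earliest frame because the comparison is strict.
            score = visibility(frame, name)
            if name not in best or score > best[name][0]:
                best[name] = (score, idx)
    return [Candidate(name, idx) for name, (_, idx) in best.items()]


if __name__ == "__main__":
    # Toy frames: each maps an object name to an assumed visibility score.
    toy_frames = [{"dog": 0.2}, {"dog": 0.9, "cat": 0.4}, {"cat": 0.7}]
    picks = select_keyframes(
        toy_frames,
        "the pet that runs away",
        list_candidates=lambda frame, query: list(frame),
        visibility=lambda frame, name: frame[name],
    )
    print(picks)
```

In a real pipeline, each selected keyframe would presumably be handed to a segmentation model and the resulting mask propagated to the other frames; that step is omitted here because the abstract does not specify it.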

Code Implementations: 1 repo
