CV · Jun 4

VTI-CoT: Visual-Textual Interleaved Chain of Thought for Video Reasoning

Shufan Zhang, Ziyue Lin, Bairun Wang, Lei Jin, Xuanding Ding, Xinzhu Ma, Kunlin Yang

arXiv:2606.0573684.2

AI Analysis

This work addresses the absence of visual information in Chain-of-Thought (CoT) reasoning for video understanding.

VTI-CoT introduces a visual-textual interleaved Chain-of-Thought framework for video reasoning, achieving state-of-the-art performance among same-scale models and significantly improving training efficiency.

Video reasoning aims to understand complex temporal events and causal relationships within videos. Recently, Chain-of-Thought (CoT) has been introduced to this field to enhance reasoning accuracy. However, existing CoT-based video reasoning methods primarily rely on text-only information for logical deduction, overlooking critical visual information during the inference process. Inspired by the human cognitive mechanism of reviewing visual segments during inference, we propose VTI-CoT, a Visual-Textual Interleaved CoT framework. VTI-CoT integrates textual reasoning steps with corresponding visual frames. Given the scarcity of visual-textual interleaved CoT in existing datasets, we develop an automated annotation pipeline to construct high-quality multimodal CoT data. Further, reasoning over long-form videos entails increasingly long CoT token sequences, which severely hinders training convergence and efficiency. To address this, we employ Optical Character Recognition (OCR)-based compression techniques to compress CoT supervision signals into a single canvas. Experimental results demonstrate that VTI-CoT achieves state-of-the-art performance among models of the same parameter scale while significantly improving training efficiency.
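To make the idea of interleaving concrete, the following is a minimal sketch of how textual reasoning steps might be paired with the frames they refer to. It is not the authors' implementation: the names `TextStep`, `FrameRef`, `interleave`, and `render`, and the `<frame:N>` placeholder format, are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextStep:
    """One textual reasoning step (hypothetical structure)."""
    text: str


@dataclass
class FrameRef:
    """Reference to a sampled video frame by index (hypothetical structure)."""
    frame_index: int


Step = Union[TextStep, FrameRef]


def interleave(reasoning: List[str], evidence: List[List[int]]) -> List[Step]:
    """Place each reasoning step before the frame indices that support it."""
    if len(reasoning) != len(evidence):
        raise ValueError("each reasoning step needs a (possibly empty) frame list")
    chain: List[Step] = []
    for text, frames in zip(reasoning, evidence):
        chain.append(TextStep(text))
        chain.extend(FrameRef(i) for i in frames)
    return chain


def render(chain: List[Step], frame_token: str = "<frame:{}>") -> str:
    """Flatten the chain into one string, using placeholders for frames."""
    parts = []
    for step in chain:
        if isinstance(step, TextStep):
            parts.append(step.text)
        else:
            parts.append(frame_token.format(step.frame_index))
    return " ".join(parts)


chain = interleave(
    ["The man picks up a cup.", "He then drinks from it."],
    [[3], [7, 8]],
)
print(render(chain))
```

In a real system the frame placeholders would be replaced by visual tokens from the model's vision encoder; the sketch only shows the ordering of text and frame references within a single chain.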
