CV · Jan 30

Structured Over Scale: Learning Spatial Reasoning from Educational Video

Bishoy Galoaa, Xiangyu Bai, Sarah Ostadabbas

arXiv:2601.23251v1 · 1 citation · h-index: 3

Originality: Incremental advance

AI Analysis

This work addresses the problem of improving reasoning capabilities in VLMs, which is relevant to both researchers and practitioners. It is rated incremental because it builds on existing methods, contributing a new dataset and fine-tuning approach.

The paper tackles the failure of vision-language models on simple reasoning tasks such as spatial reasoning by training on pedagogically structured educational videos. It reports improvements of 8-14 points on DoraVQA and a state-of-the-art 86.16% on CVBench, with strong transfer to other benchmarks.

Vision-language models (VLMs) demonstrate impressive performance on standard video understanding benchmarks yet fail systematically on simple reasoning tasks that preschool children can solve, including counting, spatial reasoning, and compositional understanding. We hypothesize that the pedagogically-structured content of educational videos provides an ideal training signal for improving these capabilities. We introduce DoraVQA, a dataset of 5,344 question-answer pairs automatically extracted from 8 seasons of Dora the Explorer with precise timestamp alignment. Each episode follows a consistent context-question-pause-answer structure that creates a self-contained learning environment analogous to interactive tutoring. We fine-tune both Qwen2 and Qwen3 using Group Relative Policy Optimization (GRPO), leveraging the clear correctness signals and structured reasoning traces inherent in educational content. Despite training exclusively on 38 hours of children's educational videos, our approach achieves improvements of 8-14 points on DoraVQA and state-of-the-art 86.16% on CVBench, with strong transfer to Video-MME and NExT-QA, demonstrating effective generalization from narrow pedagogical content to broad multimodal understanding. Through cross-domain benchmarks, we show that VLMs can perform tasks that require robust reasoning learned from structured educational content, suggesting that content structure matters as much as content scale.
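To illustrate how a clear correctness signal can drive GRPO, the sketch below computes group-relative advantages for several sampled answers to one question. It is a minimal illustration, not the authors' training code: the exact-match reward, the function names, the sample answers, and the use of the population standard deviation are all assumptions made for this example.

```python
from statistics import mean, pstdev


def exact_match_reward(predicted: str, reference: str) -> float:
    """Hypothetical binary reward: 1.0 if the normalized answers match, else 0.0."""
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and std of its own sampling group.

    Assumption: population standard deviation is used here; implementations
    may differ in this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical sampled answers to a single question.
samples = ["the bridge", "The Bridge", "the river", "the mountain"]
rewards = [exact_match_reward(s, "the bridge") for s in samples]
advantages = group_relative_advantages(rewards)
```

Correct answers receive positive advantages and incorrect ones negative, so the policy update favors the correct samples relative to the rest of the group. When every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.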

