CVApr 25, 2025

Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization

Kesen Zhao, Beier Zhu, Qianru Sun, Hanwang Zhang

arXiv:2504.18397v225 citationsh-index: 18Has Code

Originality Highly original

AI Analysis

This addresses the challenge of improving visual comprehension in AI models for tasks like spatial reasoning, where existing methods rely on supervised data that is hard to generalize, making it a novel method for a known bottleneck.

The paper tackles the problem of enabling multimodal large language models to perform visual chain-of-thought reasoning without requiring labeled bounding-box data, by introducing an unsupervised framework called UV-CoT that uses preference optimization and automatic data generation, achieving superior performance on six datasets and strong generalization in zero-shot testing on four unseen datasets.

Chain-of-thought (CoT) reasoning greatly improves the interpretability and problem-solving abilities of multimodal large language models (MLLMs). However, existing approaches are focused on text CoT, limiting their ability to leverage visual cues. Visual CoT remains underexplored, and the only work is based on supervised fine-tuning (SFT) that relies on extensive labeled bounding-box data and is hard to generalize to unseen cases. In this paper, we introduce Unsupervised Visual CoT (UV-CoT), a novel framework for image-level CoT reasoning via preference optimization. UV-CoT performs preference comparisons between model-generated bounding boxes (one is preferred and the other is dis-preferred), eliminating the need for bounding-box annotations. We get such preference data by introducing an automatic data generation pipeline. Given an image, our target MLLM (e.g., LLaVA-1.5-7B) generates seed bounding boxes using a template prompt and then answers the question using each bounded region as input. An evaluator MLLM (e.g., OmniLLM-12B) ranks the responses, and these rankings serve as supervision to train the target MLLM with UV-CoT by minimizing negative log-likelihood losses. By emulating human perception--identifying key regions and reasoning based on them--UV-CoT can improve visual comprehension, particularly in spatial reasoning tasks where textual descriptions alone fall short. Our experiments on six datasets demonstrate the superiority of UV-CoT, compared to the state-of-the-art textual and visual CoT methods. Our zero-shot testing on four unseen datasets shows the strong generalization of UV-CoT. The code is available in https://github.com/kesenzhao/UV-CoT.

View on arXiv PDF Code

Similar