AI · Apr 12

A Progressive Training Strategy for Vision-Language Models to Counteract Spatio-Temporal Hallucinations in Embodied Reasoning

Xiaoda Yang, Shuai Yang, Can Wang, Jingyang Xue, Menglan Tang, Checheng Yu, Xunzhe Zhou, Sashuai Zhou, Tao Jin, Lixin Yang, Xiangyu Yue, Zhou Zhao

arXiv:2604.10506 · 88.3 · h-index: 15

Predicted impact top 22% in AI · last 90 days

Originality Incremental advance

AI Analysis

For embodied AI systems requiring robust temporal understanding, this work reduces temporal biases in VLMs, enabling more reliable reasoning across time.

The paper tackles multi-image reasoning hallucination in VLMs, where a large performance gap between forward and reverse temporal queries indicates reliance on shortcuts. Its progressive training strategy reduces this gap from over 70% to 6.53%, demonstrating more genuine spatiotemporal reasoning.

Vision-Language Models (VLMs) have made significant strides in static image understanding but continue to face critical hurdles in spatiotemporal reasoning. A major bottleneck is "multi-image reasoning hallucination", where a massive performance drop between forward and reverse temporal queries reveals a dependence on superficial shortcuts instead of genuine causal understanding. To mitigate this, we first develop a new Chain-of-Thought (CoT) dataset that decomposes intricate reasoning into detailed spatiotemporal steps and definitive judgments. Building on this, we present a progressive training framework: it initiates with supervised pre-training on our CoT dataset to instill logical structures, followed by fine-tuning with scalable weakly-labeled data for broader generalization. Our experiments demonstrate that this approach not only improves backbone accuracy but also slashes the forward-backward performance gap from over 70% to only 6.53%. This confirms the method's ability to develop authentic dynamic reasoning and reduce the inherent temporal biases of current VLMs.
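
The forward-backward performance gap described above can be illustrated with a small sketch. The abstract does not give the exact metric, so this assumes the gap is the absolute difference between accuracy on forward temporal queries and accuracy on reverse temporal queries, expressed in percentage points. The function names and the toy answers are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that exactly match their labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def forward_reverse_gap(
    fwd_preds: Sequence[str],
    fwd_labels: Sequence[str],
    rev_preds: Sequence[str],
    rev_labels: Sequence[str],
) -> float:
    """Assumed definition: |forward accuracy - reverse accuracy| in percentage points."""
    fwd_acc = accuracy(fwd_preds, fwd_labels)
    rev_acc = accuracy(rev_preds, rev_labels)
    return abs(fwd_acc - rev_acc) * 100.0


# Toy data (made up): a model that answers "yes" to every query looks perfect
# when all forward answers are "yes" but fails on most reversed queries.
fwd_labels = ["yes", "yes", "yes", "yes"]
fwd_preds = ["yes", "yes", "yes", "yes"]
rev_labels = ["no", "no", "no", "yes"]
rev_preds = ["yes", "yes", "yes", "yes"]

gap = forward_reverse_gap(fwd_preds, fwd_labels, rev_preds, rev_labels)
print(f"forward-reverse gap: {gap:.2f} percentage points")
```

Under this assumed definition, a model that relies on a shortcut such as a fixed answer scores well in one temporal direction and poorly in the other, so a large gap signals the bias even when forward accuracy alone looks strong.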
