LG AI | Oct 28, 2025

SpatialTraceGen: High-Fidelity Traces for Efficient VLM Spatial Reasoning Distillation

Gio Huh, Dhruv Sheth, Rayhan Zirvi, Frank Xiao

arXiv:2511.00054v1 | h-index: 2

Originality: Incremental advance

AI Analysis

The work addresses the data-efficiency bottleneck in deploying efficient models for spatial reasoning, though it is incremental because it builds on existing distillation and verification methods.

The paper tackles the problem of generating high-quality, step-by-step reasoning data for fine-tuning smaller models on spatial reasoning tasks, where VLMs struggle. On the CLEVR-Humans benchmark, its verifier-guided process improves the average trace quality score by 17% and reduces quality variance by over 40%.

While Vision-Language Models (VLMs) excel in many areas, they struggle with complex spatial reasoning, which requires problem decomposition and strategic tool use. Fine-tuning smaller, more deployable models offers an efficient path to strong performance, but this is hampered by a major bottleneck: the absence of high-quality, step-by-step reasoning data. To address this data-efficiency gap, we introduce SpatialTraceGen, a framework to distill the reasoning processes of a large teacher model into a high-quality dataset of multi-hop, multi-tool reasoning traces. A key innovation is our automated Verifier, which scalably ensures the fidelity of each reasoning step, providing a cost-effective alternative to manual human annotation. On the CLEVR-Humans benchmark, this verifier-guided process improves the average quality score of traces by 17% while reducing quality variance by over 40%. SpatialTraceGen delivers a dataset of expert traces, providing the structured, step-by-step examples of tool use necessary for effective fine-tuning and sample-efficient offline reinforcement learning.
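The abstract does not spell out how the Verifier interacts with the teacher model, so the following is only a minimal sketch of one plausible verifier-guided loop, not the authors' implementation. It assumes the teacher proposes one tool call at a time, the verifier returns a score in [0, 1] for each proposed step, and a step is re-sampled until it clears a threshold or the retry budget runs out. All names (`Step`, `generate_trace`, `toy_verifier`, the tool names, the threshold, and the scores) are hypothetical.

```python
import statistics
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple

Proposal = Tuple[str, str]  # (tool name, tool argument); hypothetical format


@dataclass
class Step:
    tool: str
    argument: str
    quality: float  # verifier score, assumed to lie in [0, 1]


def generate_trace(
    propose_step: Callable[[List[Step]], Proposal],
    verify_step: Callable[[List[Step], str, str], float],
    max_steps: int = 2,
    threshold: float = 0.6,
    max_retries: int = 3,
) -> List[Step]:
    """Build a trace step by step, keeping the best-scoring proposal per step."""
    trace: List[Step] = []
    for _ in range(max_steps):
        best = None
        for _ in range(max_retries):
            tool, argument = propose_step(trace)
            candidate = Step(tool, argument, verify_step(trace, tool, argument))
            if best is None or candidate.quality > best.quality:
                best = candidate
            if candidate.quality >= threshold:
                break  # step accepted, stop re-sampling
        trace.append(best)
    return trace


def trace_quality(trace: List[Step]) -> float:
    """Average verifier score over the steps of one trace."""
    return statistics.mean(step.quality for step in trace)


def relative_change(before: float, after: float) -> float:
    """Relative change of a statistic (mean or variance) between two runs."""
    return (after - before) / before


# Toy stand-ins for the teacher and the verifier (assumptions for illustration).
def make_toy_teacher(proposals: Iterable[Proposal]) -> Callable[[List[Step]], Proposal]:
    iterator = iter(proposals)
    return lambda trace: next(iterator)


TOY_SCORES: Dict[str, float] = {"detect": 0.9, "depth": 0.8, "guess": 0.2}


def toy_verifier(trace: List[Step], tool: str, argument: str) -> float:
    return TOY_SCORES.get(tool, 0.0)


PROPOSALS = [("guess", "cube"), ("detect", "cube"), ("depth", "cube")]

# With retries, the low-scoring "guess" step is replaced by a better proposal.
verified = generate_trace(make_toy_teacher(PROPOSALS), toy_verifier)
# With a single attempt per step, the first proposal is always kept.
unverified = generate_trace(make_toy_teacher(PROPOSALS), toy_verifier, max_retries=1)
```

In this toy setup the verified trace keeps only steps that clear the threshold, while the unverified trace retains the weak first proposal. Applying `relative_change` to the mean and variance of `trace_quality` across many traces gives the kind of statistics the abstract reports, although the scoring scale used in the paper may differ.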
