CVAug 25, 2025

TinyGiantVLM: A Lightweight Vision-Language Architecture for Spatial Reasoning under Resource Constraints

arXiv:2508.17595v1h-index: 12025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)
Originality Highly original
AI Analysis

This addresses spatial reasoning challenges for industrial logistics applications, representing an incremental improvement with a novel method for a known bottleneck.

The paper tackled the problem of fine-grained spatial reasoning in warehouse-scale environments by proposing TinyGiantVLM, a lightweight vision-language architecture, achieving 5th place on the AI City Challenge 2025 leaderboard with a score of 66.8861 using a 64M-parameter model.

Reasoning about fine-grained spatial relationships in warehouse-scale environments poses a significant challenge for existing vision-language models (VLMs), which often struggle to comprehend 3D layouts, object arrangements, and multimodal cues in real-world industrial settings. In this paper, we present TinyGiantVLM, a lightweight and modular two-stage framework designed for physical spatial reasoning, distinguishing itself from traditional geographic reasoning in complex logistics scenes. Our approach encodes both global and region-level features from RGB and depth modalities using pretrained visual backbones. To effectively handle the complexity of high-modality inputs and diverse question types, we incorporate a Mixture-of-Experts (MoE) fusion module, which dynamically combines spatial representations to support downstream reasoning tasks and improve convergence. Training is conducted in a two-phase strategy: the first phase focuses on generating free-form answers to enhance spatial reasoning ability, while the second phase uses normalized answers for evaluation. Evaluated on Track 3 of the AI City Challenge 2025, our 64M-parameter base model achieved 5th place on the leaderboard with a score of 66.8861, demonstrating strong performance in bridging visual perception and spatial understanding in industrial environments. We further present an 80M-parameter variant with expanded MoE capacity, which demonstrates improved performance on spatial reasoning tasks.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes