CV | Jul 11, 2025

ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way

arXiv:2507.08679v2 | Citations: 1 | h-index: 1 | 2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)
Originality: Incremental advance
AI Analysis

The paper addresses hallucination and weak spatial reasoning in MLLMs for applications such as visual question answering. The advance is incremental because it builds on existing prompting and depth estimation techniques.

The paper aims to improve the spatial reasoning and grounding of Multimodal Large Language Models (MLLMs) without any training. It introduces ByDeWay, which uses monocular depth estimation to split a scene into depth layers, captions each layer, and appends those captions to the prompt. The authors report consistent improvements on the POPE and GQA benchmarks.

We introduce ByDeWay, a training-free framework designed to enhance the performance of Multimodal Large Language Models (MLLMs). ByDeWay uses a novel prompting strategy called Layered-Depth-Based Prompting (LDP), which improves spatial reasoning and grounding without modifying any model parameters. It segments the scene into closest, mid-range, and farthest layers using monocular depth estimation, then generates region-specific captions with a grounded vision-language model. These structured, depth-aware captions are appended to the image-question prompt, enriching it with spatial context. This guides MLLMs to produce more grounded and less hallucinated responses. Our method is lightweight, modular, and compatible with black-box MLLMs. Experiments on hallucination-sensitive (POPE) and reasoning-intensive (GQA) benchmarks show consistent improvements across multiple MLLMs, validating the effectiveness of depth-aware prompting in a zero-training setting.
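The pipeline described in the abstract (depth layers, per-layer captions, captions appended to the prompt) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the depth map is a NumPy array in which larger values mean farther away, that layers are split at the 1/3 and 2/3 depth quantiles (the abstract does not state the thresholds), and that `caption_fn` is a hypothetical stand-in for whatever grounded vision-language captioner is used. The function names and the prompt wording are also illustrative.

```python
import numpy as np

LAYER_NAMES = ("closest", "mid-range", "farthest")

def split_depth_layers(depth, low_q=1 / 3, high_q=2 / 3):
    """Split a depth map into three boolean masks.

    Assumption: larger depth values are farther from the camera, and the
    quantile cut points are illustrative, not taken from the paper.
    """
    lo, hi = np.quantile(depth, [low_q, high_q])
    closest = depth <= lo
    farthest = depth > hi
    mid = ~closest & ~farthest
    return dict(zip(LAYER_NAMES, (closest, mid, farthest)))

def layered_depth_prompt(image, depth, question, caption_fn):
    """Build a depth-aware text prompt for an MLLM.

    image:      H x W x C array
    depth:      H x W array from any monocular depth estimator
    caption_fn: hypothetical callable mapping a masked image to a caption
    """
    layers = split_depth_layers(depth)
    lines = ["Depth-aware scene description:"]
    for name in LAYER_NAMES:
        # Zero out pixels outside the current layer before captioning.
        masked = np.where(layers[name][..., None], image, 0)
        lines.append(f"{name.capitalize()} layer: {caption_fn(masked)}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)
```

The returned string would be sent to the MLLM together with the original image. Because only the text prompt changes, the same sketch applies to black-box models that expose nothing beyond an image-plus-text interface.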
