Mitigating Hallucinations in Multimodal Spatial Relations through Constraint-Aware Prompting
This work addresses a significant problem for developers and users of large vision-language models, particularly those relying on accurate spatial relation predictions.
The authors tackled the problem of spatial relation hallucinations in large vision-language models, achieving improved performance on three datasets by reducing incorrect predictions about object positions. Their constraint-aware prompting framework led to more spatially coherent outputs.
Spatial relation hallucinations pose a persistent challenge in large vision-language models (LVLMs), leading them to generate incorrect predictions about object positions and spatial configurations within an image. To address this issue, we propose a constraint-aware prompting framework designed to reduce spatial relation hallucinations. Specifically, we introduce two types of constraints: (1) a bidirectional constraint, which ensures consistency in pairwise object relations, and (2) a transitivity constraint, which enforces relational dependence across multiple objects. By incorporating these constraints, LVLMs can produce more spatially coherent and consistent outputs. We evaluate our method on three widely used spatial relation datasets, demonstrating performance improvements over existing approaches. Additionally, a systematic analysis of different bidirectional relation analysis choices and transitivity reference selections highlights the broader potential of our method for incorporating constraints to mitigate spatial relation hallucinations.
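To make the two constraints concrete, the following minimal sketch checks a set of predicted relations for bidirectional and transitive consistency. It is an illustration of the logic only, not the prompting framework itself: the relation vocabulary in `INVERSE` and both function names are assumptions introduced here, since the abstract does not specify them.

```python
# Hypothetical relation vocabulary (an assumption; the abstract does not list
# the relations used). Each relation maps to its inverse.
INVERSE = {
    "left of": "right of",
    "right of": "left of",
    "above": "below",
    "below": "above",
}


def bidirectional_consistent(rel_ab: str, rel_ba: str) -> bool:
    """Check that "A rel_ab B" agrees with "B rel_ba A".

    For example, "A left of B" is consistent only with "B right of A".
    Unknown relations are treated as inconsistent.
    """
    return INVERSE.get(rel_ab) == rel_ba


def transitive_consistent(rel_ab: str, rel_bc: str, rel_ac: str) -> bool:
    """Check the transitivity constraint through a reference object B.

    If A and B stand in the same relation as B and C (e.g. both "left of"),
    then A and C must stand in that relation too. When the two premises
    differ, this simple sketch imposes no constraint.
    """
    if rel_ab == rel_bc:
        return rel_ac == rel_ab
    return True


# Example: the model claims "cup left of plate" and "plate left of fork".
print(bidirectional_consistent("left of", "right of"))
print(transitive_consistent("left of", "left of", "right of"))
```

In a prompting setting, checks like these could be expressed as instructions or follow-up queries to the model rather than as post-hoc code; the sketch only shows which answer combinations the constraints rule out.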