RO · CV · Sep 10, 2025

SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation

arXiv:2509.08757v1 · 4 citations · h-index: 28
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for systematic evaluation of VLMs in social robot navigation, which is crucial for developing safe and compliant robots in human-centered environments. Its contribution is incremental, however, because it builds on existing VLM and benchmark research.

The paper tackles the problem of evaluating Vision-Language Models (VLMs) for scene understanding in social robot navigation by introducing SocialNav-SUB, a benchmark dataset and framework. The results show that the best-performing VLM underperforms simpler rule-based and human baselines, indicating gaps in social scene understanding.

Robot navigation in dynamic, human-centered environments requires socially compliant decisions grounded in robust scene understanding. Recent Vision-Language Models (VLMs) exhibit promising capabilities such as object recognition, common-sense reasoning, and contextual understanding, capabilities that align with the nuanced requirements of social robot navigation. However, it remains unclear whether VLMs can accurately understand complex social navigation scenes (e.g., inferring the spatial-temporal relations among agents and human intentions), which is essential for safe and socially compliant robot navigation. While some recent works have explored the use of VLMs in social robot navigation, no existing work systematically evaluates their ability to meet these necessary conditions. In this paper, we introduce the Social Navigation Scene Understanding Benchmark (SocialNav-SUB), a Visual Question Answering (VQA) dataset and benchmark designed to evaluate VLMs for scene understanding in real-world social robot navigation scenarios. SocialNav-SUB provides a unified framework for evaluating VLMs against human and rule-based baselines across VQA tasks requiring spatial, spatiotemporal, and social reasoning in social robot navigation. Through experiments with state-of-the-art VLMs, we find that while the best-performing VLM achieves an encouraging probability of agreeing with human answers, it still underperforms a simpler rule-based approach and human consensus baselines, indicating critical gaps in the social scene understanding of current VLMs. Our benchmark sets the stage for further research on foundation models for social robot navigation, offering a framework to explore how VLMs can be tailored to meet real-world social robot navigation needs. An overview of this paper along with the code and data can be found at https://larg.github.io/socialnav-sub.

