RO · Jun 1

RoboSemanticBench: Diagnosing Semantic Grounding in Action Prediction for VLA Models

arXiv:2606.02277 · 95.8
Predicted impact: top 5% in RO · last 90 days
Originality: Incremental advance
AI Analysis

For researchers developing VLA models, this benchmark diagnoses a critical failure in semantic grounding that existing evaluations miss.

The authors introduce RoboSemanticBench, a benchmark to test whether VLA models use instruction semantics to select the correct physical target. They find that many policies select the semantically correct block at near-random rates, revealing a gap between semantic understanding and action prediction.

Vision-language-action (VLA) models are built on the premise that semantic understanding from pretrained language or vision-language backbones should guide robot action prediction. Yet robot fine-tuning is optimized as imitation over task-specific action distributions, and many evaluations can be solved through visual or instruction-action shortcuts. We introduce RoboSemanticBench (RSB), an embodied benchmark for diagnosing semantic grounding in action prediction: whether post-trained VLA models can use complex instruction semantics to select and manipulate the correct physical target. In each episode, a robot receives a multiple-choice math or general-knowledge question, observes candidate answer blocks, and must grasp the block corresponding to the correct answer. RSB covers controlled arithmetic, grade-school mathematical understanding, and commonsense or factual understanding under four-choice and ten-choice suites. Across representative VLA models, we find that many policies learn to grasp candidate blocks but select the semantically correct block at near-random or below-random rates after controlling for grasp success, revealing a persistent gap between backbone-level semantic competence and action prediction.
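The abstract's key measurement, selection accuracy "after controlling for grasp success" compared against chance, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the `Episode` record, its field names, and the `summarize` function are hypothetical, and the sketch assumes that "controlling for grasp success" means conditioning on episodes where some block was grasped, with chance taken as 1/4 or 1/10 depending on the suite.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Episode:
    """One hypothetical evaluation episode (field names are assumptions)."""
    num_choices: int              # 4 or 10 candidate answer blocks
    correct_index: int            # index of the block holding the right answer
    grasped_index: Optional[int]  # index of the grasped block, None if no grasp


def summarize(episodes: List[Episode]) -> Dict[str, Optional[float]]:
    """Separate grasping skill from semantic selection.

    Returns the grasp rate over all episodes, the selection accuracy
    restricted to episodes with a successful grasp, and the chance rate
    a uniformly random picker would reach on those same episodes.
    """
    if not episodes:
        raise ValueError("need at least one episode")

    grasped = [e for e in episodes if e.grasped_index is not None]
    grasp_rate = len(grasped) / len(episodes)

    if not grasped:
        # No successful grasps: conditional accuracy is undefined.
        return {"grasp_rate": grasp_rate, "selection_accuracy": None, "chance": None}

    correct = sum(e.grasped_index == e.correct_index for e in grasped)
    selection_accuracy = correct / len(grasped)
    # Average 1/num_choices so mixed four- and ten-choice sets are handled.
    chance = sum(1.0 / e.num_choices for e in grasped) / len(grasped)

    return {
        "grasp_rate": grasp_rate,
        "selection_accuracy": selection_accuracy,
        "chance": chance,
    }
```

Under these assumptions, a policy with a high `grasp_rate` but a `selection_accuracy` close to `chance` would match the failure pattern the abstract describes: it has learned to grasp blocks without using the question to decide which one.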

