Restoring Linguistic Grounding in VLA Models via Train-Free Attention Recalibration

Ninghao Zhang, Bin Zhu, Shijie Zhou, Jingjing Chen

arXiv:2603.06001v1 · 3 citations

Predicted impact: top 14% in RO · last 90 days · Originality: Highly original

AI Analysis

This work addresses a critical reliability issue for generalist robotic policies by identifying and mitigating "linguistic blindness" in VLA models, which is crucial for safe and robust robot operation.

This paper identifies "linguistic blindness" in Vision-Language-Action (VLA) models, a failure mode in which a model prioritizes visual information over a contradictory language instruction and executes erroneous actions. To address it, the authors propose Instruction-Guided Attention Recalibration (IGAR), a train-free inference-time mechanism that reduces erroneous execution under Out-of-Distribution (OOD) contradictory instructions while maintaining baseline task performance across 30 LIBERO tasks. The approach is also validated on a real Franka robotic arm.

Vision-Language-Action (VLA) models enable robots to perform manipulation tasks directly from natural language instructions and are increasingly viewed as a foundation for generalist robotic policies. However, their reliability under Out-of-Distribution (OOD) instructions remains underexplored. In this paper, we reveal a critical failure mode in which VLA policies continue executing visually plausible actions even when the language instruction contradicts the scene. We refer to this phenomenon as linguistic blindness, where VLA policies prioritize visual priors over instruction semantics during action generation. To systematically analyze this issue, we introduce ICBench, a diagnostic benchmark constructed from the LIBERO dataset that probes language-action coupling by injecting controlled OOD instruction contradictions while keeping the visual environment unchanged. Evaluations on three representative VLA architectures, including Pi0, Pi0.5 and OpenVLA OFT, show that these models frequently succeed at tasks despite logically impossible instructions, revealing a strong visual bias in action generation. To mitigate this issue, we propose Instruction-Guided Attention Recalibration (IGAR), a train-free inference-time mechanism that rebalances attention distributions to restore the influence of language instructions. IGAR operates without retraining or architectural modification and can be directly applied to existing VLA models. Experiments across 30 LIBERO tasks demonstrate that IGAR substantially reduces erroneous execution under OOD contradictory instructions while preserving baseline task performance. We additionally validate the approach on a real Franka robotic arm, where IGAR effectively prevents manipulation triggered by inconsistent instructions.
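The abstract does not spell out how IGAR rebalances attention, so the following is only a minimal sketch of the general idea of inference-time attention reweighting, not the authors' algorithm. It assumes a single, already-normalized attention distribution over tokens and a hypothetical mask marking which tokens belong to the language instruction. The function names, the multiplicative `boost` factor, and the renormalization step are all illustrative assumptions.

```python
from typing import List, Sequence


def language_mass(weights: Sequence[float], is_language: Sequence[bool]) -> float:
    """Return the share of attention mass that falls on instruction tokens."""
    return sum(w for w, lang in zip(weights, is_language) if lang)


def recalibrate_attention(
    weights: Sequence[float],
    is_language: Sequence[bool],
    boost: float = 2.0,
) -> List[float]:
    """Upweight instruction tokens and renormalize (illustrative only).

    Assumption: `weights` is one attention distribution (non-negative,
    sums to 1) and `boost` is a hypothetical hyperparameter; neither the
    scaling rule nor its value comes from the paper.
    """
    if len(weights) != len(is_language):
        raise ValueError("weights and is_language must have the same length")
    if boost <= 0:
        raise ValueError("boost must be positive")

    # Scale only the language-token weights; visual tokens are left as-is.
    scaled = [w * boost if lang else w for w, lang in zip(weights, is_language)]

    total = sum(scaled)
    if total == 0:
        raise ValueError("attention weights sum to zero")

    # Renormalize so the result is again a probability distribution.
    return [s / total for s in scaled]


# Toy example: two visual tokens dominate two instruction tokens.
attn = [0.4, 0.4, 0.1, 0.1]
mask = [False, False, True, True]
rebalanced = recalibrate_attention(attn, mask, boost=3.0)
```

In this toy case the instruction tokens start with 20% of the attention mass and end with a larger share after reweighting, with no change to model parameters. A real VLA model would need this applied per head and per layer, and the choice of where and how strongly to intervene is exactly what the paper's method would have to specify.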
