Point What You Mean: Visually Grounded Instruction Policy
This work addresses referential ambiguity in embodied control for robotic systems. It is an incremental improvement that adds visual grounding to existing VLA models.
The paper tackles the limited object-referring ability of Vision-Language-Action (VLA) models in cluttered or out-of-distribution scenes. It introduces Point-VLA, a plug-and-play policy that augments language instructions with visual cues such as bounding boxes, and reports consistently stronger performance than text-only models, especially in challenging scenarios.
Vision-Language-Action (VLA) models align vision and language with embodied control, but their object-referring ability remains limited when they rely solely on text prompts, especially in cluttered or out-of-distribution (OOD) scenes. In this study, we introduce Point-VLA, a plug-and-play policy that augments language instructions with explicit visual cues (e.g., bounding boxes) to resolve referential ambiguity and enable precise object-level grounding. To efficiently scale visually grounded datasets, we further develop an automatic data annotation pipeline that requires minimal human effort. We evaluate Point-VLA on diverse real-world referring tasks and observe consistently stronger performance than text-only instruction VLAs, particularly in cluttered or unseen-object scenarios, along with robust generalization. These results demonstrate that Point-VLA effectively resolves object-referring ambiguity through pixel-level visual grounding, achieving more generalizable embodied control.
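To make the idea of augmenting an instruction with a bounding-box cue concrete, the following is a minimal sketch of one plausible way to render such a cue onto a camera frame and pair it with a text instruction. It is not the Point-VLA implementation: the function names `draw_box` and `build_grounded_instruction`, the list-of-rows image representation, the box convention, and the red outline are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]        # rows of RGB pixels; a stand-in for a camera frame
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), max edges exclusive


def draw_box(image: Image, box: Box,
             color: Pixel = (255, 0, 0), thickness: int = 2) -> Image:
    """Return a copy of `image` with a rectangle outline drawn at `box`.

    Hypothetical helper: the outline style is an assumption, not the
    rendering used by the paper.
    """
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = box
    # Clip the box to the image bounds.
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    if x0 >= x1 or y0 >= y1:
        raise ValueError("box does not overlap the image")

    out = [list(row) for row in image]  # copy so the input frame is untouched
    for y in range(y0, y1):
        for x in range(x0, x1):
            on_border = (y - y0 < thickness or y1 - y <= thickness
                         or x - x0 < thickness or x1 - x <= thickness)
            if on_border:
                out[y][x] = color
    return out


def build_grounded_instruction(image: Image, box: Box, text: str) -> Dict[str, object]:
    """Pair a text instruction with a frame that visually marks the target.

    Hypothetical input format: a real policy would define its own.
    """
    return {"image": draw_box(image, box), "text": text}


# Usage: mark a target region on a blank 64x64 frame.
frame = [[(0, 0, 0)] * 64 for _ in range(64)]
obs = build_grounded_instruction(frame, (10, 20, 30, 40), "pick up the marked object")
```

With the target marked in the image, the text instruction no longer needs to identify the object by name or description, which is the kind of referential ambiguity the abstract describes for cluttered or unseen-object scenes.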