RO AI CV LG · Sep 28, 2025

Focusing on What Matters: Object-Agent-centric Tokenization for Vision Language Action models

arXiv:2509.23655v1 · 3 citations · h-index: 4
Originality: Incremental advance
AI Analysis

This paper addresses efficiency issues in Vision-Language-Action models for robotics, offering a domain-specific incremental improvement.

The paper tackles the high computational cost of adapting Vision-Language-Models for robotic manipulation. It proposes Oat-VLA, an object-agent-centric tokenization method that reduces the visual input to just a few tokens without performance loss. Oat-VLA converges at least twice as fast as OpenVLA on the LIBERO suite and outperforms it in real-world pick-and-place tasks.

Vision-Language-Action (VLA) models offer a pivotal approach to learning robotic manipulation at scale by repurposing large pre-trained Vision-Language-Models (VLMs) to output robotic actions. However, adapting VLMs for robotic domains comes with an unnecessarily high computational cost, which we attribute to the tokenization scheme of visual inputs. In this work, we aim to enable efficient VLA training by proposing Oat-VLA, an Object-Agent-centric Tokenization for VLAs. Building on the insights of object-centric representation learning, our method introduces an inductive bias towards scene objects and the agent's own visual information. As a result, we find that Oat-VLA can drastically reduce the number of visual tokens to just a few without sacrificing performance. We show that Oat-VLA converges at least twice as fast as OpenVLA on the LIBERO suite and outperforms OpenVLA in diverse real-world pick-and-place tasks.
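The abstract does not spell out how the object and agent tokens are built, so the following is only a minimal sketch of one generic way to shrink a grid of patch embeddings to a handful of tokens: masked mean pooling, with one pooled token per object mask and one for the agent region. It is not the Oat-VLA implementation. The function names, the use of a single agent token, and the assumption that boolean masks are already available (for example from some external segmenter) are all hypothetical choices made for illustration.

```python
from typing import List, Sequence

Vector = List[float]


def masked_mean_pool(patches: Sequence[Vector], mask: Sequence[bool]) -> Vector:
    """Average the patch embeddings selected by a boolean mask into one token."""
    selected = [p for p, keep in zip(patches, mask) if keep]
    if not selected:
        raise ValueError("mask selects no patches")
    dim = len(selected[0])
    return [sum(p[i] for p in selected) / len(selected) for i in range(dim)]


def object_agent_tokens(
    patches: Sequence[Vector],
    object_masks: Sequence[Sequence[bool]],
    agent_mask: Sequence[bool],
) -> List[Vector]:
    """Hypothetical helper: one pooled token per object mask, plus one for the agent.

    Assumption: masks are given; how they are obtained is outside this sketch.
    """
    tokens = [masked_mean_pool(patches, m) for m in object_masks]
    tokens.append(masked_mean_pool(patches, agent_mask))
    return tokens


# Toy input: 6 patch embeddings of dimension 2 (a real encoder would emit hundreds).
patches = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [5.0, 5.0], [7.0, 7.0]]
object_masks = [
    [True, True, False, False, False, False],   # patches covering object 1
    [False, False, True, True, False, False],   # patches covering object 2
]
agent_mask = [False, False, False, False, True, True]  # patches covering the agent

tokens = object_agent_tokens(patches, object_masks, agent_mask)
print(len(patches), "patch tokens ->", len(tokens), "pooled tokens")
```

In this toy setup the language model would receive three visual tokens instead of six. The point of the sketch is only that the token count scales with the number of objects plus the agent rather than with image resolution.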
