CV, May 25

Rethinking VLM Representation for VLA Initialization

arXiv:2605.25802
AI Analysis

This work provides design principles for practitioners initializing VLA models from VLMs, clarifying when and how to adapt representations for robot action learning.

The paper investigates what makes a good Vision-Language Model (VLM) representation for initializing Vision-Language-Action (VLA) models. It finds that preserving the original pretrained VLM representation is crucial, that embodied VQA adaptation helps only conditionally, depending on downstream bottlenecks, that LoRA-based training gives a more reliable initialization than full finetuning, and that robot-data pretraining improves initialization further.

Vision-Language-Action (VLA) models widely adopt pretrained Vision-Language Models (VLMs) as policy backbones, yet it remains unclear what kind of pretrained VLM representation is useful as a VLA initialization. In this paper, we study VLA initialization as a controlled representation-design problem along three axes: capability-level embodied VQA supervision, parameter-update strategy, and robot-data pretraining. Our experiments show that the original pretrained VLM representation is a key source of action performance. However, embodied VQA adaptation does not yield uniform gains: its benefit depends on downstream bottlenecks, and gains from different capability domains are not simply additive. For update strategy, LoRA provides a more reliable initialization than Full Finetune, indicating that overly reshaping the pretrained representation can weaken VLA initialization. Robot-data pretraining further improves VLA initialization, with the strongest variant obtained by staged LoRA-based training. Together, these findings suggest that effective VLM-to-VLA adaptation should inject action-relevant embodied and robot-trajectory signals while preserving the pretrained VLM representation that remains useful for action learning.
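
The contrast between LoRA and full finetuning can be made concrete with a small sketch. The code below is a generic illustration of the standard LoRA formulation (a frozen weight plus a scaled low-rank update with zero-initialized `B`), not the paper's training setup; the class `LoRALinear`, the helper `matvec`, and the toy weights are all hypothetical names and values chosen for illustration.

```python
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class LoRALinear:
    """Hypothetical linear layer: frozen weight W plus a low-rank update.

    Effective weight is W + (alpha / r) * B @ A. Only A and B would be
    trained; W stays at its pretrained value.
    """

    def __init__(self, W, r=1, alpha=2.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # frozen pretrained weight, shape (d_out, d_in)
        # A gets a small random init, shape (r, d_in).
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
        # B is zero-initialized, shape (d_out, r), so the update starts at zero.
        self.B = [[0.0] * r for _ in range(d_out)]
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.W, x)                    # pretrained path
        delta = matvec(self.B, matvec(self.A, x))   # low-rank path
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Toy 4x4 "pretrained" weight (illustrative values only).
W = [
    [1.0, 2.0, 3.0, 4.0],
    [0.5, -1.0, 0.0, 2.0],
    [0.0, 0.0, 1.0, 1.0],
    [2.0, 1.0, 0.0, -1.0],
]
layer = LoRALinear(W, r=1)
x = [1.0, 0.0, -1.0, 2.0]

# At initialization the adapted layer reproduces the pretrained layer.
print(layer.forward(x) == matvec(W, x))
# Rank-1 LoRA trains 8 parameters here; full finetuning would update all 16.
print(layer.trainable_params(), len(W) * len(W[0]))
```

Because `B` starts at zero, the adapted layer initially matches the pretrained one, and any later change to the effective weight is confined to a rank-`r` update. Full finetuning has no such constraint, which is one plausible reading of why it can reshape the pretrained representation more aggressively.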
