RO · AI · LG · Dec 2, 2025

VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling

arXiv:2512.02902v1 · 5 citations · h-index: 4
Originality: Incremental advance
AI Analysis

This work addresses the generalization problem in VLA models for robotics and AI applications, offering a targeted adaptation approach that is incremental but highly efficient.

The paper tackles the brittleness of Vision-Language-Action (VLA) models under novel camera viewpoints and visual perturbations, showing that misalignment in Spatial Modeling, rather than Physical Modeling, is the primary cause. It proposes two lightweight adaptation methods: Feature Token Modulation (FTM), which improves Libero viewpoint accuracy from 48.5% to 87.1% with only 4K parameters, and Feature Linear Adaptation (FLA), which reaches 90.8% with 4.7M parameters.

Vision-language-action (VLA) models achieve strong in-distribution performance but degrade sharply under novel camera viewpoints and visual perturbations. We show that this brittleness primarily arises from misalignment in Spatial Modeling, rather than Physical Modeling. To address this, we propose a one-shot adaptation framework that recalibrates visual representations through lightweight, learnable updates. Our first method, Feature Token Modulation (FTM), applies a global affine transformation to visual tokens and improves Libero viewpoint accuracy from 48.5% to 87.1% with only 4K parameters. Building on this, Feature Linear Adaptation (FLA) introduces low-rank updates to the ViT encoder, achieving 90.8% success with 4.7M parameters -- matching LoRA-scale finetuning at far lower cost. Together, these results reveal substantial untapped robustness in pretrained VLA models and demonstrate that targeted, minimal visual adaptation is sufficient to restore viewpoint generalization.
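
The two adaptation ideas named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the token count, the feature width of 2048 (picked only so that a scale-plus-shift pair comes to roughly 4K parameters), the rank of 16, and the identity and zero initializations are all assumptions made for illustration.

```python
import numpy as np


def ftm(tokens, gamma, beta):
    """Global affine modulation: one scale and shift shared by every visual token."""
    # gamma and beta have shape (dim,) and broadcast over the token axis.
    return tokens * gamma + beta


def low_rank_linear(x, W, A, B):
    """Linear layer with frozen weight W plus a trainable low-rank update B @ A."""
    # W: (out, in), B: (out, rank), A: (rank, in)
    return x @ (W + B @ A).T


rng = np.random.default_rng(0)
num_tokens, dim, rank = 256, 2048, 16  # hypothetical sizes, not from the paper

# Stand-in for visual tokens produced by a frozen encoder.
tokens = rng.standard_normal((num_tokens, dim))

# FTM-style parameters, initialized to the identity map (assumed initialization).
gamma = np.ones(dim)
beta = np.zeros(dim)
modulated = ftm(tokens, gamma, beta)
ftm_params = gamma.size + beta.size  # 2 * dim trainable values

# Low-rank update on one hypothetical projection matrix of the encoder.
W = rng.standard_normal((dim, dim))          # frozen pretrained weight
A = rng.standard_normal((rank, dim)) * 0.01  # trainable
B = np.zeros((dim, rank))                    # trainable, zero so the update starts at 0
adapted = low_rank_linear(tokens, W, A, B)
fla_params_per_layer = A.size + B.size       # 2 * rank * dim trainable values

print(ftm_params, fla_params_per_layer)
```

With these assumed initializations both modules start as no-ops, so adaptation begins from the pretrained model's behavior and only the small parameter sets (gamma, beta, A, B) would need to be trained on the one-shot demonstration.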

