AI · May 28

VLA-Trace: Diagnosing Vision-Language-Action Models through Representation and Behavior Tracing

arXiv:2605.30117
AI Analysis

For researchers developing interpretable and robust VLA models, this diagnostic framework identifies key behavioral and representational bottlenecks, though the findings are specific to two models and may not generalize.

VLA-Trace diagnoses Vision-Language-Action models by tracing representation dynamics and causal control, revealing that π0.5 and OpenVLA have distinct adaptation dynamics, multimodal routing strategies, and limitations in fine-grained semantic following despite strong visual grounding.

Understanding how Vision-Language-Action (VLA) models transform multimodal knowledge into embodied control remains an open challenge. We present VLA-Trace, a progressive diagnostic framework that analyzes VLA models through a unified evidence chain from representation dynamics to causal control attribution and behavioral manifestation. It specifically combines cross-modal and checkpoint-drift centered kernel alignment (CKA) to trace representation evolution, attention knockout interventions to identify modality-specific control pathways, and rollout-level behavioral probes to examine grounding, shortcut dependence, and semantic following. Experiments on $π_{0.5}$ and OpenVLA reveal three key findings. First, the two models exhibit distinct modality-specific adaptation dynamics during VLA finetuning. Second, they rely on different multimodal routing strategies and layer-wise dependencies during action decoding. Third, although VLA policies excel at visually grounded trajectory generation, they remain limited in fine-grained semantic following. These findings highlight future directions for representation-preserving adaptation, causal VLA circuits, and compositional semantic control.
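The abstract relies on centered kernel alignment (CKA) to compare representations across modalities and across checkpoints. As a rough illustration of what such a comparison computes, here is a minimal sketch of linear CKA, one common variant. The abstract does not say which CKA variant or which layer features the authors use, so the function name `linear_cka` and the synthetic feature matrices below are assumptions for illustration only, not the paper's implementation.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two feature matrices with the same number of rows.

    x: array of shape (n_samples, d1), e.g. activations from one checkpoint.
    y: array of shape (n_samples, d2), e.g. activations from another checkpoint
       or another modality, computed on the same n_samples inputs.
    """
    # Center each feature dimension over the samples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalised by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Synthetic stand-ins for layer activations (hypothetical data).
rng = np.random.default_rng(0)
feats_base = rng.normal(size=(64, 16))

# A rotated and rescaled copy: linear CKA is invariant to orthogonal
# transforms and isotropic scaling, so it should score as identical.
q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
feats_rotated = 3.0 * feats_base @ q

# Unrelated features: the score should fall below 1.
feats_random = rng.normal(size=(64, 16))

score_same = linear_cka(feats_base, feats_base)
score_rotated = linear_cka(feats_base, feats_rotated)
score_random = linear_cka(feats_base, feats_random)
```

In a checkpoint-drift setting, `x` and `y` would hold the same layer's activations before and after finetuning on a shared batch of inputs; a score near 1 suggests the representation geometry was preserved, while a lower score suggests it drifted.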

