RO · May 29

Can Aerial VLA Models Cooperate? Evaluating Closed-Loop Air-Ground Coordination with CARLA-Air

Tianle Zeng, Yanci Wen, Xueang Yu, Hong Zhang

arXiv:2605.31066 · 67.7 · Has Code

Predicted impact: top 28% in RO · last 90 days · Originality: Incremental advance

AI Analysis

This paper addresses a critical gap in evaluating the cooperative capabilities of aerial VLA models for robotics and autonomous systems, highlighting the limitations of current paradigms for achieving stable air-ground coordination.

This paper investigates the ability of aerial Vision-Language-Action (VLA) models to perform air-ground cooperation, specifically in tasks like moving-platform landing and occlusion-recovery escort. It finds that while current models can track ground partners, they struggle with stable cooperative behavior, with state prompting offering limited benefit and naive bidirectional interaction often amplifying errors.

Recent aerial vision-language-action (VLA) models show promising single-UAV capabilities, such as tracking moving objects and navigating to language-specified landmarks. However, it remains unclear whether these capabilities can transfer to air-ground cooperation, where a UAV and a UGV must act jointly in a shared, closed-loop physical world. We study this question with CARLA-Air, a single-process air-ground evaluation environment that unifies CARLA and AirSim inside one Unreal Engine runtime. By sharing the same world state, physics tick, and sensing pipeline, CARLA-Air enables physically consistent UAV-UGV interaction and precise measurement of simulation-timestamp alignment and effective coordination latency. Using CARLA-Air, we evaluate representative aerial VLA and planning baselines on two complementary diagnostic tasks: moving-platform landing and occlusion-recovery escort. The results show that current aerial VLA models can often track or follow a ground partner, but struggle to convert this single-agent competence into stable cooperative behavior. State prompting provides limited benefit, and naive bidirectional interaction fails to consistently improve performance and can amplify errors for most baselines. These findings suggest that, under the tested text-based cue interfaces, zero-shot cooperative air-ground VLA requires three components beyond the current paradigm: explicit partner-state grounding, low-latency action coordination, and team-level objective alignment. Our code is available at https://github.com/louiszengCN/CarlaAir.
