CV AI Nov 14, 2025

AirCopBench: A Benchmark for Multi-drone Collaborative Embodied Perception and Reasoning

Jirong Zha, Yuxuan Fan, Tianyu Zhang, Geng Chen, Yingfeng Chen, Chen Gao, Xinlei Chen

arXiv:2511.11025v1 18.210 citations h-index: 5

Originality: Incremental advance

AI Analysis

AirCopBench fills a gap for AI and robotics researchers and developers: it gives them a benchmark for evaluating MLLMs in complex, egocentric collaborative scenarios under degraded perception conditions. The advance is incremental, since it builds on existing MLLM and benchmark work.

The paper addresses the lack of benchmarks for evaluating multi-agent collaborative perception in multimodal large language models (MLLMs). It introduces AirCopBench, a comprehensive benchmark of 14.6k+ questions drawn from simulator and real-world data. Evaluations reveal significant performance gaps: the best model trails humans by 24.38% on average.

Multimodal Large Language Models (MLLMs) have shown promise in single-agent vision tasks, yet benchmarks for evaluating multi-agent collaborative perception remain scarce. This gap is critical, as multi-drone systems provide enhanced coverage, robustness, and collaboration compared to single-sensor setups. Existing multi-image benchmarks mainly target basic perception tasks using high-quality single-agent images, thus failing to evaluate MLLMs in more complex, egocentric collaborative scenarios, especially under real-world degraded perception conditions. To address these challenges, we introduce AirCopBench, the first comprehensive benchmark designed to evaluate MLLMs in embodied aerial collaborative perception under challenging perceptual conditions. AirCopBench includes 14.6k+ questions derived from both simulator and real-world data, spanning four key task dimensions: Scene Understanding, Object Understanding, Perception Assessment, and Collaborative Decision, across 14 task types. We construct the benchmark using data from challenging degraded-perception scenarios with annotated collaborative events, generating large-scale questions through model-, rule-, and human-based methods under rigorous quality control. Evaluations on 40 MLLMs show significant performance gaps in collaborative perception tasks, with the best model trailing humans by 24.38% on average and exhibiting inconsistent results across tasks. Fine-tuning experiments further confirm the feasibility of sim-to-real transfer in aerial collaborative perception and reasoning.
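
As an illustration of how a "gap to humans, on average" figure could be computed over the four task dimensions, here is a minimal sketch. It is not the authors' evaluation code. The record layout, the function names `accuracy_by_dimension` and `mean_gap`, the choice of an unweighted mean over dimensions, and all toy data are assumptions made for this example; the abstract does not say how its average is taken.

```python
from collections import defaultdict

# The four task dimensions named in the abstract.
DIMENSIONS = (
    "Scene Understanding",
    "Object Understanding",
    "Perception Assessment",
    "Collaborative Decision",
)


def accuracy_by_dimension(records):
    """Return {dimension: accuracy in percent}.

    `records` is an iterable of (dimension, predicted, answer) tuples.
    This layout is hypothetical, not the benchmark's actual format.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for dimension, predicted, answer in records:
        total[dimension] += 1
        correct[dimension] += int(predicted == answer)
    return {d: 100.0 * correct[d] / total[d] for d in total}


def mean_gap(human, model):
    """Unweighted mean of (human - model) accuracy, in percentage points,
    over the dimensions present in both score dictionaries."""
    shared = [d for d in human if d in model]
    return sum(human[d] - model[d] for d in shared) / len(shared)


# Toy, made-up answers purely for illustration (not benchmark data).
toy_model_records = [
    ("Scene Understanding", "A", "A"),
    ("Scene Understanding", "B", "C"),
    ("Object Understanding", "D", "D"),
    ("Perception Assessment", "A", "B"),
    ("Collaborative Decision", "C", "C"),
    ("Collaborative Decision", "B", "A"),
]
# Placeholder human baseline, also made up.
toy_human_scores = {d: 100.0 for d in DIMENSIONS}

model_scores = accuracy_by_dimension(toy_model_records)
gap = mean_gap(toy_human_scores, model_scores)
print(model_scores)
print(f"mean gap to human baseline: {gap:.2f} points")
```

Averaging per dimension rather than per question keeps a dimension with many questions from dominating the result. Whether the reported 24.38% uses this weighting is not stated in the abstract.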
