AetherVision-Bench: An Open-Vocabulary RGB-Infrared Benchmark for Multi-Angle Segmentation across Aerial and Ground Perspectives
This work addresses the practical efficacy of embodied AI systems for autonomous navigation. Its contribution is incremental, since it focuses on benchmarking rather than on developing new methods.
The authors tackle cross-domain generalization in open-vocabulary semantic segmentation by introducing AetherVision-Bench, a benchmark for multi-angle segmentation across aerial and ground perspectives. They use it to evaluate state-of-the-art models and to identify the key factors that affect zero-shot transfer performance.
Open-vocabulary semantic segmentation (OVSS) assigns a label to each pixel in an image based on textual descriptions, leveraging vision-language models such as CLIP. However, OVSS models encounter significant challenges in cross-domain generalization, which hinders their practical efficacy in real-world applications. Embodied AI systems are transforming autonomous navigation for ground vehicles and drones by enhancing their perception abilities. In this study, we present AetherVision-Bench, a benchmark for multi-angle segmentation across aerial and ground perspectives that enables an extensive evaluation of performance across different viewing angles and sensor modalities. We assess state-of-the-art OVSS models on the proposed benchmark and investigate the key factors that impact the performance of zero-shot transfer models. Our work pioneers the creation of a robustness benchmark, offering valuable insights and establishing a foundation for future research.
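
To make the per-pixel labeling idea concrete, the following is a minimal sketch of how an OVSS-style model might assign a class to each pixel: every pixel embedding is compared with the text embedding of each candidate class name, and the most similar class wins. This is an illustration under stated assumptions, not the pipeline of any model evaluated on the benchmark. The function name `assign_open_vocab_labels`, the toy three-dimensional embeddings, and the class names are all hypothetical; in practice the embeddings would come from a CLIP-like image and text encoder.

```python
import numpy as np


def assign_open_vocab_labels(pixel_embeddings, text_embeddings, eps=1e-8):
    """Label each pixel with the index of its most similar text embedding.

    pixel_embeddings: array of shape (H, W, D), one embedding per pixel.
    text_embeddings:  array of shape (K, D), one embedding per class prompt.
    Returns an integer label map of shape (H, W).
    """
    # L2-normalize both sets so the dot product equals cosine similarity.
    p = pixel_embeddings / (np.linalg.norm(pixel_embeddings, axis=-1, keepdims=True) + eps)
    t = text_embeddings / (np.linalg.norm(text_embeddings, axis=-1, keepdims=True) + eps)

    # (H, W, D) @ (D, K) -> (H, W, K): similarity of every pixel to every class.
    similarity = p @ t.T

    # Pick the best-matching class per pixel.
    return similarity.argmax(axis=-1)


# Toy example with hypothetical class names and hand-made embeddings.
class_names = ["road", "building", "vegetation"]
text_emb = np.eye(3)  # stand-in for text-encoder outputs
pixel_emb = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]],
    [[0.0, 0.2, 0.7], [0.6, 0.3, 0.1]],
])
label_map = assign_open_vocab_labels(pixel_emb, text_emb)
named = [[class_names[k] for k in row] for row in label_map]
```

Because the class list is just a set of text prompts, the vocabulary can be changed at inference time without retraining, which is what makes zero-shot transfer across viewpoints and sensor modalities possible to evaluate in the first place.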