OVOSE: Open-Vocabulary Semantic Segmentation in Event-Based Cameras
This work addresses the need for adaptable semantic segmentation in autonomous driving using event cameras. Its contribution is incremental, however, because it builds on existing foundation models and techniques.
The paper tackles open-vocabulary semantic segmentation for event cameras, a setting that lacks large-scale data. It introduces OVOSE, an algorithm that uses synthetic event data and knowledge distillation, and reports superior performance on the driving datasets DDD17 and DSEC-Semantic compared with adapted image models and closed-set methods.
Event cameras, known for low-latency operation and superior performance in challenging lighting conditions, are suitable for sensitive computer vision tasks such as semantic segmentation in autonomous driving. However, challenges arise due to limited event-based data and the absence of large-scale segmentation benchmarks. Current works are confined to closed-set semantic segmentation, limiting their adaptability to other applications. In this paper, we introduce OVOSE, the first Open-Vocabulary Semantic Segmentation algorithm for Event cameras. OVOSE leverages synthetic event data and knowledge distillation from a pre-trained image-based foundation model to an event-based counterpart, effectively preserving spatial context and transferring open-vocabulary semantic segmentation capabilities. We evaluate OVOSE on two driving semantic segmentation datasets, DDD17 and DSEC-Semantic, comparing it with existing conventional image open-vocabulary models adapted for event-based data. We also compare OVOSE with state-of-the-art methods designed for closed-set settings in unsupervised domain adaptation for event-based semantic segmentation. OVOSE demonstrates superior performance, showcasing its potential for real-world applications. The code is available at https://github.com/ram95d/OVOSE.
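
The abstract does not state which distillation objective OVOSE uses. Purely as a generic illustration of teacher-to-student knowledge distillation, the sketch below computes a temperature-softened KL divergence between per-pixel class logits from an image-based teacher and an event-based student. The function names, the list-of-lists data layout, and the temperature value are all assumptions made for this example and are not taken from the paper or its repository.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one pixel's class logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_total for x in scaled]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Mean per-pixel KL(teacher || student) on softened distributions.

    Hypothetical layout: each argument is a list of pixels, and each
    pixel is a list of class logits. This is a generic distillation
    loss, not necessarily the objective used by OVOSE.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must cover the same pixels")
    total = 0.0
    for t_pix, s_pix in zip(teacher_logits, student_logits):
        log_p = log_softmax(t_pix, temperature)  # teacher (image model)
        log_q = log_softmax(s_pix, temperature)  # student (event model)
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(teacher_logits)


# Toy example: two pixels, three classes.
teacher = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
student = [[3.0, 2.0, 1.0], [0.5, 0.5, 0.5]]
loss_same = distillation_loss(teacher, teacher)
loss_diff = distillation_loss(teacher, student)
```

In this toy setup the loss is zero when the student reproduces the teacher's logits and positive when the two disagree, which is the signal a student network would be trained to minimize.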