CV · RO · May 8, 2024

OpenESS: Event-based Semantic Scene Understanding with Open Vocabularies

arXiv:2405.05259v1 · 42 citations · h-index: 23 · CVPR
Originality: Highly original
AI Analysis

This work addresses the scalability issue in event camera sensing for semantic scene understanding, offering an annotation-efficient solution for robotics and autonomous systems.

The paper tackles the challenge of event-based semantic segmentation by introducing OpenESS, which transfers CLIP knowledge from image-text pairs to event streams without using event or frame labels, achieving 53.93% and 43.31% mIoU on DDD17 and DSEC-Semantic benchmarks.
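For readers unfamiliar with the metric, mean intersection-over-union (mIoU) averages the per-class overlap between predicted and ground-truth labels. The following is a minimal sketch on flat label lists, not the benchmarks' evaluation code; the function name `mean_iou` and the choice to skip classes absent from both prediction and target are assumptions made for illustration. Benchmark toolkits typically accumulate a confusion matrix over the whole dataset and may ignore specific labels.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the target.

    pred, target: equal-length sequences of integer class ids (one per pixel).
    """
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth say class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth says class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumption: classes absent from both are skipped
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # Toy example with two classes and four pixels.
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```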

Event-based semantic segmentation (ESS) is a fundamental yet challenging task for event camera sensing. The difficulties in interpreting and annotating event data limit its scalability. While domain adaptation from images to event data can help to mitigate this issue, there exist data representational differences that require additional effort to resolve. In this work, for the first time, we synergize information from image, text, and event-data domains and introduce OpenESS to enable scalable ESS in an open-world, annotation-efficient manner. We achieve this goal by transferring the semantically rich CLIP knowledge from image-text pairs to event streams. To pursue better cross-modality adaptation, we propose a frame-to-event contrastive distillation and a text-to-event semantic consistency regularization. Experimental results on popular ESS benchmarks showed our approach outperforms existing methods. Notably, we achieve 53.93% and 43.31% mIoU on DDD17 and DSEC-Semantic without using either event or frame labels.
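The abstract names a frame-to-event contrastive distillation but does not give its formulation here. As a rough illustration of the general idea only, the sketch below uses a generic InfoNCE-style loss that pulls each event feature toward the frame feature at the same index and pushes it away from the others. The function names, the temperature value of 0.07, and the pairing of features by index are all assumptions for this example, not details confirmed by the paper.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def contrastive_distillation_loss(event_feats, frame_feats, temperature=0.07):
    """Generic InfoNCE loss; event_feats[i] is the positive for frame_feats[i].

    event_feats: features from the (hypothetical) event-branch student.
    frame_feats: features from the (hypothetical) frozen frame-branch teacher.
    """
    ev = [l2_normalize(v) for v in event_feats]
    fr = [l2_normalize(v) for v in frame_feats]
    total = 0.0
    for i, e in enumerate(ev):
        # Cosine similarity of this event feature to every frame feature.
        logits = [sum(a * b for a, b in zip(e, f)) / temperature for f in fr]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching index as the target.
        total += log_denom - logits[i]
    return total / len(ev)


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[0.9, 0.1], [0.1, 0.9]]
    swapped = [[0.1, 0.9], [0.9, 0.1]]
    print(contrastive_distillation_loss(aligned, frames))
    print(contrastive_distillation_loss(swapped, frames))
```

In this toy setup the aligned event features yield a lower loss than the swapped ones, which is the behavior a distillation objective of this kind relies on. In practice such a loss would be computed on batched tensors in a deep learning framework rather than on Python lists.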

Code Implementations: 1 repo