EvPlug: Learn a Plug-and-Play Module for Event and Image Fusion
This work addresses multi-modal fusion for vision tasks such as object detection and segmentation. It offers a practical solution for applications that require high temporal resolution, though the contribution is incremental in that it builds on existing RGB-based models.
The paper tackles the challenge of integrating event cameras with RGB cameras for vision tasks. It proposes EvPlug, a plug-and-play fusion module trained on unlabeled event-image pairs to enhance existing RGB-based models. The result is improved robustness in high dynamic range and fast motion scenes, without altering the original model's structure.
Event cameras and RGB cameras exhibit complementary characteristics in imaging: the former offer high dynamic range (HDR) and high temporal resolution, while the latter provide rich texture and color information. This makes the integration of event cameras into middle- and high-level RGB-based vision tasks highly promising. However, challenges arise in multi-modal fusion, data annotation, and model architecture design. In this paper, we propose EvPlug, which learns a plug-and-play event and image fusion module under the supervision of an existing RGB-based model. The learned fusion module integrates event streams with image features in the form of a plug-in, making the RGB-based model robust to HDR and fast motion scenes while enabling high temporal resolution inference. Our method requires only unlabeled event-image pairs (no pixel-wise alignment required) and does not alter the structure or weights of the RGB-based model. We demonstrate the superiority of EvPlug in several vision tasks such as object detection, semantic segmentation, and 3D hand pose estimation.
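The training setup described in the abstract (a frozen RGB model supervising a small trainable fusion plug-in) can be illustrated with a toy sketch in plain Python. This is not the paper's implementation: the linear "head", the gated residual fusion, the squared-error consistency loss, the synthetic data, and every name below (frozen_rgb_head, fuse, train_plugin, make_pairs, mean_loss) are assumptions made purely for illustration. The only properties taken from the abstract are that no labels are used and that the RGB model's weights are never changed.

```python
import random


def frozen_rgb_head(features, weights):
    """Stand-in for the pretrained RGB model: a fixed linear head (assumption)."""
    return sum(w * f for w, f in zip(weights, features))


def fuse(image_feat, event_feat, gate):
    """Hypothetical plug-in: add gated event features to the image features."""
    return [i + g * e for i, e, g in zip(image_feat, event_feat, gate)]


def mean_loss(pairs, weights, gate):
    """Mean squared gap between the fused prediction and the RGB pseudo-label."""
    total = 0.0
    for degraded, events, reference in pairs:
        target = frozen_rgb_head(reference, weights)
        pred = frozen_rgb_head(fuse(degraded, events, gate), weights)
        total += (pred - target) ** 2
    return total / len(pairs)


def train_plugin(pairs, weights, steps=200, lr=0.05):
    """Fit only the gate by SGD; `weights` are read but never modified."""
    gate = [0.0] * len(weights)
    for _ in range(steps):
        for degraded, events, reference in pairs:
            # Pseudo-label from the frozen RGB model: no human annotation.
            target = frozen_rgb_head(reference, weights)
            pred = frozen_rgb_head(fuse(degraded, events, gate), weights)
            err = pred - target
            # d(err^2)/d(gate_k) = 2 * err * weights_k * events_k
            gate = [g - lr * 2 * err * w * e
                    for g, w, e in zip(gate, weights, events)]
    return gate


def make_pairs(rng, count, dim):
    """Synthetic data (assumption): the events carry exactly what the
    degraded image features are missing relative to a clean reference."""
    pairs = []
    for _ in range(count):
        reference = [rng.uniform(-1, 1) for _ in range(dim)]
        events = [rng.uniform(-1, 1) for _ in range(dim)]
        degraded = [r - e for r, e in zip(reference, events)]
        pairs.append((degraded, events, reference))
    return pairs


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [0.5, 1.0, 0.8]
    pairs = make_pairs(rng, 8, 3)
    gate = train_plugin(pairs, weights)
    print("loss before:", mean_loss(pairs, weights, [0.0] * 3))
    print("loss after: ", mean_loss(pairs, weights, gate))
```

In this toy, the gate is the only trainable quantity, which mirrors the plug-in idea: the RGB head can be used with or without the fusion step, and its weights are identical before and after training. A real fusion module would operate on feature maps rather than short vectors, and the way the paper derives its supervisory signal may differ from the clean-reference assumption used here.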