Energy-Aware Imitation Learning for Steering Prediction Using Events and Frames
This work aims to improve autonomous driving safety by mitigating the limitations of frame-based vision sensors, though the contribution is incremental because it builds on existing multimodal fusion methods.
The paper tackles steering prediction in autonomous driving by combining event cameras with frame-based cameras to address inaccuracies caused by high-speed motion and challenging lighting, and it reports state-of-the-art performance on the DDD20 and DRFuser datasets.
In autonomous driving, relying solely on frame-based cameras can lead to inaccuracies caused by factors such as long exposure times, high-speed motion, and challenging lighting conditions. To address these issues, we incorporate a bio-inspired vision sensor, the event camera. Unlike conventional cameras, event cameras capture sparse, asynchronous events, providing a complementary modality that mitigates these challenges. In this work, we propose an energy-aware imitation learning framework for steering prediction that leverages both events and frames. Specifically, we design an Energy-driven Cross-modality Fusion Module (ECFM) and an energy-aware decoder to produce reliable and safe predictions. Extensive experiments on two public real-world datasets, DDD20 and DRFuser, demonstrate that our method outperforms existing state-of-the-art (SOTA) approaches. The code and trained models will be released upon acceptance.
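To make the notion of sparse, asynchronous events concrete, the following minimal sketch shows one common way an event stream could be binned into a dense, frame-aligned representation before any fusion with camera frames. It is an illustration only and not the preprocessing used in this work: the `(timestamp, x, y, polarity)` tuple layout, the function name `accumulate_events`, and the two-channel count representation are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical event layout (an assumption for this sketch):
# (timestamp in seconds, x pixel, y pixel, polarity of +1 or -1).
Event = Tuple[float, int, int, int]
Grid = List[List[int]]


def accumulate_events(
    events: List[Event],
    width: int,
    height: int,
    t_start: float,
    t_end: float,
) -> Tuple[Grid, Grid]:
    """Count events per pixel in [t_start, t_end), one grid per polarity."""
    on_counts = [[0] * width for _ in range(height)]
    off_counts = [[0] * width for _ in range(height)]
    for t, x, y, polarity in events:
        # Keep only events inside the time window of the paired frame.
        if not (t_start <= t < t_end):
            continue
        # Skip events that fall outside the sensor area.
        if not (0 <= x < width and 0 <= y < height):
            continue
        if polarity > 0:
            on_counts[y][x] += 1   # brightness increase
        else:
            off_counts[y][x] += 1  # brightness decrease
    return on_counts, off_counts


# Toy stream: three events inside a 10 ms window, one outside it.
events = [
    (0.001, 1, 0, 1),
    (0.002, 1, 0, 1),
    (0.003, 2, 1, -1),
    (0.050, 0, 0, 1),
]
on, off = accumulate_events(events, width=3, height=2, t_start=0.0, t_end=0.01)
print(on)   # [[0, 2, 0], [0, 0, 0]]
print(off)  # [[0, 0, 0], [0, 0, 1]]
```

The two count grids share the pixel layout of a conventional frame, which is what lets a fusion module treat events and frames as aligned inputs. Other representations, such as time surfaces or voxel grids, are equally plausible choices.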