Reconstruction as a Bridge for Event-Based Visual Question Answering
This work addresses event-based visual question answering for general scene understanding in challenging visual conditions. It is an incremental advance that adapts existing Multimodal Large Language Models (MLLMs) to event data.
The paper integrates event cameras with Multimodal Large Language Models (MLLMs) for visual question answering by using reconstruction as a bridge. It proposes two methods, Frame-based Reconstruction and Tokenization (FRT) and Adaptive Reconstruction and Tokenization (ART), and reports state-of-the-art performance on the new EvQA benchmark, which comprises 1,000 event-Q&A pairs drawn from 22 public datasets.
Integrating event cameras with Multimodal Large Language Models (MLLMs) promises general scene understanding in challenging visual conditions, yet requires navigating a trade-off between preserving the unique advantages of event data and ensuring compatibility with frame-based models. We address this challenge by using reconstruction as a bridge, proposing a straightforward Frame-based Reconstruction and Tokenization (FRT) method and designing an efficient Adaptive Reconstruction and Tokenization (ART) method that leverages event sparsity. For robust evaluation, we introduce EvQA, the first objective, real-world benchmark for event-based MLLMs, comprising 1,000 event-Q&A pairs from 22 public datasets. Our experiments demonstrate that our methods achieve state-of-the-art performance on EvQA, highlighting the significant potential of MLLMs in event-based vision.
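To make the idea of reconstruction as a bridge concrete, the following is a minimal sketch and not the paper's FRT or ART implementation. It assumes events arrive as rows of (x, y, t, polarity) and substitutes naive polarity accumulation for a real reconstruction network. The function names `accumulate_events` and `active_patches`, the patch size, and the `min_events` threshold are all hypothetical. The second function only illustrates how event sparsity could, in principle, reduce the number of image patches passed on for tokenization.

```python
import numpy as np


def accumulate_events(events, height, width):
    """Build a frame by summing signed polarities per pixel.

    ASSUMPTION: `events` is an (N, 4) array of (x, y, t, p) with p in {-1, +1}.
    This is a stand-in for a learned event-to-frame reconstruction.
    """
    frame = np.zeros((height, width), dtype=np.float32)
    x = events[:, 0].astype(np.intp)
    y = events[:, 1].astype(np.intp)
    p = events[:, 3].astype(np.float32)
    # np.add.at accumulates correctly when the same pixel appears more than once.
    np.add.at(frame, (y, x), p)
    return frame


def active_patches(events, height, width, patch=4, min_events=1):
    """Return (row, col) indices of patches containing at least `min_events` events.

    ASSUMPTION: a hypothetical sparsity filter, not the ART selection rule.
    """
    counts = np.zeros((height, width), dtype=np.int64)
    x = events[:, 0].astype(np.intp)
    y = events[:, 1].astype(np.intp)
    np.add.at(counts, (y, x), 1)
    gh, gw = height // patch, width // patch
    # Group pixels into non-overlapping patch x patch blocks and count events per block.
    per_patch = (
        counts[: gh * patch, : gw * patch]
        .reshape(gh, patch, gw, patch)
        .sum(axis=(1, 3))
    )
    return np.argwhere(per_patch >= min_events)


# Toy input: three events on an 8x8 sensor, two of them at the same pixel.
toy_events = np.array(
    [
        [0, 0, 0.0, 1],
        [0, 0, 0.1, 1],
        [5, 6, 0.2, -1],
    ]
)
toy_frame = accumulate_events(toy_events, height=8, width=8)
toy_active = active_patches(toy_events, height=8, width=8, patch=4)
```

In this toy case only two of the four patches contain events, so a sparsity-aware tokenizer would have half as many regions to process.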