CV · AI · Jan 23, 2025

EventVL: Understand Event Streams via Multimodal Large Language Model

arXiv:2501.13707v2 · 12 citations · h-index: 5
Originality: Incremental advance
AI Analysis

This work addresses the need for explicit semantic understanding in event streams for applications like driving or human motion, representing an incremental advancement in event vision.

The authors tackled the problem of insufficient semantic understanding in event-based vision-language models by proposing EventVL, a generative multimodal large language model that significantly outperforms existing baselines in event captioning and scene description tasks.

Event-based Vision-Language Models (VLMs) have recently made good progress on practical vision tasks. However, most of these works merely use CLIP and focus on traditional perception tasks, which prevents the model from explicitly understanding the semantics and context of event streams. To address this deficiency, we propose EventVL, the first generative event-based MLLM (Multimodal Large Language Model) framework for explicit semantic understanding. Specifically, to bridge the data gap between the semantics of different modalities, we first annotate a large event-image/video-text dataset containing almost 1.4 million high-quality data pairs, which enables effective learning across various scenes, e.g., driving scenes or human motion. We then design an Event Spatiotemporal Representation that captures comprehensive information by aggregating and segmenting the event stream in diverse ways. To further promote a compact semantic space, we introduce Dynamic Semantic Alignment, which improves and completes the sparse semantic spaces of events. Extensive experiments show that EventVL significantly surpasses existing MLLM baselines on event captioning and scene description generation tasks. We hope our research contributes to the development of the event vision community.
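As a rough illustration of what "aggregating and segmenting the event stream" can mean in practice, the sketch below splits a stream of (x, y, t, polarity) events into equal-duration segments and accumulates per-pixel counts for each polarity. This is a generic event-to-frame conversion, not EventVL's actual Event Spatiotemporal Representation; the function name `events_to_frames`, the 0/1 polarity encoding, and the equal-duration binning are all assumptions made for the example.

```python
import numpy as np


def events_to_frames(x, y, t, p, height, width, num_segments):
    """Hypothetical helper: bin events into time segments and count them.

    Returns an array of shape (num_segments, 2, height, width), where
    channel 0 counts polarity-0 events and channel 1 counts polarity-1 events.
    """
    x = np.asarray(x, dtype=np.int64)
    y = np.asarray(y, dtype=np.int64)
    t = np.asarray(t, dtype=np.float64)
    p = np.asarray(p, dtype=np.int64)  # assumed encoding: 0 = off, 1 = on

    frames = np.zeros((num_segments, 2, height, width), dtype=np.int32)
    if t.size == 0:
        return frames

    t0 = t.min()
    span = t.max() - t0
    if span == 0:
        # All events share one timestamp: place them in the first segment.
        seg = np.zeros(t.shape, dtype=np.int64)
    else:
        # Map each timestamp to a segment index; the latest event would land
        # at index num_segments, so clamp it into the last segment.
        seg = ((t - t0) / span * num_segments).astype(np.int64)
        seg = np.minimum(seg, num_segments - 1)

    # np.add.at accumulates correctly when several events hit the same cell.
    np.add.at(frames, (seg, p, y, x), 1)
    return frames
```

Each segment's two-channel count map can then be treated like an image frame, which is one common way to feed event data to image or video encoders.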

