CV, Mar 10, 2025

LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs

arXiv:2503.06934v1, 4 citations, h-index: 3

Originality: Highly original

AI Analysis

This work addresses weak temporal coordination between language and vision in scene understanding for large multimodal models. It is an incremental improvement built on a novel frame-event fusion method.

The paper tackles fine-grained spatiotemporal reasoning in large multimodal models by introducing LLaFEA, which uses event cameras for temporally dense perception and fuses frame and event features. The authors report improved alignment and performance, validated through experiments on a real-world frame-event dataset.

Large multimodal models (LMMs) excel in scene understanding but struggle with fine-grained spatiotemporal reasoning due to weak alignment between linguistic and visual representations. Existing methods map textual positions and durations into the visual space encoded from frame-based videos, but suffer from temporal sparsity that limits language-vision temporal coordination. To address this issue, we introduce LLaFEA (Large Language and Frame-Event Assistant) to leverage event cameras for temporally dense perception and frame-event fusion. Our approach employs a cross-attention mechanism to integrate complementary spatial and temporal features, followed by self-attention matching for global spatio-temporal associations. We further embed textual position and duration tokens into the fused visual space to enhance fine-grained alignment. This unified framework ensures robust spatio-temporal coordinate alignment, enabling LMMs to interpret scenes at any position and any time. In addition, we construct a dataset of real-world frames-events with coordinate instructions and conduct extensive experiments to validate the effectiveness of the proposed method.
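
The fusion order described in the abstract (cross-attention between frame and event features, then self-attention over the combined tokens) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, token counts, feature dimension, and residual connections are assumptions chosen for illustration, and the learned query/key/value projections a real model would use are omitted.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(queries[0]))
    weights = [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]
    dim = len(values[0])
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(dim)]
        for row in weights
    ]
    return outputs, weights


def add(a, b):
    """Element-wise sum of two token lists (a residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def fuse_frame_event(frame_tokens, event_tokens):
    """Hypothetical fusion: cross-attention, then self-attention."""
    # Cross-attention: each modality queries the other one.
    frame_ctx, _ = attention(frame_tokens, event_tokens, event_tokens)
    event_ctx, _ = attention(event_tokens, frame_tokens, frame_tokens)
    # Concatenate the enriched tokens from both modalities.
    tokens = add(frame_tokens, frame_ctx) + add(event_tokens, event_ctx)
    # Self-attention: every token attends to every fused token.
    matched, weights = attention(tokens, tokens, tokens)
    return add(tokens, matched), weights


random.seed(0)
dim = 8  # assumed feature dimension
# Few frame tokens (temporally sparse), many event tokens (temporally dense).
frame_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(4)]
event_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(16)]

fused, weights = fuse_frame_event(frame_tokens, event_tokens)
print(len(fused), len(fused[0]))  # 20 8
```

In this sketch the 4 frame tokens and 16 event tokens stand in for sparse frame features and dense event features. The self-attention step gives every fused token a weight over all 20 tokens, which is one way to read the "global spatio-temporal associations" mentioned in the abstract.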
