Memory Helps, but Confabulation Misleads: Understanding Streaming Events in Videos with MLLMs
This paper addresses a specific problem in video understanding and is aimed at AI researchers; its contribution is an incremental improvement in memory handling for streaming events.
The paper tackles the problem of processing streaming videos with Multimodal Large Language Models (MLLMs). It shows that using past events as memory improves contextual understanding, but because that memory is built from the model's own predictions of preceding events, it can carry misinformation, leading to confabulation and degraded performance. The authors propose a confabulation-aware memory modification method to mitigate this issue. The abstract provides no concrete numbers.
Multimodal large language models (MLLMs) have demonstrated strong performance in understanding videos holistically, yet their ability to process streaming videos, in which a video is treated as a sequence of visual events, remains underexplored. Intuitively, leveraging past events as memory can enrich contextual and temporal understanding of the current event. In this paper, we show that leveraging memories as context helps MLLMs better understand video events. However, because such memories rely on predictions of preceding events, they may contain misinformation, leading to confabulation and degraded performance. To address this, we propose a confabulation-aware memory modification method that mitigates confabulated memory for memory-enhanced event understanding.
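
The memory loop the abstract describes can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract does not specify how confabulated memory is detected or modified, so the confidence threshold used here is a stand-in assumption. The names `MemoryEntry`, `understand_stream`, `toy_model`, and `min_confidence` are all hypothetical, and `toy_model` is a stub in place of a real MLLM call.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class MemoryEntry:
    """One remembered event: the model's own prediction, not ground truth."""
    event_index: int
    description: str
    confidence: float  # assumed self-reported score in [0, 1]


# Hypothetical model interface: (current event, memory context) -> (description, confidence)
EventModel = Callable[[str, List[str]], Tuple[str, float]]


def understand_stream(
    events: List[str],
    model: EventModel,
    min_confidence: float = 0.5,
) -> Tuple[List[str], List[MemoryEntry]]:
    """Process events in order, feeding earlier predictions back as memory.

    Assumption: entries below `min_confidence` are treated as possibly
    confabulated and withheld from the context. This simple filter stands in
    for the paper's memory modification step, which the abstract does not detail.
    """
    memory: List[MemoryEntry] = []
    outputs: List[str] = []
    for index, event in enumerate(events):
        # Only memories judged reliable are passed to the model as context.
        context = [m.description for m in memory if m.confidence >= min_confidence]
        description, confidence = model(event, context)
        memory.append(MemoryEntry(index, description, confidence))
        outputs.append(description)
    return outputs, memory


def toy_model(event: str, context: List[str]) -> Tuple[str, float]:
    """Stub in place of an MLLM: reports how much context it received."""
    confidence = 0.2 if "blurry" in event else 0.9
    return f"{event} | ctx={len(context)}", confidence


if __name__ == "__main__":
    stream = ["open door", "blurry motion", "pour coffee"]
    filtered, _ = understand_stream(stream, toy_model, min_confidence=0.5)
    unfiltered, _ = understand_stream(stream, toy_model, min_confidence=0.0)
    print(filtered)
    print(unfiltered)
```

With `min_confidence=0.0` every earlier prediction is reused as memory, which corresponds to the plain memory-as-context setting; raising the threshold withholds the low-confidence entry from later events.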