CV, Jun 19, 2025

How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering?

arXiv:2506.16450v12 citations, h-index: 37, ICIAP
Originality: Incremental advance
AI Analysis

This paper addresses memory-efficient video question answering for AI systems. The contribution is incremental because it adapts existing models to a new task rather than introducing a new one.

The study tackles Online Episodic-Memory Video Question Answering with off-the-shelf Multimodal Large Language Models and no additional training. Its best configuration reaches 56.0% accuracy with 3.6 kB of storage per minute of video, matching state-of-the-art systems while being 10^4/10^5 times more memory-efficient.

We investigate whether off-the-shelf Multimodal Large Language Models (MLLMs) can tackle Online Episodic-Memory Video Question Answering (OEM-VQA) without additional training. Our pipeline converts a streaming egocentric video into a lightweight textual memory, only a few kilobytes per minute, via an MLLM descriptor module, and answers multiple-choice questions by querying this memory with an LLM reasoner module. On the QAEgo4D-Closed benchmark, our best configuration attains 56.0% accuracy with 3.6 kB per minute storage, matching the performance of dedicated state-of-the-art systems while being 10^4/10^5 times more memory-efficient. Extensive ablations provide insights into the role of each component and design choice, and highlight directions of improvement for future research.
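
The two-stage pipeline described in the abstract (a descriptor that turns streaming clips into text, then a reasoner that answers from that text) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `TextMemory`, `build_memory`, `answer`, the `describe` and `reason` callables, the prompt layout, and the 10-second clip length are all hypothetical names and values chosen for illustration. In practice `describe` would wrap an MLLM and `reason` would wrap an LLM.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable, List, Sequence


@dataclass
class TextMemory:
    """Append-only textual memory built from a video stream (hypothetical)."""
    entries: List[str] = field(default_factory=list)

    def add(self, start_s: float, end_s: float, caption: str) -> None:
        # Store one timestamped caption per clip; the raw frames are discarded.
        self.entries.append(f"[{start_s:.0f}s-{end_s:.0f}s] {caption}")

    def size_bytes(self) -> int:
        # UTF-8 size of the memory, counting one newline per entry.
        return sum(len(e.encode("utf-8")) + 1 for e in self.entries)

    def kb_per_minute(self, duration_s: float) -> float:
        # Storage rate in kB (1 kB = 1000 bytes) per minute of video.
        return self.size_bytes() / 1000 / (duration_s / 60)


def build_memory(
    clips: Iterable[Any],
    describe: Callable[[Any], str],
    clip_len_s: float = 10.0,  # assumed clip length, not taken from the paper
) -> TextMemory:
    """Descriptor stage: caption each incoming clip and keep only the text."""
    memory = TextMemory()
    for i, clip in enumerate(clips):
        memory.add(i * clip_len_s, (i + 1) * clip_len_s, describe(clip))
    return memory


def answer(
    memory: TextMemory,
    question: str,
    options: Sequence[str],
    reason: Callable[[str], int],
) -> int:
    """Reasoner stage: build a text prompt and return the chosen option index."""
    lines = list(memory.entries)
    lines.append(f"Question: {question}")
    lines.extend(f"({i}) {opt}" for i, opt in enumerate(options))
    return reason("\n".join(lines))
```

Because only the captions are stored, the memory footprint depends on caption length and clip rate rather than on video resolution, which is the property the storage figures in the abstract refer to.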
