CV | Feb 18

ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding

arXiv:2602.16412v1 | 3 citations | h-index: 4
Originality: Incremental advance
AI Analysis

This work addresses the computational inefficiency of video MLLMs, a concern for both researchers and practitioners, but it is incremental because it builds on existing motion representation methods.

The paper tackles long-video understanding with multimodal large language models. It proposes ReMoRa, which uses refined motion representations to avoid processing full RGB frames, and reports that it outperformed baselines on benchmarks such as LongVideoBench, NExT-QA, and MLVU.

While multimodal large language models (MLLMs) have shown remarkable success across a wide range of tasks, long-form video understanding remains a significant challenge. In this study, we focus on video understanding by MLLMs. This task is challenging because processing a full stream of RGB frames is computationally intractable and highly redundant, as self-attention has quadratic complexity in sequence length. In this paper, we propose ReMoRa, a video MLLM that processes videos by operating directly on their compressed representations. A sparse set of RGB keyframes is retained for appearance, while temporal dynamics are encoded as a motion representation, removing the need for sequential RGB frames. These motion representations act as a compact proxy for optical flow, capturing temporal dynamics without full frame decoding. To address the noise and low fidelity of block-based motion, we introduce a module that denoises it and generates a fine-grained motion representation. Furthermore, our model compresses these features in a way that scales linearly with sequence length. We demonstrate the effectiveness of ReMoRa through extensive experiments across a comprehensive suite of long-video understanding benchmarks. ReMoRa outperformed baseline methods on multiple challenging benchmarks, including LongVideoBench, NExT-QA, and MLVU.
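The cost argument in the abstract can be illustrated with a small back-of-the-envelope sketch. It is not the paper's implementation: the function names, the keyframe interval, and the per-frame token counts are all hypothetical values chosen for illustration. It only shows how replacing most RGB frames with a few motion tokens shrinks the sequence, and how the number of query-key pairs in full self-attention grows with the square of that sequence length.

```python
def attention_pair_count(num_tokens: int) -> int:
    """Query-key pairs scored by full self-attention over num_tokens tokens."""
    return num_tokens * num_tokens


def dense_rgb_tokens(num_frames: int, tokens_per_frame: int) -> int:
    """Sequence length when every frame is encoded as RGB patch tokens."""
    return num_frames * tokens_per_frame


def keyframe_plus_motion_tokens(
    num_frames: int,
    keyframe_interval: int,
    tokens_per_frame: int,
    motion_tokens_per_frame: int,
) -> int:
    """Sequence length when only sparse keyframes keep RGB tokens.

    All other frames are assumed to contribute a small, fixed number of
    motion tokens (a hypothetical budget, not a figure from the paper).
    """
    num_keyframes = -(-num_frames // keyframe_interval)  # ceiling division
    num_motion_frames = num_frames - num_keyframes
    return (
        num_keyframes * tokens_per_frame
        + num_motion_frames * motion_tokens_per_frame
    )


if __name__ == "__main__":
    # Hypothetical settings, for illustration only.
    frames, rgb_tokens, interval, motion_tokens = 1024, 196, 16, 4

    dense = dense_rgb_tokens(frames, rgb_tokens)
    compact = keyframe_plus_motion_tokens(
        frames, interval, rgb_tokens, motion_tokens
    )

    print("dense RGB tokens:       ", dense)
    print("keyframe+motion tokens: ", compact)
    print("attention pairs (dense):  ", attention_pair_count(dense))
    print("attention pairs (compact):", attention_pair_count(compact))
```

Under these assumed numbers the sequence is about 12 times shorter, so the quadratic attention term shrinks by roughly 150 times. The actual savings depend on the model's real token budgets, which the abstract does not state.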

