CV · May 20

OSGNet with MLLM Reranking @ Ego4D Episodic Memory Challenge 2026

Yisen Feng, Leigang Qu, Haoyu Zhang, Qiaohui Chu, Meng Liu, Xuemeng Song, Weili Guan, Liqiang Nie

arXiv:2605.20818 · 85.8 · Has Code

Predicted impact: top 21% in CV (last 90 days) · Originality: Synthesis-oriented

AI Analysis

For researchers in egocentric video understanding, this work provides a champion solution for temporal localization tasks, though it is an incremental combination of existing methods.

The authors propose a reranking framework that combines OSGNet with a multimodal large language model (MLLM) to improve temporal localization in egocentric videos. It took first place in both tracks of the Ego4D Episodic Memory Challenge at CVPR 2026.

In this report, we present our champion solutions for the Natural Language Queries and GoalStep tracks of the Ego4D Episodic Memory Challenge at CVPR 2026. Both tracks require accurately localizing temporal segments in long, untrimmed egocentric videos. To address these tasks, we propose a reranking-based framework that leverages the strong video-language reasoning capability of multimodal large language models (MLLMs) while preserving the efficiency and candidate recall of conventional localization pipelines. Specifically, we first obtain a set of candidate segments from an existing localization model, OSGNet, and then employ an MLLM to select the segment that best matches the given query, thereby refining the final prediction. Our method achieved first place in both the Natural Language Queries and GoalStep tracks. Our code is available at https://github.com/iLearn-Lab/CVPR25-OSGNet.
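The report does not describe how candidates are passed to the MLLM, so what follows is only a minimal Python sketch of the generate-then-rerank idea under stated assumptions. `Segment`, `temporal_iou`, `recall_at_k`, `rerank` and the `select` callback are hypothetical names. `select` stands in for the MLLM, and the top-k shortlist size and the fallback to the localizer's top-1 window are illustrative choices, not details taken from OSGNet or the authors' code. The sketch also shows why candidate recall matters: the reranker can only return a correct segment if one is already in the shortlist.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Segment:
    """A candidate temporal window (in seconds) with the localizer's confidence."""
    start: float
    end: float
    score: float = 0.0


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two 1-D time intervals."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(candidates: Sequence[Segment], gt: Segment, k: int,
                iou_threshold: float) -> bool:
    """True if any of the top-k candidates (by score) overlaps gt enough."""
    top = sorted(candidates, key=lambda s: s.score, reverse=True)[:k]
    return any(temporal_iou(s, gt) >= iou_threshold for s in top)


# Hypothetical stand-in for the MLLM: given the query and the shortlisted
# windows, return the index of the best-matching window, or None if unsure.
Selector = Callable[[str, Sequence[Segment]], Optional[int]]


def rerank(query: str, candidates: Sequence[Segment], select: Selector,
           top_k: int = 5) -> Segment:
    """Shortlist the localizer's top-k windows, then let `select` pick one.

    Assumed fallback: keep the localizer's top-1 if the selector gives no
    valid index.
    """
    if not candidates:
        raise ValueError("need at least one candidate segment")
    shortlist = sorted(candidates, key=lambda s: s.score, reverse=True)[:top_k]
    choice = select(query, shortlist)
    if choice is None or not 0 <= choice < len(shortlist):
        return shortlist[0]
    return shortlist[choice]


if __name__ == "__main__":
    gt = Segment(40.0, 48.0)
    # Toy localizer output: the top-scoring window is wrong; the correct one is 3rd.
    cands = [Segment(10.0, 18.0, 0.9), Segment(60.0, 66.0, 0.8),
             Segment(41.0, 49.0, 0.7)]

    # Toy heuristic (not an MLLM): pick the window whose midpoint is closest to 45 s.
    def toy_select(query: str, shortlist: Sequence[Segment]) -> Optional[int]:
        return min(range(len(shortlist)),
                   key=lambda i: abs((shortlist[i].start + shortlist[i].end) / 2 - 45.0))

    print(recall_at_k(cands, gt, k=1, iou_threshold=0.5))  # False
    print(recall_at_k(cands, gt, k=5, iou_threshold=0.5))  # True
    best = rerank("where did I put the scissors?", cands, toy_select)
    print(best, round(temporal_iou(best, gt), 3))
```

In the toy run, the localizer's top-1 window misses the ground truth while the correct window sits third in the shortlist. A selector that picks it raises the final prediction's IoU from 0 to about 0.78. This is the gap that a reranking stage is meant to close without retraining the localizer.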

