CV · Apr 7, 2024

UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection

arXiv:2404.04933v2 · 19 citations · h-index: 9 · Has Code · ECCV
Originality: Highly original
AI Analysis

This work studies the synergy between two video understanding tasks, Temporal Action Detection and Moment Retrieval, and reports that learning them together lets each task outperform its separately trained counterpart.

The paper unifies Temporal Action Detection (TAD) and Moment Retrieval (MR) in a single architecture called UniMD. UniMD maps the inputs of both tasks into a common embedding space and uses query-dependent decoders to produce classification scores and temporal segments. It achieves state-of-the-art results on three paired datasets: Ego4D, Charades-STA, and ActivityNet.

Temporal Action Detection (TAD) focuses on detecting pre-defined actions, while Moment Retrieval (MR) aims to identify the events described by open-ended natural language within untrimmed videos. Although the two tasks focus on different events, we observe that they are closely connected. For instance, most descriptions in MR involve multiple actions from TAD. In this paper, we investigate the potential synergy between TAD and MR. First, we propose a unified architecture, termed Unified Moment Detection (UniMD), for both TAD and MR. It transforms the inputs of the two tasks, namely actions for TAD or events for MR, into a common embedding space, and utilizes two novel query-dependent decoders to generate a uniform output of classification scores and temporal segments. Second, we explore the efficacy of two task fusion learning approaches, pre-training and co-training, in order to enhance the mutual benefits between TAD and MR. Extensive experiments demonstrate that the proposed task fusion learning scheme enables the two tasks to help each other and outperform their separately trained counterparts. Impressively, UniMD achieves state-of-the-art results on three paired datasets: Ego4D, Charades-STA, and ActivityNet. Our code is available at https://github.com/yingsen1/UniMD.
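The idea of a single query interface for both tasks can be illustrated with a toy sketch. This is not the authors' implementation: the hash-based `embed_query` is a hypothetical stand-in for a learned text encoder, the dot-product scoring and threshold-based `decode_segments` replace the paper's learned query-dependent decoders, and all function names and parameters here are assumptions made for illustration. It only shows how an action name (TAD) and a free-form sentence (MR) could pass through the same path and yield the same output form of scored temporal segments.

```python
import hashlib
import math


def embed_query(text, dim=8):
    """Hypothetical text encoder: maps any string to a deterministic unit vector."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    vec = [b / 255.0 - 0.5 for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def score_frames(frame_feats, query_vec, scale=5.0):
    """Query-dependent classification: one sigmoid score per time step."""
    scores = []
    for frame in frame_feats:
        sim = sum(f * q for f, q in zip(frame, query_vec))
        scores.append(1.0 / (1.0 + math.exp(-scale * sim)))
    return scores


def decode_segments(scores, threshold=0.5):
    """Group consecutive above-threshold steps into (start, end, score) segments.

    `end` is exclusive; the segment score is the mean of its per-step scores.
    This thresholding is a simplification standing in for a learned regressor.
    """
    segments = []
    start = None
    for t, s in enumerate(scores):
        if s >= threshold and start is None:
            start = t
        elif s < threshold and start is not None:
            chunk = scores[start:t]
            segments.append((start, t, sum(chunk) / len(chunk)))
            start = None
    if start is not None:
        chunk = scores[start:]
        segments.append((start, len(scores), sum(chunk) / len(chunk)))
    return segments


def detect(frame_feats, query_text, dim=8):
    """Shared entry point: the query may be a TAD class name or an MR sentence."""
    query_vec = embed_query(query_text, dim)
    return decode_segments(score_frames(frame_feats, query_vec))
```

In this sketch, `detect(frames, "open door")` and `detect(frames, "a person opens the door and walks in")` follow exactly the same code path; only the query text differs, which is the property the unified formulation relies on.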

Code Implementations: 1 repo
