CV · AI · LG · Oct 16, 2025

MaskCaptioner: Learning to Jointly Segment and Caption Object Trajectories in Videos

DeepMind
arXiv:2510.14904v21 citations · h-index: 11
AI Analysis

This paper addresses the challenge of generating detailed captions for object trajectories in videos, which matters for video understanding and analysis applications. The contribution appears incremental, however, since it builds on existing datasets and methods.

The paper tackles the problem of dense video object captioning by proposing MaskCaptioner, an end-to-end model that jointly detects, segments, tracks, and captions object trajectories, achieving state-of-the-art results on benchmarks like VidSTG, VLN, and BenSMOT.

Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring the ability to understand spatio-temporal details and describe them in natural language. Due to the complexity of the task and the high cost associated with manual annotation, previous approaches resort to disjoint training strategies, potentially leading to suboptimal performance. To circumvent this issue, we propose to generate captions about spatio-temporally localized entities leveraging a state-of-the-art VLM. By extending the LVIS and LV-VIS datasets with our synthetic captions (LVISCap and LV-VISCap), we train MaskCaptioner, an end-to-end model capable of jointly detecting, segmenting, tracking and captioning object trajectories. Moreover, with pretraining on LVISCap and LV-VISCap, MaskCaptioner achieves state-of-the-art DVOC results on three existing benchmarks, VidSTG, VLN and BenSMOT. The datasets and code are available at https://www.gabriel.fiastre.fr/maskcaptioner/.

