CV · Mar 26

SlotVTG: Object-Centric Adapter for Generalizable Video Temporal Grounding

arXiv:2603.25733 · 58.4 · h-index: 3
AI Analysis

SlotVTG addresses the out-of-domain generalization problem in video temporal grounding for multimodal large language models, and does so at lower cost than existing object-centric approaches, which require retraining the full multi-stage pipeline.

The paper tackles poor out-of-domain generalization in video temporal grounding, which arises when fine-tuning leads models to learn dataset-specific shortcuts. It proposes SlotVTG, a lightweight object-centric adapter that significantly improves OOD robustness while maintaining competitive in-domain performance with minimal overhead.

Multimodal Large Language Models (MLLMs) have shown strong performance on Video Temporal Grounding (VTG). However, their coarse recognition capabilities are insufficient for fine-grained temporal understanding, making task-specific fine-tuning indispensable. This fine-tuning causes models to memorize dataset-specific shortcuts rather than faithfully grounding in the actual visual content, leading to poor Out-of-Domain (OOD) generalization. Object-centric learning offers a promising remedy by decomposing scenes into entity-level representations, but existing approaches require re-running the entire multi-stage training pipeline from scratch. We propose SlotVTG, a framework that steers MLLMs toward object-centric, input-grounded visual reasoning at minimal cost. SlotVTG introduces a lightweight slot adapter that decomposes visual tokens into abstract slots via slot attention and reconstructs the original sequence, where objectness priors from a self-supervised vision model encourage semantically coherent slot formation. Cross-domain evaluation on standard VTG benchmarks demonstrates that our approach significantly improves OOD robustness while maintaining competitive In-Domain (ID) performance with minimal overhead.
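
The decompose-then-reconstruct step described in the abstract can be illustrated with a toy slot-attention loop. This is only a minimal sketch under stated assumptions, not the paper's implementation: the function name `slot_adapter`, the random projection matrices standing in for learned weights, the plain weighted-mean slot update (standard slot attention uses a GRU and an MLP here), and the attention-based reconstruction are all hypothetical choices made for illustration. The objectness prior from a self-supervised vision model is not modeled.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def slot_adapter(tokens, num_slots=4, iters=3, seed=0):
    """Toy sketch: visual tokens (N, D) -> slots (K, D) -> reconstruction (N, D).

    All weights are random placeholders (assumption); a real adapter
    would learn them during fine-tuning.
    """
    rng = np.random.default_rng(seed)
    n, d = tokens.shape
    # Hypothetical projections standing in for learned query/key/value weights.
    w_q, w_k, w_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
    slots = rng.normal(size=(num_slots, d))
    k, v = tokens @ w_k, tokens @ w_v

    for _ in range(iters):
        q = slots @ w_q
        logits = k @ q.T / np.sqrt(d)   # (N, K) token-to-slot affinities
        attn = softmax(logits, axis=1)  # softmax over slots: slots compete per token
        # Normalize over tokens so each slot becomes a weighted mean of values.
        weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        slots = weights.T @ v           # (K, D); simplified update, no GRU/MLP

    # Reconstruct the original sequence length: each token reads from the slots.
    read = softmax(k @ (slots @ w_q).T / np.sqrt(d), axis=1)  # (N, K)
    recon = read @ slots                                      # (N, D)
    return slots, attn, recon


# Usage: 16 visual tokens of width 8, compressed into 4 slots and expanded back.
tokens = np.random.default_rng(1).normal(size=(16, 8))
slots, attn, recon = slot_adapter(tokens)
```

The softmax over the slot axis is what makes slots compete for tokens, which is the mechanism that pushes each slot toward a distinct entity. The reconstruction keeps the sequence length unchanged, so in principle such an adapter can sit between a vision encoder and the language model without altering the rest of the pipeline.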

