CV · LG · May 5

Rethinking Temporal Consistency in Video Object-Centric Learning: From Prediction to Correspondence

arXiv:2605.03650 · 67.8
Predicted impact: top 46% in CV · last 90 days
Originality: Incremental advance
AI Analysis

The paper simplifies video object-centric learning by eliminating learned dynamics, offering a computationally cheaper way to maintain temporal consistency.

The paper shows that learned temporal prediction in video object-centric learning can be replaced by deterministic bipartite matching on frozen self-supervised features, achieving competitive performance on MOVi-D, MOVi-E, and YouTube-VIS with zero learnable parameters for temporal modeling.

The de facto approach in video object-centric learning maintains temporal consistency through learned dynamics modules that predict future object representations, called slots. We demonstrate that these predictors function as expensive approximations of discrete correspondence problems. Modern self-supervised vision backbones already encode instance-discriminative features that distinguish objects reliably. Exploiting these features eliminates the need for learned temporal prediction. We introduce Grounded Correspondence, a framework that replaces learned transition functions with deterministic bipartite matching. Slots initialize from salient regions in frozen backbone features. Frame-to-frame identity is maintained through Hungarian matching on slot representations. The approach requires zero learnable parameters for temporal modeling yet achieves competitive performance on MOVi-D, MOVi-E, and YouTube-VIS. Project page: https://magenta-sherbet-85b101.netlify.app/
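
The frame-to-frame matching step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `match_slots`, the cosine-distance cost, and the assumption that both frames have the same number of slots are illustrative choices. The only library call it relies on is SciPy's `linear_sum_assignment`, which solves the same assignment problem as the Hungarian algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_slots(prev_slots, curr_slots):
    """Reorder curr_slots so that row i corresponds to prev_slots row i.

    Both inputs are (num_slots, dim) arrays with the same num_slots.
    The cosine-distance cost is an assumption made for this sketch.
    """
    # L2-normalise so the dot product becomes cosine similarity.
    prev = prev_slots / np.linalg.norm(prev_slots, axis=1, keepdims=True)
    curr = curr_slots / np.linalg.norm(curr_slots, axis=1, keepdims=True)

    # cost[i, j] is the cosine distance between previous slot i and current slot j.
    cost = 1.0 - prev @ curr.T

    # Minimum-cost one-to-one assignment; no learnable parameters are involved.
    row_ind, col_ind = linear_sum_assignment(cost)

    # perm[i] is the index of the current slot assigned to previous slot i.
    perm = np.empty(len(row_ind), dtype=int)
    perm[row_ind] = col_ind
    return curr_slots[perm], perm
```

Applying `match_slots` to each consecutive pair of frames, and carrying the reordered slots forward, keeps slot index i attached to the same object across the clip as long as the slot features stay discriminative.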

