CV · May 17, 2024

Top-Down Guidance for Learning Object-Centric Representations

arXiv:2405.10598v3 · 2 citations · h-index: 31 · IJCAI
Originality: Incremental advance
AI Analysis

This work addresses a limitation of current OCL models: training only by reconstruction does little to help them distinguish objects. It aims to enable better performance in complex downstream tasks, though it appears incremental because it builds on existing OCL frameworks.

The paper tackles suboptimal object-centric representations in Object-Centric Learning (OCL) by proposing TDGNet, which adds a top-down pathway to improve these representations. The authors report that it outperforms current object-centric models on multiple datasets, and they validate it on robotics-related downstream tasks, including video prediction and visual planning.

Humans' innate ability to decompose scenes into objects allows for efficient understanding, predicting, and planning. In light of this, Object-Centric Learning (OCL) attempts to endow networks with similar capabilities, learning to represent scenes as compositions of objects. However, existing OCL models only learn through reconstructing the input images, which does not assist the model in distinguishing objects, resulting in suboptimal object-centric representations. This flaw limits current object-centric models to relatively simple downstream tasks. To address this issue, we draw on humans' top-down vision pathway and propose Top-Down Guided Network (TDGNet), which includes a top-down pathway to improve object-centric representations. During training, the top-down pathway constructs guidance with high-level object-centric representations to optimize low-level grid features output by the backbone. During inference, it refines object-centric representations by detecting and resolving conflicts between low- and high-level features. We show that TDGNet outperforms current object-centric models on multiple datasets of varying complexity. In addition, we expand the downstream task scope of object-centric representations by applying TDGNet to the field of robotics, validating its effectiveness in downstream tasks including video prediction and visual planning.
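
The abstract does not specify how the guidance or the conflict detection is computed, so the following is only a toy sketch of the general idea, not TDGNet's actual formulation. It assumes that high-level representations are a small set of slot vectors, that each low-level grid feature is softly assigned to slots by dot-product similarity, and that a "conflict" is a grid cell whose feature lies far from its slot-derived target. All names (`assign`, `residuals`, `guidance_loss`, `conflicts`) and the threshold are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def assign(grid, slots):
    """Soft assignment of each grid feature to each slot (each row sums to 1)."""
    return [softmax([dot(f, s) for s in slots]) for f in grid]

def residuals(grid, slots):
    """Squared distance between each grid feature and its top-down target.

    Assumption: the target for a cell is the assignment-weighted mix of slots.
    """
    out = []
    for f, w in zip(grid, assign(grid, slots)):
        target = [sum(wk * s[d] for wk, s in zip(w, slots)) for d in range(len(f))]
        out.append(sum((x - t) ** 2 for x, t in zip(f, target)))
    return out

def guidance_loss(grid, slots):
    """Training-time signal (assumed form): mean residual over all grid cells."""
    r = residuals(grid, slots)
    return sum(r) / len(r)

def conflicts(grid, slots, threshold=1.0):
    """Inference-time check (assumed form): cells that disagree with their target."""
    return [i for i, r in enumerate(residuals(grid, slots)) if r > threshold]

# Two slots and three grid cells; the third cell matches neither slot well.
slots = [[1.0, 0.0], [0.0, 1.0]]
grid = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]
print(conflicts(grid, slots))  # [2]
```

In this sketch, minimizing `guidance_loss` with respect to the grid features would pull low-level features toward the objects they are assigned to, while `conflicts` flags cells that a refinement step might revisit. A real model would use learned encoders and gradient-based training rather than fixed vectors.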

