CV | Aug 27, 2025

Self-supervised structured object representation learning

arXiv:2508.19864v1 | h-index: 5 | ISVC
Originality: Incremental advance
AI Analysis

This work targets structured object representation learning in SSL. It is aimed at computer vision researchers and practitioners and offers an incremental advance over existing methods such as DINO.

The paper addresses a limitation of self-supervised learning (SSL): existing approaches capture the structure of visual scenes poorly. It proposes a method that combines semantic grouping, instance separation, and hierarchical structuring to learn object-centric representations. These representations improve supervised object detection and outperform state-of-the-art methods on a combined subset of COCO and UA-DETRAC, even with limited annotated data and fewer fine-tuning epochs.

Self-supervised learning (SSL) has emerged as a powerful technique for learning visual representations. While recent SSL approaches achieve strong results in global image understanding, they are limited in capturing the structured representation of scenes. In this work, we propose a self-supervised approach that progressively builds structured visual representations by combining semantic grouping, instance-level separation, and hierarchical structuring. Our approach, based on a novel ProtoScale module, captures visual elements across multiple spatial scales. Unlike common strategies such as DINO that rely on random cropping and global embeddings, we preserve full scene context across augmented views to improve performance in dense prediction tasks. We validate our method on downstream object detection tasks using a combined subset of multiple datasets (COCO and UA-DETRAC). Experimental results show that our method learns object-centric representations that enhance supervised object detection and outperform state-of-the-art methods, even when trained with limited annotated data and fewer fine-tuning epochs.
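
The contrast the abstract draws between random cropping and preserving full scene context can be illustrated with a small NumPy sketch. This is not the paper's augmentation pipeline, which the abstract does not specify. The function names `random_crop_view` and `full_scene_view`, the crop scale, and the choice of a horizontal flip plus brightness jitter are all assumptions made for illustration.

```python
import numpy as np

def random_crop_view(img, rng, scale=0.5):
    """Simplified crop-based view: keep only a random sub-window of the scene."""
    h, w = img.shape[:2]
    ch, cw = int(h * scale), int(w * scale)
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    return img[top:top + ch, left:left + cw]

def full_scene_view(img, rng):
    """Hypothetical full-scene view: flip and brightness jitter, no cropping."""
    flipped = bool(rng.integers(0, 2))
    out = img[:, ::-1] if flipped else img
    gain = rng.uniform(0.8, 1.2)  # assumed jitter range, not from the paper
    out = np.clip(out * gain, 0.0, 1.0)
    return out, flipped

rng = np.random.default_rng(0)
image = rng.random((64, 96, 3))  # stand-in for an RGB image with values in [0, 1)

crop = random_crop_view(image, rng)
view, flipped = full_scene_view(image, rng)

print(crop.shape)  # a quarter of the original area; objects outside it are lost
print(view.shape)  # same spatial extent as the input; every object stays visible
```

Because the full-scene view keeps the spatial extent and records whether it was flipped, each pixel in one view can be mapped to its counterpart in another, which is the kind of correspondence dense prediction tasks depend on.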
