CV · Jun 16, 2023

OCTScenes: A Versatile Real-World Dataset of Tabletop Scenes for Object-Centric Learning

Yinxuan Huang, Tonglin Chen, Zhimeng Shen, Jinghao Huang, Bin Li, Xiangyang Xue

arXiv:2306.09682v33.93 citations · h-index: 20 · Has Code

Originality: Synthesis-oriented

AI Analysis

This work provides a benchmark for evaluating object-centric learning methods on real-world scenes. It addresses a domain-specific gap, but the contribution is incremental because it focuses on dataset creation rather than novel algorithmic advances.

The authors address the scarcity of real-world datasets for object-centric learning by introducing OCTScenes, a dataset of 5000 tabletop scenes composed from a total of 15 objects, with each scene captured in 60 frames. They find that state-of-the-art methods struggle to learn meaningful representations from this real-world data despite performing well on synthetic datasets.

Humans possess the cognitive ability to comprehend scenes in a compositional manner. To empower AI systems with similar capabilities, object-centric learning aims to acquire representations of individual objects from visual scenes without any supervision. Although object-centric learning has made remarkable progress on complex synthetic datasets, applying it to complex real-world scenes remains a major challenge. One essential reason is the scarcity of real-world datasets specifically tailored to object-centric learning. To address this problem, we propose OCTScenes, a versatile real-world dataset of tabletop scenes for object-centric learning, designed to serve as a benchmark for comparing, evaluating, and analyzing object-centric learning methods. OCTScenes contains 5000 tabletop scenes with a total of 15 objects. Each scene is captured in 60 frames covering a 360-degree perspective. Consequently, OCTScenes can be used to evaluate object-centric learning methods in single-image, video, and multi-view settings. Extensive experiments with representative object-centric learning methods are conducted on OCTScenes. The results demonstrate the shortcomings of state-of-the-art methods in learning meaningful representations from real-world data, despite their impressive performance on complex synthetic datasets. Furthermore, OCTScenes can serve as a catalyst for the advancement of existing methods, inspiring them to adapt to real-world scenes. Dataset and code are available at https://huggingface.co/datasets/Yinxuan/OCTScenes.
