AIAug 26, 2021

Disentangling What and Where for 3D Object-Centric Representations Through Active Inference

Toon Van de Maele, Tim Verbelen, Ozan Catal, Bart Dhoedt

arXiv:2108.11762v16.15 citations

Originality Incremental advance

AI Analysis

This addresses the need for more flexible and robust AI systems in computer vision, particularly for 3D object tasks, though it appears incremental by building on active inference and object-centric approaches.

The paper tackles the problem of object detection and classification models being inflexible to novel categories and ambiguous viewpoints by proposing an active inference agent with object-centric generative models. The agent learns object categories unsupervised, achieves state-of-the-art classification accuracies, and identifies novel categories, with validation in an end-to-end search task.

Although modern object detection and classification models achieve high accuracy, these are typically constrained in advance on a fixed train set and are therefore not flexible to deal with novel, unseen object categories. Moreover, these models most often operate on a single frame, which may yield incorrect classifications in case of ambiguous viewpoints. In this paper, we propose an active inference agent that actively gathers evidence for object classifications, and can learn novel object categories over time. Drawing inspiration from the human brain, we build object-centric generative models composed of two information streams, a what- and a where-stream. The what-stream predicts whether the observed object belongs to a specific category, while the where-stream is responsible for representing the object in its internal 3D reference frame. We show that our agent (i) is able to learn representations for many object categories in an unsupervised way, (ii) achieves state-of-the-art classification accuracies, actively resolving ambiguity when required and (iii) identifies novel object categories. Furthermore, we validate our system in an end-to-end fashion where the agent is able to search for an object at a given pose from a pixel-based rendering. We believe that this is a first step towards building modular, intelligent systems that can be used for a wide range of tasks involving three dimensional objects.

View on arXiv PDF

Similar