SceneScore: Learning a Cost Function for Object Arrangement
The work addresses the challenge of enabling robots to create human-like arrangements without environment interaction or human supervision, though it appears incremental because it builds on existing energy-based models and graph neural networks.
The paper tackles the problem of evaluating object arrangements for robots by learning a cost function called SceneScore from example images, enabling tasks like predicting poses for missing objects and generalizing to novel objects with semantic features.
Arranging objects correctly is a key capability for robots, one that unlocks a wide range of useful tasks. A prerequisite for creating successful arrangements is the ability to evaluate the desirability of a given arrangement. Our method, "SceneScore", learns a cost function for arrangements, such that desirable, human-like arrangements have a low cost. We learn the distribution of training arrangements offline using an energy-based model, solely from example images, without requiring environment interaction or human supervision. Our model is represented by a graph neural network which learns object-object relations, using graphs constructed from images. Experiments demonstrate that the learned cost function can be used to predict poses for missing objects, can generalise to novel objects using semantic features, and can be composed with other cost functions to satisfy constraints at inference time.
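The idea of using a cost function to place a missing object, and of composing it with a second cost at inference time, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the hand-written `pairwise_cost` stands in for the learned graph-neural-network energy, the `keep_out_cost` constraint is hypothetical, and the grid search is a simple substitute for whatever optimisation the method actually uses, which the abstract does not specify.

```python
import math

def pairwise_cost(p, q, preferred_dist=1.0):
    # Toy stand-in for a learned object-object relation term (assumption):
    # low cost when two objects sit at a preferred distance from each other.
    return (math.dist(p, q) - preferred_dist) ** 2

def scene_cost(positions):
    # Sum the pairwise terms over every pair of objects,
    # i.e. over the edges of a fully connected scene graph.
    total = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            total += pairwise_cost(positions[i], positions[j])
    return total

def keep_out_cost(p, centre=(0.5, -0.866), radius=0.3, weight=10.0):
    # Hypothetical inference-time constraint: penalise positions
    # that fall inside a disc around `centre`.
    return weight * max(0.0, radius - math.dist(p, centre)) ** 2

def predict_missing_pose(known, extra_cost=None, lo=-2.0, hi=2.0, steps=81):
    # Grid search for the 2D position of one missing object that minimises
    # the scene cost, optionally composed (summed) with an extra cost.
    best, best_cost = None, math.inf
    for ix in range(steps):
        for iy in range(steps):
            p = (lo + (hi - lo) * ix / (steps - 1),
                 lo + (hi - lo) * iy / (steps - 1))
            cost = scene_cost(known + [p])
            if extra_cost is not None:
                cost += extra_cost(p)
            if cost < best_cost:
                best, best_cost = p, cost
    return best, best_cost

known_objects = [(0.0, 0.0), (1.0, 0.0)]
free_pose, free_cost = predict_missing_pose(known_objects)
constrained_pose, constrained_cost = predict_missing_pose(
    known_objects, extra_cost=keep_out_cost)
print("unconstrained:", free_pose)
print("with keep-out constraint:", constrained_pose)
```

In this toy setting the two low-cost placements are mirror images of each other across the line joining the known objects, and adding the keep-out term rules out the one below that line. Summing costs in this way is one simple reading of "composed with other cost functions"; the paper may combine them differently.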