"Set It Up!": Functional Object Arrangement with Compositional Generative Models
This work addresses robotic task planning under vague human instructions. The contribution is incremental: it builds on prior arrangement methods by adding compositional generative models.
The paper tackles the problem of enabling robots to interpret under-specified instructions for functional object arrangements, such as "set up a dining table for two". It introduces the SetItUp framework, which uses LLMs to propose abstract spatial relationships among objects and a library of diffusion models to ground those relationships into object poses. The resulting arrangements are reported to outperform existing models in physical plausibility, functionality, and aesthetics.
This paper studies the challenge of developing robots capable of understanding under-specified instructions for creating functional object arrangements, such as "set up a dining table for two"; previous arrangement approaches have focused on much more explicit instructions, such as "put object A on the table." We introduce a framework, SetItUp, for learning to interpret under-specified instructions. SetItUp takes a small number of training examples and a human-crafted program sketch to uncover arrangement rules for specific scene types. By leveraging an intermediate graph-like representation of abstract spatial relationships among objects, SetItUp decomposes the arrangement problem into two subproblems: i) learning the arrangement patterns from limited data and ii) grounding these abstract relationships into object poses. SetItUp leverages large language models (LLMs) to propose the abstract spatial relationships among objects in novel scenes as the constraints to be satisfied; then, it composes a library of diffusion models associated with these abstract relationships to find object poses that satisfy the constraints. We validate our framework on a dataset comprising study desks, dining tables, and coffee tables, with the results showing superior performance in generating physically plausible, functional, and aesthetically pleasing object arrangements compared to existing models.
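The two-stage decomposition described above (abstract relationships first, poses second) can be illustrated with a toy sketch. This is not the SetItUp implementation. The relation names `left_of` and `right_of`, the `Relation` class, and the `gap` parameter are hypothetical. The hand-written relation list stands in for what an LLM would propose, and plain gradient descent on summed quadratic penalties stands in for composing learned diffusion models. It only shows how per-relation terms can be combined so that one set of poses satisfies all constraints at once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    kind: str  # hypothetical relation name: "left_of" or "right_of"
    a: str     # object the relation is about
    b: str     # reference object


def relation_grad(rel, poses, gap=0.2):
    """Gradient of a quadratic penalty that is zero when `rel` holds."""
    (ax, ay), (bx, by) = poses[rel.a], poses[rel.b]
    if rel.kind == "left_of":     # a sits `gap` to the left of b
        ex = (bx - ax) - gap
    elif rel.kind == "right_of":  # a sits `gap` to the right of b
        ex = (bx - ax) + gap
    else:
        raise ValueError(f"unknown relation: {rel.kind}")
    ey = by - ay                  # both relations also align the y coordinates
    # penalty = ex**2 + ey**2, differentiated w.r.t. the poses of a and b
    return {rel.a: (-2 * ex, -2 * ey), rel.b: (2 * ex, 2 * ey)}


def solve(relations, poses, steps=500, lr=0.05):
    """Compose all relation penalties by summing their gradients each step."""
    poses = dict(poses)
    for _ in range(steps):
        total = {name: [0.0, 0.0] for name in poses}
        for rel in relations:
            for name, (gx, gy) in relation_grad(rel, poses).items():
                total[name][0] += gx
                total[name][1] += gy
        poses = {
            name: (x - lr * total[name][0], y - lr * total[name][1])
            for name, (x, y) in poses.items()
        }
    return poses


# Stand-in for LLM output: an abstract relation graph for one place setting.
relations = [
    Relation("left_of", "fork", "plate"),
    Relation("right_of", "knife", "plate"),
]
# Arbitrary starting positions (x, y) in metres; values are illustrative only.
initial = {"fork": (0.5, 0.3), "plate": (0.0, 0.0), "knife": (-0.4, 0.1)}
result = solve(relations, initial)
```

Because the plate appears in both relations, its pose is pulled by both penalty terms, which is the sense in which the constraints are composed rather than solved one at a time.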