RO · Mar 8

AeroPlace-Flow: Language-Grounded Object Placement for Aerial Manipulators via Visual Foresight and Object Flow

Sarthak Mishra, Rishabh Dev Yadav, Naveen Nair, Wei Pan, Spandan Roy

arXiv:2603.07744v1 · 8.55 citations · h-index: 7

Predicted impact: top 48% in RO · last 90 days · Originality: Highly original

AI Analysis

This work presents a training-free framework for language-grounded object placement with aerial manipulators, sparing users the cumbersome task of specifying exact placement poses.

This paper addresses precise object placement for aerial manipulators when users give natural language instructions instead of exact coordinates. The proposed AeroPlace-Flow framework synthesizes a goal image from the instruction, grounds it in metric 3D space, and infers a collision-aware object flow, reaching an average success rate of 75% on hardware.

Precise object placement remains underexplored in aerial manipulation, where most systems rely on predefined target coordinates and focus primarily on grasping and control. Specifying exact placement poses, however, is cumbersome in real-world settings, where users naturally communicate goals through language. In this work, we present AeroPlace-Flow, a training-free framework for language-grounded aerial object placement that unifies visual foresight with explicit 3D geometric reasoning and object flow. Given RGB-D observations of the object and the placement scene, along with a natural language instruction, AeroPlace-Flow first synthesizes a task-complete goal image using image editing models. The imagined configuration is then grounded into metric 3D space through depth alignment and object-centric reasoning, enabling the inference of a collision-aware object flow that transports the grasped object to a language- and contact-consistent placement configuration. The resulting motion is executed via standard trajectory tracking for an aerial manipulator. AeroPlace-Flow produces executable placement targets without requiring predefined poses or task-specific training. We validate our approach through extensive simulation and real-world experiments, demonstrating reliable language-conditioned placement across diverse aerial scenarios with an average success rate of 75% on hardware.
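
The abstract describes turning an inferred object flow into an executable placement target. As a rough illustration only, the sketch below assumes the flow is a set of per-point 3D displacements on a rigid object and recovers the best-fit rigid transform with the standard Kabsch (SVD) method. This is not the authors' implementation: the function name `rigid_transform_from_flow` and its parameters `src_pts` and `flow` are hypothetical, and collision awareness and depth alignment are left out entirely.

```python
import numpy as np


def rigid_transform_from_flow(src_pts, flow):
    """Fit a rigid transform (R, t) so that R @ p + t ~= p + flow(p).

    Hypothetical helper for illustration: src_pts is an (N, 3) array of
    object points, flow is an (N, 3) array of per-point displacements.
    Uses the Kabsch least-squares solution via SVD.
    """
    src = np.asarray(src_pts, dtype=float)
    dst = src + np.asarray(flow, dtype=float)

    # Center both point sets so rotation can be solved independently.
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)

    # Cross-covariance between centered source and target points.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so the result is a proper rotation
    # (determinant +1) rather than a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    # Translation maps the rotated source centroid onto the target centroid.
    t = dst_c - R @ src_c
    return R, t
```

Under these assumptions, the returned `R` and `t` would describe the placement pose relative to the object's current pose, which a trajectory tracker could then take as its goal.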
