RO, AI · Jan 20

DroneVLA: VLA based Aerial Manipulation

arXiv:2601.13809v2 · 1 citation · h-index: 24
AI Analysis

This paper addresses the challenge of intuitive human-drone interaction for aerial manipulation, though the contribution appears incremental: it integrates existing components such as VLA models and planning algorithms.

The paper tackles the problem of enabling non-expert users to command aerial manipulation systems via natural language. It introduces a system that interprets high-level commands to retrieve and deliver objects. Real-world localization and navigation experiments show errors of 0.164 m (maximum), 0.070 m (mean Euclidean), and 0.084 m (root-mean-square).

As aerial platforms evolve from passive observers to active manipulators, the challenge shifts toward designing intuitive interfaces that allow non-expert users to command these systems naturally. This work introduces a novel concept for an autonomous aerial manipulation system capable of interpreting high-level natural language commands to retrieve objects and deliver them to a human user. The system is intended to integrate MediaPipe, Grounding DINO, and a Vision-Language-Action (VLA) model with a custom-built drone equipped with a 1-DOF gripper and an Intel RealSense RGB-D camera. The VLA model performs semantic reasoning to interpret the intent of a user prompt and generates a prioritized task queue for grasping relevant objects in the scene. Grounding DINO and a dynamic A* planning algorithm are used to navigate and safely relocate the object. To ensure safe and natural interaction during the handover phase, the system employs a human-centric controller driven by MediaPipe. This module provides real-time human pose estimation, allowing the drone to use visual servoing to maintain a stable, distinct position directly in front of the user, facilitating a comfortable handover. We demonstrate the system's efficacy through real-world localization and navigation experiments, which yielded maximum, mean Euclidean, and root-mean-square errors of 0.164 m, 0.070 m, and 0.084 m, respectively, highlighting the feasibility of VLA for aerial manipulation operations.
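The three error figures quoted in the abstract are standard summary statistics over per-sample position errors. The abstract does not say exactly how the authors computed them, so the following is only a minimal sketch under one plausible assumption: each sample is the Euclidean distance between an estimated and a reference 3D position. The function name `position_error_stats` and the sample coordinates are hypothetical and are not taken from the paper.

```python
import math


def position_error_stats(estimated, reference):
    """Return (max, mean, rms) of per-sample Euclidean errors, in metres.

    Assumption: both inputs are equal-length sequences of (x, y, z) tuples.
    """
    if len(estimated) != len(reference) or not estimated:
        raise ValueError("need two non-empty sequences of equal length")
    # Per-sample Euclidean distance between estimated and reference positions.
    errors = [math.dist(p, q) for p, q in zip(estimated, reference)]
    max_err = max(errors)
    mean_err = sum(errors) / len(errors)
    # Root-mean-square: square each error, average, then take the square root.
    rms_err = math.sqrt(sum(e * e for e in errors) / len(errors))
    return max_err, mean_err, rms_err


# Made-up waypoints for illustration only (not data from the paper).
reference = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (1.0, 1.0, 1.0)]
estimated = [(0.03, 0.04, 1.0), (1.0, 0.0, 1.0), (1.0, 1.06, 1.08)]

max_err, mean_err, rms_err = position_error_stats(estimated, reference)
print(f"max={max_err:.3f} m, mean={mean_err:.3f} m, rms={rms_err:.3f} m")
```

For any set of non-negative errors, the mean is at most the RMS, which is at most the maximum. The reported values (0.070 m, 0.084 m, 0.164 m) follow that ordering.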

