Hierarchical Instruction-aware Embodied Visual Tracking
This work addresses efficient and generalizable embodied visual tracking for robotics and embodied AI systems, though the contribution appears incremental because it builds on existing language and vision models.
The paper tackles the challenge of bridging high-level user instructions with low-level agent actions in User-Centric Embodied Visual Tracking (UC-EVT) by proposing HIEVT, which uses spatial goals as intermediaries to improve instruction comprehension and action generation. The method demonstrates robustness and generalizability across diverse environments, with experiments based on over ten million training trajectories and evaluation in one seen and nine unseen environments.
User-Centric Embodied Visual Tracking (UC-EVT) presents a novel challenge for reinforcement learning-based models due to the substantial gap between high-level user instructions and low-level agent actions. While recent advancements in language models (e.g., LLMs, VLMs, VLAs) have improved instruction comprehension, these models face critical limitations in either inference speed (LLMs, VLMs) or generalizability (VLAs) for UC-EVT tasks. To address these challenges, we propose the \textbf{Hierarchical Instruction-aware Embodied Visual Tracking (HIEVT)} agent, which bridges instruction comprehension and action generation using \textit{spatial goals} as intermediaries. HIEVT first introduces an \textit{LLM-based Semantic-Spatial Goal Aligner} to translate diverse human instructions into spatial goals that directly annotate the desired spatial position. Then, the \textit{RL-based Adaptive Goal-Aligned Policy}, a general offline policy, enables the tracker to position the target as specified by the spatial goal. To benchmark UC-EVT tasks, we collect over ten million trajectories for training and evaluate across one seen environment and nine unseen challenging environments. Extensive experiments and real-world deployments demonstrate the robustness and generalizability of HIEVT across diverse environments, varying target dynamics, and complex instruction combinations. The complete project is available at https://sites.google.com/view/hievt.
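The two-stage structure described above (instruction to spatial goal, then spatial goal to action) can be illustrated with a minimal Python sketch. Everything in it is a hypothetical stand-in rather than the paper's implementation: the `SpatialGoal` fields, the keyword rules that replace the LLM-based aligner, and the proportional controller that replaces the offline RL policy are all assumptions chosen only to show how a spatial goal decouples the two stages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpatialGoal:
    """Desired target placement in the image, normalized to [0, 1] (assumed format)."""
    cx: float    # horizontal center of the target
    cy: float    # vertical center of the target
    size: float  # apparent target height, used here as a proxy for distance


def _clamp(value: float) -> float:
    """Keep a normalized coordinate inside [0, 1]."""
    return min(1.0, max(0.0, value))


def align_instruction(instruction: str, current: SpatialGoal) -> SpatialGoal:
    """Keyword stand-in for the LLM-based aligner: maps an instruction to a spatial goal."""
    text = instruction.lower()
    cx, size = current.cx, current.size
    if "left" in text:
        cx -= 0.2
    if "right" in text:
        cx += 0.2
    if "closer" in text:
        size += 0.1   # a larger apparent target means a shorter distance
    if "farther" in text or "further" in text:
        size -= 0.1
    return SpatialGoal(_clamp(cx), current.cy, _clamp(size))


def goal_aligned_action(observed: SpatialGoal, goal: SpatialGoal, gain: float = 1.0):
    """Proportional stand-in for the RL policy: returns (forward_speed, turn_rate).

    Positive turn_rate means turning right, which shifts the target left in the image.
    """
    forward = gain * (goal.size - observed.size)  # target too small: move forward
    turn = gain * (observed.cx - goal.cx)         # target right of goal: turn right
    return forward, turn


if __name__ == "__main__":
    observed = SpatialGoal(cx=0.5, cy=0.5, size=0.3)
    goal = align_instruction("Keep the target on the left and move closer", observed)
    print(goal, goal_aligned_action(observed, goal))
```

In this sketch the low-level controller never sees the instruction text, only the spatial goal, so the slow language component would need to run only when the instruction changes while the fast controller runs at every step.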