Weakly-supervised Latent Models for Task-specific Visual-Language Control
The work addresses the need for efficient spatial grounding in hazardous environments, but it is an incremental improvement over existing methods.
The paper tackles the problem of enabling AI agents to perform precise visual-language control for tasks like autonomous inspection, where direct use of large language models achieves only 58% success. The proposed task-specific latent dynamics model raises success to 71% and generalizes to unseen images and instructions.
Autonomous inspection in hazardous environments requires AI agents that can interpret high-level goals and execute precise control. A key capability for such agents is spatial grounding, for example when a drone must center a detected object in its camera view to enable reliable inspection. While large language models provide a natural interface for specifying goals, using them directly for visual control achieves only 58\% success on this task. We envision that equipping agents with a world model as a tool would allow them to roll out candidate actions and perform better in spatially grounded settings, but conventional world models are data- and compute-intensive. To address this, we propose a task-specific latent dynamics model that learns state-specific, action-induced shifts in a shared latent space using only goal-state supervision. The model leverages global action embeddings and complementary training losses to stabilize learning. In experiments, our approach achieves 71\% success and generalizes to unseen images and instructions, highlighting the potential of compact, domain-specific latent dynamics models for spatial alignment in autonomous inspection.
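The idea of rolling out candidate actions in a latent space can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the architecture or training procedure of the paper: the two-dimensional latent, the hand-written `ACTION_EMBEDDINGS` table, the norm-based scale standing in for a learned state-specific shift, and the helper names `predict_next_latent`, `select_action`, and `rollout` are all hypothetical.

```python
import math

# Hypothetical global action embeddings: one base latent shift per action.
# In a learned model these would be trained parameters, not fixed constants.
ACTION_EMBEDDINGS = {
    "left": (-1.0, 0.0),
    "right": (1.0, 0.0),
    "up": (0.0, 1.0),
    "down": (0.0, -1.0),
}


def predict_next_latent(z, action, gain=0.1):
    """Toy dynamics: z' = z + (1 + gain * ||z||) * e_action.

    The state-dependent scale is an illustrative stand-in for a learned,
    state-specific modulation of the global action embedding.
    """
    e = ACTION_EMBEDDINGS[action]
    scale = 1.0 + gain * math.hypot(*z)
    return tuple(zi + scale * ei for zi, ei in zip(z, e))


def select_action(z, z_goal):
    """Roll out each candidate action one step and return the action whose
    predicted latent is closest (Euclidean distance) to the goal latent."""
    return min(
        ACTION_EMBEDDINGS,
        key=lambda a: math.dist(predict_next_latent(z, a), z_goal),
    )


def rollout(z, z_goal, max_steps=20, tol=0.75):
    """Greedily apply the best one-step action until the latent is within
    `tol` of the goal or `max_steps` is reached."""
    actions = []
    for _ in range(max_steps):
        if math.dist(z, z_goal) <= tol:
            break
        action = select_action(z, z_goal)
        actions.append(action)
        z = predict_next_latent(z, action)
    return actions, z


if __name__ == "__main__":
    # Latent of the current view is offset from the goal (object centered).
    plan, final_z = rollout((3.0, 0.0), (0.0, 0.0))
    print(plan, final_z)
```

In this sketch the goal latent plays the role that goal-state supervision plays in the abstract: the agent never needs a label for the correct action, only a way to compare predicted latents against the latent of the desired view.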