RO · CV · Nov 23, 2025

AutoFocus-IL: VLM-based Saliency Maps for Data-Efficient Visual Imitation Learning without Extra Human Annotations

arXiv:2511.18617v2 · 4 citations
Originality: Highly original
AI Analysis

This work aims to reduce annotation costs and improve policy performance in visual imitation learning for robotics and autonomous systems. It is assessed as a novel approach rather than an incremental improvement.

AutoFocus-IL tackles the problem of data inefficiency and poor generalization in visual imitation learning by using vision-language models to automatically generate saliency maps from demonstrations, eliminating the need for costly human annotations. The method outperforms standard behavior cloning and state-of-the-art baselines that rely on human supervision, as demonstrated in CARLA simulator and real-robot manipulation tasks.

AutoFocus-IL is a simple yet effective method to improve data efficiency and generalization in visual imitation learning by guiding policies to attend to task-relevant features rather than distractors and spurious correlations. Although saliency regularization has emerged as a promising way to achieve this, existing approaches typically require costly supervision such as human gaze data or manual saliency annotations. In contrast, AutoFocus-IL leverages vision-language models (VLMs) to automatically identify and track key objects in demonstrations, generating temporal saliency maps that highlight causal visual signals while suppressing distractors. These maps are then used to regularize behavior cloning policies, yielding stronger alignment between visual attention and task-relevant cues. Experiments in both the CARLA simulator and real-robot manipulation tasks demonstrate that AutoFocus-IL not only outperforms standard behavior cloning but also surpasses state-of-the-art baselines that assume privileged access to human supervision, such as gaze data. Code, datasets, and trained policy videos are available at https://AutoFocus-IL.github.io/.
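The idea of regularizing a behavior cloning policy with saliency maps can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the loss, so the Gaussian rendering of object boxes, the KL-divergence penalty, the `weight` value, and all function names below are assumptions chosen for illustration. The VLM step that identifies and tracks key objects is not shown; its output is assumed to arrive as per-frame bounding boxes.

```python
import numpy as np


def boxes_to_saliency(boxes, height, width, sigma_scale=0.5):
    """Render a saliency map that sums to 1 from (x_min, y_min, x_max, y_max) boxes.

    Assumption: each tracked object becomes a Gaussian blob centred on its box.
    """
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float64)
    saliency = np.zeros((height, width), dtype=np.float64)
    for x0, y0, x1, y1 in boxes:
        cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        sx = max((x1 - x0) * sigma_scale, 1.0)
        sy = max((y1 - y0) * sigma_scale, 1.0)
        blob = np.exp(-0.5 * (((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2))
        saliency = np.maximum(saliency, blob)
    total = saliency.sum()
    if total == 0.0:
        # No objects in this frame: fall back to a uniform map.
        return np.full((height, width), 1.0 / (height * width))
    return saliency / total


def saliency_kl(target, attention, eps=1e-8):
    """KL(target || attention) between two non-negative maps of equal shape."""
    t = target.ravel() + eps
    a = attention.ravel() + eps
    t, a = t / t.sum(), a / a.sum()
    return float(np.sum(t * np.log(t / a)))


def regularized_bc_loss(pred_action, expert_action, attention, target, weight=0.1):
    """Behavior cloning MSE plus a weighted saliency-alignment penalty (hypothetical form)."""
    bc = float(np.mean((np.asarray(pred_action) - np.asarray(expert_action)) ** 2))
    return bc + weight * saliency_kl(target, attention)
```

In this sketch, a policy whose attention map matches the target saliency pays no penalty, while one that spreads attention uniformly over the image incurs a positive penalty even when its predicted action is correct. In a real training loop the attention map would come from the policy network and the loss would be written in a differentiable framework.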

