RO, CV · Jul 18, 2024

Simultaneous Localization and Affordance Prediction of Tasks from Egocentric Video

arXiv:2407.13856v2 · 1 citation · h-index: 2
Originality: Incremental advance
AI Analysis

The approach is intended to let robots navigate to task-relevant areas from natural-language instructions, addressing a domain-specific need in robotics.

The paper tackles the limitation of Vision-Language Models (VLMs) in reasoning beyond visible objects by extending them with spatial localization from egocentric video to predict task affordances and locations, showing reduced error in both predictions compared to a baseline.

Vision-Language Models (VLMs) have shown great success as foundational models for downstream vision and natural language applications in a variety of domains. However, these models are limited to reasoning over objects and actions currently visible on the image plane. We present a spatial extension to the VLM, which leverages spatially-localized egocentric video demonstrations to augment VLMs in two ways: through understanding spatial task-affordances, i.e., where an agent must be for the task to physically take place, and through localizing that task relative to the egocentric viewer. We show our approach outperforms a baseline that uses a VLM to map the similarity of a task's description over a set of location-tagged images. Our approach yields lower error both when predicting where a task may take place and when predicting what tasks are likely to happen at the current location. The resulting representation will enable robots to use egocentric sensing to navigate to, or around, physical regions of interest for novel tasks specified in natural language.
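
The baseline mentioned in the abstract (scoring a task description against a set of location-tagged images) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which VLM, similarity measure, or selection rule is used. The sketch assumes cosine similarity over precomputed embeddings and picks the location tag of the best-matching image. The function names `cosine` and `localize_task`, the 3-D vectors, and the example scene labels are all hypothetical stand-ins for real VLM outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def localize_task(task_embedding, tagged_images):
    """Return the (x, y) tag of the image most similar to the task embedding.

    tagged_images is a list of (image_embedding, (x, y)) pairs. In practice the
    embeddings would come from a VLM; here they are assumed to be given.
    """
    _, best_location = max(
        tagged_images, key=lambda pair: cosine(task_embedding, pair[0])
    )
    return best_location


# Toy 3-D embeddings, invented purely for illustration.
tagged_images = [
    ([0.9, 0.1, 0.0], (1.0, 2.0)),  # hypothetical view of a sink
    ([0.1, 0.9, 0.0], (4.0, 0.5)),  # hypothetical view of a stove
    ([0.0, 0.2, 0.9], (6.0, 3.0)),  # hypothetical view of a desk
]
task = [0.2, 0.8, 0.1]  # hypothetical embedding of a task description

predicted = localize_task(task, tagged_images)
```

Under these assumptions, a baseline of this kind can only return locations where a matching image was actually captured, which is consistent with the limitation the abstract describes: reasoning restricted to what is visible on the image plane.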

