RO | CV | Aug 25, 2025

Egocentric Instruction-oriented Affordance Prediction via Large Multimodal Model

arXiv:2508.17922v1 | 1 citation | h-index: 1
Originality: Incremental advance
AI Analysis

This paper addresses the need for task-specific affordance prediction in robotics. The advance is incremental: it builds on existing affordance concepts by making them depend on the instruction.

The paper tackles the problem of instruction-dependent affordance prediction for robot manipulation by introducing a new dataset of 15,000 object-instruction-affordance triplets and a method using large multimodal models with a 'search against verifiers' pipeline, achieving outstanding performance.

Affordance is crucial for intelligent robots in the context of object manipulation. In this paper, we argue that affordance should be task-/instruction-dependent, which is overlooked by many previous works. That is, different instructions can lead to different manipulation regions and directions even for the same object. According to this observation, we present a new dataset comprising fifteen thousand object-instruction-affordance triplets. All scenes in the dataset are from an egocentric viewpoint, designed to approximate the perspective of a human-like robot. Furthermore, we investigate how to enable large multimodal models (LMMs) to serve as affordance predictors by implementing a "search against verifiers" pipeline. An LMM is asked to progressively predict affordances, with the output at each step being verified by itself during the iterative process, imitating a reasoning process. Experiments show that our method not only unlocks new instruction-oriented affordance prediction capabilities, but also achieves outstanding performance broadly.
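The propose-then-verify loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Affordance` fields, the `propose` and `verify` callables (stand-ins for LMM calls), the step count, and the acceptance threshold are all assumptions made for illustration, and the toy mug example uses hand-written rules in place of a real model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Affordance:
    # Hypothetical representation: a pixel box plus a 2D manipulation direction.
    region: Tuple[int, int, int, int]
    direction: Tuple[float, float]


# Hypothetical interfaces standing in for LMM calls (not from the paper).
Proposer = Callable[[str, str, List[Affordance]], List[Affordance]]
Verifier = Callable[[str, str, Affordance], float]


def search_against_verifier(
    image_id: str,
    instruction: str,
    propose: Proposer,
    verify: Verifier,
    steps: int = 3,
    threshold: float = 0.5,
) -> Optional[Affordance]:
    """Iteratively propose candidates and keep the one the verifier scores highest."""
    rejected: List[Affordance] = []
    best: Optional[Affordance] = None
    best_score = float("-inf")
    for _ in range(steps):
        candidates = propose(image_id, instruction, rejected)
        if not candidates:
            break
        # Score every candidate and take the top one for this step.
        score, cand = max(
            ((verify(image_id, instruction, c), c) for c in candidates),
            key=lambda pair: pair[0],
        )
        if score > best_score:
            best, best_score = cand, score
        if score >= threshold:
            return cand  # verifier accepts: stop early
        rejected.append(cand)  # feed the rejection back to the proposer
    return best


# Toy stand-ins: two fixed candidates on a mug, scored by a keyword rule.
HANDLE = Affordance(region=(80, 40, 100, 90), direction=(0.0, -1.0))
RIM = Affordance(region=(20, 10, 80, 20), direction=(1.0, 0.0))


def toy_propose(image_id: str, instruction: str, rejected: List[Affordance]) -> List[Affordance]:
    return [c for c in (HANDLE, RIM) if c not in rejected]


def toy_verify(image_id: str, instruction: str, candidate: Affordance) -> float:
    wants_handle = "pick up" in instruction
    return 1.0 if wants_handle == (candidate == HANDLE) else 0.0


pick = search_against_verifier("mug.png", "pick up the mug", toy_propose, toy_verify)
pour = search_against_verifier("mug.png", "pour out the water", toy_propose, toy_verify)
```

In this toy setup the same object yields a different region and direction for each instruction, which is the instruction dependence the abstract argues for. In a real system both callables would query the same LMM, since the abstract states that the model verifies its own outputs.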
