Affogato: Learning Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale
This work addresses fine-grained affordance grounding, a capability intelligent agents need in order to understand and interact with their environments. Its contribution is incremental, as it builds on existing benchmarks and methods.
The authors address affordance grounding, the task of localizing object regions from natural language descriptions, by introducing Affogato, a large-scale benchmark with 150K instances and 3D affordance heatmaps. They also develop vision-language models that achieve promising performance and open-vocabulary cross-domain generalization.
Affordance grounding, the task of localizing object regions based on natural language descriptions of interactions, is critical for enabling intelligent agents to understand and interact with their environments. However, this task remains challenging due to the need for fine-grained part-level localization, the ambiguity arising from multiple valid interaction regions, and the scarcity of large-scale datasets. In this work, we introduce Affogato, a large-scale benchmark comprising 150K instances, annotated with open-vocabulary text descriptions and corresponding 3D affordance heatmaps across a diverse set of objects and interactions. Building on this benchmark, we develop simple yet effective vision-language models that leverage pretrained part-aware vision backbones and a text-conditional heatmap decoder. Our models trained on the Affogato dataset achieve promising performance on existing 2D and 3D benchmarks and, notably, generalize effectively across domains in an open-vocabulary setting. The Affogato dataset is publicly available at https://huggingface.co/datasets/project-affogato/affogato
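To give a rough sense of what a text-conditional heatmap decoder produces, the following minimal sketch scores each image patch by its cosine similarity to a text embedding and rescales the scores to [0, 1]. It is not the architecture used in this work: the function names, the toy 2-dimensional features, and the choice of cosine similarity with min-max normalization are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def text_conditioned_heatmap(patch_features, text_embedding):
    """Hypothetical helper: map an H x W grid of patch feature vectors
    to an H x W heatmap in [0, 1], conditioned on a text embedding."""
    # Score every patch against the text query.
    scores = [[cosine(f, text_embedding) for f in row] for row in patch_features]
    # Min-max normalize so the best-matching patch is 1 and the worst is 0.
    flat = [s for row in scores for s in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in scores]
    return [[(s - lo) / (hi - lo) for s in row] for row in scores]


# Toy 2 x 2 grid of patch features and a toy text embedding (assumed values).
patches = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [-1.0, 0.0]],
]
query = [1.0, 0.0]
heatmap = text_conditioned_heatmap(patches, query)
```

In a real model, the patch features would come from a pretrained vision backbone and the text embedding from a language encoder, with a learned decoder in place of the fixed similarity used here.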