Weakly Supervised Learning of Affordances
This work addresses scene understanding for robotics and AI by localizing affordances with reduced supervision, though the contribution is incremental: it applies existing methods to a new dataset.
The paper tackles affordance segmentation by framing it as semantic image segmentation and introduces a pixel-annotated dataset of 3090 images with 9916 object instances. It uses a deep convolutional neural network within an expectation maximization framework to exploit weakly labeled data, namely image-level or keypoint annotations, and shows that supervision can be reduced further with only a minimal loss in performance when human pose is used as context.
Localizing the functional regions of objects, known as affordances, is an important aspect of scene understanding. In this work, we cast affordance segmentation as a semantic image segmentation problem. To explore various levels of supervision, we introduce a pixel-annotated affordance dataset of 3090 images containing 9916 object instances, with rich contextual information in the form of human-object interactions. We use a deep convolutional neural network within an expectation maximization framework to take advantage of weakly labeled data such as image-level annotations or keypoint annotations. We show that a further reduction in supervision is possible with a minimal loss in performance when human pose is used as context.
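To make the role of expectation maximization with weak labels more concrete, the following is a minimal sketch of one possible E-step, not the implementation used in this work. It assumes the network outputs per-pixel class scores, that image-level annotations list the classes present in an image, and that keypoint annotations are (row, col, class) triples. The function name `e_step` and all parameter names are hypothetical.

```python
import numpy as np


def e_step(scores, present_classes, keypoints=()):
    """Estimate latent pixel labels from class scores under weak labels.

    Illustrative assumption, not the authors' code.

    scores: float array of shape (C, H, W) with per-pixel class scores.
    present_classes: class indices known to appear in the image
        (the image-level annotation).
    keypoints: iterable of (row, col, class_index) annotations.
    """
    num_classes = scores.shape[0]

    # Only classes named by the image-level annotation may be assigned.
    allowed = np.zeros(num_classes, dtype=bool)
    allowed[list(present_classes)] = True
    masked = np.where(allowed[:, None, None], scores, -np.inf)

    # Each pixel takes the best-scoring allowed class.
    labels = masked.argmax(axis=0)

    # Annotated keypoints override the estimate at their locations.
    for row, col, cls in keypoints:
        labels[row, col] = cls
    return labels
```

In an alternating scheme of this kind, the M-step would retrain the segmentation network with the estimated label map as its target, and the two steps would repeat until the labels stabilize.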