CV · Apr 17, 2016

Subjects and Their Objects: Localizing Interactees for a Person-Centric View of Importance

arXiv:1604.04842v1 · 9 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for vision systems to understand person-object interactions in novel scenarios. The advance is incremental: it builds on existing interaction modeling but extends it to actions and objects unseen in the training data.

The paper tackles the problem of localizing the object of a person's action (the interactee) in novel images. It proposes a method that predicts interactee saliency maps from pose, gaze, and scene cues, and demonstrates the method's utility in tasks such as object detection and image retargeting on a dataset of more than 10,000 images.

Understanding images with people often entails understanding their interactions with other objects or people. As such, given a novel image, a vision system ought to infer which other objects/people play an important role in a given person's activity. However, existing methods are limited to learning action-specific interactions (e.g., how the pose of a tennis player relates to the position of his racquet when serving the ball) for improved recognition, making them unequipped to reason about novel interactions with actions or objects unobserved in the training data. We propose to predict the "interactee" in novel images, that is, to localize the object of a person's action. Given an arbitrary image with a detected person, the goal is to produce a saliency map indicating the most likely positions and scales where that person's interactee would be found. To that end, we explore ways to learn the generic, action-independent connections between (a) representations of a person's pose, gaze, and scene cues and (b) the interactee object's position and scale. We provide results on a newly collected UT Interactee dataset spanning more than 10,000 images from SUN, PASCAL, and COCO. We show that the proposed interaction-informed saliency metric has practical utility for four tasks: contextual object detection, image retargeting, predicting object importance, and data-driven natural language scene description. All four scenarios reveal the value in linking the subject to its object in order to understand the story of an image.
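
As a rough illustration of the output described in the abstract, the sketch below turns a few weighted position-and-scale hypotheses into a normalized saliency map. It is not the authors' method: the paper learns the mapping from pose, gaze, and scene cues, whereas here the hypotheses are simply given as input. The function name `interactee_saliency`, the `(cx, cy, area, weight)` tuple layout, and the choice of a Gaussian blob whose spread follows the square root of the area are all assumptions made for illustration.

```python
import math


def interactee_saliency(width, height, hypotheses):
    """Accumulate weighted Gaussian blobs into a saliency map scaled to [0, 1].

    hypotheses: list of (cx, cy, area, weight) tuples in pixel units.
    These stand in for the output of some learned predictor (hypothetical).
    """
    grid = [[0.0] * width for _ in range(height)]
    for cx, cy, area, weight in hypotheses:
        # Assumption: the blob's spread grows with the predicted object size.
        sigma = math.sqrt(area) / 2.0
        for y in range(height):
            for x in range(width):
                d2 = (x - cx) ** 2 + (y - cy) ** 2
                grid[y][x] += weight * math.exp(-d2 / (2.0 * sigma ** 2))
    peak = max(max(row) for row in grid)
    if peak > 0:
        grid = [[v / peak for v in row] for row in grid]
    return grid


def argmax_xy(grid):
    """Return the (x, y) coordinates of the largest value in the map."""
    best = max(
        ((v, x, y) for y, row in enumerate(grid) for x, v in enumerate(row)),
        key=lambda t: t[0],
    )
    return best[1], best[2]


# Two made-up hypotheses for a 40x30 image: a likely one and a less likely one.
hypotheses = [(30, 20, 100.0, 0.7), (10, 10, 64.0, 0.3)]
saliency = interactee_saliency(40, 30, hypotheses)
peak_xy = argmax_xy(saliency)
```

The map peaks where the highest-weighted hypothesis is centred, and a downstream task such as contextual detection or retargeting could use the values as a per-pixel prior.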

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
