CV · Jun 20, 2014

Predicting Motivations of Actions by Leveraging Text

Carl Vondrick, Deniz Oktay, Hamed Pirsiavash, Antonio Torralba

arXiv:1406.5472v2 · 43 citations

AI Analysis

This paper addresses the challenge of understanding human activity, with applications such as action anticipation and explanation. Its contribution is incremental, however, because it builds on existing methods and adds a new dataset.

The paper tackles the problem of predicting why a person in an image is performing an action, a step beyond action recognition, by using natural language models to mine knowledge from text. The results suggest that transferring knowledge from language to vision helps machines infer motivations, though full understanding of motivation remains distant.

Understanding human actions is a key problem in computer vision. However, recognizing actions is only the first step of understanding what a person is doing. In this paper, we introduce the problem of predicting why a person has performed an action in images. This problem has many applications in human activity understanding, such as anticipating or explaining an action. To study this problem, we introduce a new dataset of people performing actions annotated with likely motivations. However, the information in an image alone may not be sufficient to automatically solve this task. Since humans can rely on their lifetime of experiences to infer motivation, we propose to give computer vision systems access to some of these experiences by using recently developed natural language models to mine knowledge stored in massive amounts of text. While we are still far away from fully understanding motivation, our results suggest that transferring knowledge from language into vision can help machines understand why people in images might be performing an action.
