Learning Models for Actions and Person-Object Interactions with Transfer to Question Answering
This work addresses the challenge of understanding human actions and person-object interactions in still images, with an application to visual question answering. The contribution is incremental in that it builds on existing deep learning methods.
The paper tackles the problem of predicting human activity labels in still images using deep convolutional networks that combine local and global context, achieving state-of-the-art performance on two datasets with hundreds of labels each. It also shows that features trained on these datasets improve accuracy on Visual Question Answering, specifically on multiple-choice questions about person activity and person-object relationships, outperforming generic ImageNet features.
This paper proposes deep convolutional network models that utilize local and global context to make human activity label predictions in still images, achieving state-of-the-art performance on two recent datasets with hundreds of labels each. We use multiple instance learning to handle the lack of supervision on the level of individual person instances, and weighted loss to handle unbalanced training data. Further, we show how specialized features trained on these datasets can be used to improve accuracy on the Visual Question Answering (VQA) task, in the form of multiple choice fill-in-the-blank questions (Visual Madlibs). Specifically, we tackle two types of questions on person activity and person-object relationship and show improvements over generic features trained on the ImageNet classification task.
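The two training devices named in the abstract, multiple instance learning and a weighted loss, can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the details here are assumptions: image-level scores are taken as the maximum over per-person scores (a common multiple instance learning choice), and class imbalance is handled by a per-label weight on positive examples in a binary cross-entropy loss. The function names `mil_image_scores` and `weighted_bce` are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def mil_image_scores(instance_scores):
    """Aggregate per-person label scores into image-level scores.

    instance_scores: one list of raw label scores per detected person.
    Assumption: the image-level score for a label is the max over persons,
    so only the best-matching person has to explain an image-level label.
    """
    num_labels = len(instance_scores[0])
    return [max(person[k] for person in instance_scores)
            for k in range(num_labels)]


def weighted_bce(scores, targets, pos_weights):
    """Binary cross-entropy averaged over labels, with weighted positives.

    Assumption: pos_weights[k] > 1 up-weights the rare positive examples
    of label k; the paper's actual weighting scheme may differ.
    """
    total = 0.0
    for s, t, w in zip(scores, targets, pos_weights):
        p = sigmoid(s)
        total -= w * t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(scores)


# Two detected persons, two activity labels; only image-level labels are known.
per_person = [[2.0, -1.0], [-3.0, 0.5]]
image_scores = mil_image_scores(per_person)
loss = weighted_bce(image_scores, targets=[1, 0], pos_weights=[5.0, 5.0])
```

Because the loss is computed on the aggregated scores, no annotation is needed to say which person in the image performs the labeled activity. A production version would compute the loss from logits in a numerically stable way rather than calling `math.log` on a sigmoid output.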