ACD: Action Concept Discovery from Image-Sentence Corpora
The work addresses the problem of scaling action classification without manual labeling, which is relevant to computer vision researchers. Its contribution is incremental, since it builds on existing methods such as deep networks and word embeddings.
The paper tackles the challenge of action classification in still images by automatically discovering and learning over a hundred human action concept classifiers from weakly supervised image-sentence corpora, achieving promising classification results on the PASCAL VOC 2012 benchmark.
Action classification in still images is an important task in computer vision. It is challenging as the appearances of actions may vary depending on their context (e.g. associated objects). Manual labeling of context information would be time consuming and difficult to scale up. To address this challenge, we propose a method to automatically discover and cluster action concepts, and learn their classifiers from weakly supervised image-sentence corpora. It obtains candidate action concepts by extracting verb-object pairs from sentences and verifies their visualness with the associated images. Candidate action concepts are then clustered by using a multi-modal representation with image embeddings from deep convolutional networks and text embeddings from word2vec. More than one hundred human action concept classifiers are learned from the Flickr30k dataset with no additional human effort and promising classification results are obtained. We further apply the AdaBoost algorithm to automatically select and combine relevant action concepts given an action query. Promising results have been shown on the PASCAL VOC 2012 action classification benchmark, which has zero overlap with Flickr30k.
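The concept selection step can be illustrated with a minimal sketch of discrete AdaBoost over concept classifier scores. The abstract does not state which weak learners are used, so the decision stumps below (one threshold on one concept score) are an assumption, and the concept names, scores, and labels are toy values invented for illustration rather than data from the paper.

```python
import math

def stump(row, j, thr, pol):
    """Weak learner: threshold the score of concept j (assumed form)."""
    return pol if row[j] >= thr else -pol

def train_adaboost(scores, labels, rounds=3):
    """scores[i][j] = score of concept j on image i; labels are +1 / -1."""
    n, d = len(scores), len(scores[0])
    weights = [1.0 / n] * n
    ensemble = []  # entries: (concept_index, threshold, polarity, alpha)
    for _ in range(rounds):
        best = None
        # Exhaustively pick the stump with the lowest weighted error.
        for j in range(d):
            for thr in sorted({row[j] for row in scores}):
                for pol in (1, -1):
                    err = sum(w for w, row, y in zip(weights, scores, labels)
                              if stump(row, j, thr, pol) != y)
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol)
        raw_err, j, thr, pol = best
        err = min(max(raw_err, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((j, thr, pol, alpha))
        if raw_err < 1e-10:
            break  # a single concept already separates the training set
        # Re-weight: misclassified images gain weight, then normalize.
        weights = [w * math.exp(-alpha * y * stump(row, j, thr, pol))
                   for w, row, y in zip(weights, scores, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, row):
    s = sum(alpha * stump(row, j, thr, pol) for j, thr, pol, alpha in ensemble)
    return 1 if s >= 0 else -1

# Toy data (hypothetical): three discovered concepts, query "riding horse".
concepts = ["ride horse", "sit on horse", "play guitar"]
scores = [[0.9, 0.8, 0.1], [0.8, 0.3, 0.2], [0.7, 0.9, 0.0],
          [0.2, 0.1, 0.9], [0.1, 0.4, 0.8], [0.3, 0.2, 0.1]]
labels = [1, 1, 1, -1, -1, -1]

ensemble = train_adaboost(scores, labels)
selected = [concepts[j] for j, _, _, _ in ensemble]
print(selected)  # ['ride horse']
```

In this toy case one concept separates the positives from the negatives, so boosting stops after the first round. With noisier scores, later rounds would add further concepts whose weights reflect how much each one reduces the remaining error.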