CV Jun 21, 2018

Learning Multimodal Representations for Unseen Activities

arXiv:1806.08251v48.311 citations Has Code

Originality: Incremental advance

AI Analysis

This work addresses activity recognition in videos, with applications such as surveillance and content analysis. The advance is incremental: it extends existing multimodal representation learning with an adversarial formulation.

The paper tackles the recognition of unseen activities in videos by learning a joint multimodal representation space from paired and unpaired text and video data. The authors report improved performance on zero-shot classification, unsupervised discovery, and captioning, outperforming state-of-the-art methods.

We present a method to learn a joint multimodal representation space that enables recognition of unseen activities in videos. We first compare the effect of placing various constraints on the embedding space using paired text and video data. We also propose a method to improve the joint embedding space using an adversarial formulation, allowing it to benefit from unpaired text and video data. By using unpaired text data, we show the ability to learn a representation that better captures unseen activities. In addition to testing on publicly available datasets, we introduce a new, large-scale text/video dataset. We experimentally confirm that using paired and unpaired data to learn a shared embedding space benefits three difficult tasks: (i) zero-shot activity classification, (ii) unsupervised activity discovery, and (iii) unseen activity captioning, outperforming the state of the art.
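
To make the idea of zero-shot classification in a shared embedding space concrete, here is a minimal sketch. It is not the paper's implementation: it assumes that video and text encoders (not shown) have already mapped a video clip and each candidate activity label into the same vector space, and it uses cosine similarity as the matching rule. The function names and the toy three-dimensional vectors are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(video_embedding, label_embeddings):
    """Return the label whose text embedding is closest to the video embedding.

    label_embeddings maps label name -> text embedding in the shared space.
    The labels need not have appeared among the training videos; they only
    need a text embedding (this is the assumed zero-shot setting).
    """
    return max(
        label_embeddings,
        key=lambda label: cosine(video_embedding, label_embeddings[label]),
    )


# Toy, hand-written embeddings (hypothetical values, not real model outputs).
label_embeddings = {
    "surfing": [0.9, 0.1, 0.0],
    "juggling": [0.0, 0.2, 0.9],
}
video_embedding = [0.8, 0.3, 0.1]

predicted = zero_shot_classify(video_embedding, label_embeddings)
print(predicted)
```

In this setup, adding a new activity only requires embedding its text description; no labeled video of that activity is needed at classification time.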
