Learning Using Privileged Information for Zero-Shot Action Recognition
This work addresses the challenge of recognizing unseen video actions, which matters for applications such as video analysis, but the contribution is incremental because it builds on existing zero-shot learning frameworks.
The paper tackles the problem of zero-shot action recognition by using object semantics as privileged information to narrow the semantic gap between visual and semantic spaces, resulting in state-of-the-art performance on Olympic Sports, HMDB51, and UCF101 datasets.
Zero-Shot Action Recognition (ZSAR) aims to recognize video actions that have never been seen during training. Most existing methods assume a shared semantic space between seen and unseen actions and attempt to directly learn a mapping from the visual space to that semantic space. This approach is hampered by the semantic gap between the two spaces. This paper presents a novel method that uses object semantics as privileged information to narrow the semantic gap and thereby assist learning. In particular, a simple hallucination network is proposed to implicitly extract object semantics during testing without explicitly extracting objects, and a cross-attention module is developed to augment visual features with the object semantics. Experiments on the Olympic Sports, HMDB51, and UCF101 datasets show that the proposed method outperforms state-of-the-art methods by a large margin.
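As a rough illustration of the two ideas in the abstract, the sketch below shows how a visual feature might attend over a set of object-semantic vectors and how the augmented feature could then be matched to unseen class embeddings. It is not the paper's architecture: the function names, the residual addition, the absence of learned projections, and the cosine nearest-neighbour classifier are all assumptions made for illustration. In the described method, the `object_semantics` argument would come from the hallucination network at test time; here it is simply passed in.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(visual, object_semantics):
    """Augment one visual feature with attention-weighted object semantics.

    Assumption: the visual feature acts as the query and each object-semantic
    vector acts as both key and value, with no learned projections.
    """
    scale = math.sqrt(len(visual))
    weights = softmax([dot(visual, obj) / scale for obj in object_semantics])
    context = [
        sum(w * obj[i] for w, obj in zip(weights, object_semantics))
        for i in range(len(visual))
    ]
    # Residual addition is an illustrative choice, not taken from the paper.
    augmented = [v + c for v, c in zip(visual, context)]
    return augmented, weights


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def classify_unseen(feature, class_embeddings):
    """Return the unseen class whose semantic embedding is closest to the feature."""
    return max(class_embeddings, key=lambda name: cosine(feature, class_embeddings[name]))
```

For example, calling `cross_attend([1.0, 0.0, 0.5], [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])` puts more attention weight on the first object vector, because it is better aligned with the visual feature. The class labels and embeddings passed to `classify_unseen` would be hypothetical word-vector style representations of unseen action names.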