Zero-Shot Activity Recognition with Videos
This work addresses the problem of recognizing unseen activities in videos for computer vision applications, representing an incremental advancement in zero-shot learning.
The paper tackles zero-shot activity recognition from videos by introducing an auto-encoder model that builds a multimodal joint embedding space between visual and textual features. Results are evaluated with top-n accuracy and mean Nearest Neighbor Overlap.
In this paper, we examine the task of zero-shot activity recognition from videos. We introduce an auto-encoder-based model to construct a multimodal joint embedding space between the visual and textual manifolds. On the visual side, we use activity videos and a state-of-the-art 3D convolutional action recognition network to extract features. On the textual side, we work with GloVe word embeddings. Zero-shot recognition results are evaluated by top-n accuracy, and the manifold learning ability is measured by mean Nearest Neighbor Overlap. Finally, we provide an extensive discussion of the results and future directions.
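To make the two evaluation measures concrete, the following is a minimal sketch of how top-n accuracy and mean Nearest Neighbor Overlap could be computed once visual features have been projected into the joint space. It is an illustration under stated assumptions, not the paper's implementation: cosine similarity as the distance measure, the function names (`top_n_accuracy`, `mean_nno`), and the data layout (plain lists of floats, a dictionary from class name to word vector) are all hypothetical choices, and the overlap is normalized by k times the number of points, which is one common way to define the measure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_n_accuracy(projected, labels, class_vectors, n):
    """Fraction of samples whose true class is among the n most similar class vectors.

    projected     -- visual features already mapped into the joint space (assumed input)
    labels        -- true class name for each projected feature
    class_vectors -- dict: class name -> word embedding of that (unseen) class
    """
    hits = 0
    for vec, label in zip(projected, labels):
        # Rank every candidate class by similarity to the projected video feature.
        ranked = sorted(class_vectors,
                        key=lambda c: cosine(vec, class_vectors[c]),
                        reverse=True)
        if label in ranked[:n]:
            hits += 1
    return hits / len(labels)


def nearest_neighbors(points, i, k):
    """Indices of the k points most similar to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: cosine(points[i], points[j]), reverse=True)
    return set(others[:k])


def mean_nno(space_a, space_b, k):
    """Mean Nearest Neighbor Overlap between two spaces holding paired points.

    For each index i, count how many of its k nearest neighbors are shared
    between space_a and space_b, then average and normalize to [0, 1].
    """
    n = len(space_a)
    shared = sum(
        len(nearest_neighbors(space_a, i, k) & nearest_neighbors(space_b, i, k))
        for i in range(n)
    )
    return shared / (k * n)
```

Under these assumptions, `top_n_accuracy` would be called with the projected features of test videos from unseen classes, and `mean_nno` with the same set of items represented in two spaces (for example, the projected visual features and the corresponding word embeddings). A value near 1 indicates that the two spaces have similar neighborhood structure.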