Object and Text-guided Semantics for CNN-based Activity Recognition
This work addresses activity recognition for video analysis by integrating object and text semantics, offering an incremental improvement over existing methods.
The paper tackles video-based human activity recognition by co-learning object recognition and using text-guided semantics to select relevant objects, improving baseline performance with a novel CNN approach.
Many previous methods have demonstrated the importance of semantically relevant objects for video-based human activity recognition, yet none has harnessed large text corpora to relate objects to activities and transfer that knowledge into the learning of a unified deep convolutional neural network. We present a novel activity recognition CNN that co-learns the object recognition task in an end-to-end multitask learning scheme to improve upon the baseline activity recognition performance. We further improve upon the multitask learning approach by exploiting a text-guided semantic space to select the objects most relevant to the target activities. To the best of our knowledge, we are the first to investigate this approach.
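To make the object-selection idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes that activity and object names are embedded in a shared text-derived vector space and that relevance is measured by cosine similarity; the abstract does not specify either choice. The toy vectors, the function names `cosine` and `select_relevant_objects`, and the `top_k` parameter are all hypothetical.

```python
import math

# Hypothetical toy vectors standing in for embeddings learned from a large
# text corpus. Real embeddings would have far more dimensions.
EMBEDDINGS = {
    "playing guitar": [0.9, 0.1, 0.0],
    "riding horse": [0.0, 0.9, 0.2],
    "guitar": [1.0, 0.0, 0.1],
    "horse": [0.1, 1.0, 0.1],
    "saddle": [0.0, 0.8, 0.3],
    "microwave": [0.1, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_relevant_objects(activities, objects, embeddings, top_k):
    """Return the union of the top_k most similar objects per activity.

    Assumption: similarity in the text embedding space is used as a proxy
    for how relevant an object class is to an activity class.
    """
    selected = set()
    for activity in activities:
        ranked = sorted(
            objects,
            key=lambda obj: cosine(embeddings[activity], embeddings[obj]),
            reverse=True,
        )
        selected.update(ranked[:top_k])
    return sorted(selected)


ACTIVITIES = ["playing guitar", "riding horse"]
OBJECTS = ["guitar", "horse", "saddle", "microwave"]

# Object classes kept for the auxiliary object recognition task.
relevant = select_relevant_objects(ACTIVITIES, OBJECTS, EMBEDDINGS, top_k=2)
print(relevant)
```

With these toy vectors, "microwave" is not among the two nearest objects of either activity and is dropped. In a multitask setting, only the selected object classes would then be used as targets for the co-learned object recognition task.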