Deep Multimodal Feature Encoding for Video Ordering
This work addresses the challenge of jointly analyzing videos across modalities for applications such as video understanding, though it appears incremental: it combines existing modalities with a new proxy task.
The paper tackles the problem of learning compact multimodal feature representations from videos by training on a proxy task: inferring the temporal ordering of unordered video clips, using a new dataset of approximately 30K scenes. The results show that multimodal representations are complementary and improve performance on tasks such as temporal ordering and action recognition.
True understanding of videos comes from a joint analysis of all their modalities: the video frames, the audio track, and any accompanying text such as closed captions. We present a way to learn a compact multimodal feature representation that encodes all these modalities. Our model parameters are learned through a proxy task of inferring the temporal ordering of a set of unordered videos in a timeline. To this end, we create a new multimodal dataset for temporal ordering that consists of approximately 30K scenes (2-6 clips per scene) based on the "Large Scale Movie Description Challenge". We analyze and evaluate the individual and joint modalities on two challenging tasks: (i) inferring the temporal ordering of a set of videos; and (ii) action recognition. We demonstrate empirically that multimodal representations are indeed complementary, and can play a key role in improving the performance of many applications.
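To make the proxy task concrete, the following is a minimal sketch of how one ordering sample could be built from a scene of 2-6 clips: the clips are shuffled, and the target records each clip's original timeline position. The function names (`make_ordering_sample`, `restore_order`) and the clip identifiers are hypothetical and chosen for illustration; the abstract does not specify the actual sample construction, model, or loss, so this should not be read as the authors' implementation.

```python
import random


def make_ordering_sample(scene_clips, rng):
    """Shuffle one scene's clips and return (shuffled_clips, target).

    target[i] is the original timeline position of shuffled_clips[i].
    Hypothetical helper: the 2-6 range mirrors the dataset description.
    """
    if not 2 <= len(scene_clips) <= 6:
        raise ValueError("expected 2-6 clips per scene")
    positions = list(range(len(scene_clips)))
    rng.shuffle(positions)
    shuffled = [scene_clips[p] for p in positions]
    return shuffled, positions


def restore_order(shuffled_clips, predicted_positions):
    """Place each clip at its predicted timeline position."""
    ordered = [None] * len(shuffled_clips)
    for clip, pos in zip(shuffled_clips, predicted_positions):
        ordered[pos] = clip
    return ordered


# Usage: with a perfect prediction, the original timeline is recovered.
scene = ["clip_a", "clip_b", "clip_c", "clip_d"]  # placeholder clip IDs
shuffled, target = make_ordering_sample(scene, random.Random(0))
assert restore_order(shuffled, target) == scene
```

In this framing, a model would receive the multimodal features of the shuffled clips and predict `target`; any clip-level encoder for frames, audio, and text could be substituted without changing how samples are constructed.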