Deep Multimodal Speaker Naming
This work addresses the challenge of identifying speakers in complex video scenes for applications such as TV and movie analysis, though the contribution is incremental in that it builds on existing multimodal approaches.
The paper tackles the problem of automatic speaker naming in videos by proposing a CNN-based framework that learns to fuse face and audio cues, achieving state-of-the-art performance on two TV series without relying on face tracking or subtitles.
Automatic speaker naming is the problem of localizing as well as identifying each speaking character in a TV/movie/live show video. The difficulty of this problem is mainly attributed to its multimodal nature: the face cue alone is insufficient to achieve good performance. Previous multimodal approaches usually process the data of different modalities individually and merge them using handcrafted heuristics. Such approaches work well for simple scenes, but fail to achieve high performance for speakers with large appearance variations. In this paper, we propose a novel convolutional neural network (CNN) based learning framework that automatically learns the fusion function of both face and audio cues. We show that, without using face tracking, facial landmark localization, or subtitles/transcripts, our system with robust multimodal feature extraction achieves state-of-the-art speaker naming performance when evaluated on two diverse TV series. The dataset and implementation of our algorithm are publicly available online.
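
To make the idea of a learned fusion function concrete, the following is a minimal sketch, not the architecture described in the paper. It assumes that a face feature vector and an audio feature vector have already been extracted for one candidate face, concatenates them, and scores each speaker with a single linear layer followed by a softmax. The names `fuse_and_score`, `FACE_DIM`, `AUDIO_DIM`, and `N_SPEAKERS`, the feature sizes, and the random weights are all hypothetical placeholders; in a trained system the weights would be learned from labeled data rather than sampled.

```python
import math
import random


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_score(face_feat, audio_feat, weights, bias):
    """Hypothetical fusion: concatenate both cues, then apply one linear layer.

    weights holds one row per speaker; each row spans the fused vector.
    """
    fused = list(face_feat) + list(audio_feat)
    scores = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(scores)


# Illustrative sizes only (assumptions, not values from the paper).
FACE_DIM, AUDIO_DIM, N_SPEAKERS = 8, 5, 3

random.seed(0)
face_feat = [random.gauss(0.0, 1.0) for _ in range(FACE_DIM)]
audio_feat = [random.gauss(0.0, 1.0) for _ in range(AUDIO_DIM)]

# Random stand-ins for parameters that training would normally learn.
weights = [
    [random.gauss(0.0, 0.1) for _ in range(FACE_DIM + AUDIO_DIM)]
    for _ in range(N_SPEAKERS)
]
bias = [0.0] * N_SPEAKERS

probs = fuse_and_score(face_feat, audio_feat, weights, bias)
predicted_speaker = max(range(N_SPEAKERS), key=lambda k: probs[k])
print(probs, predicted_speaker)
```

The point of the sketch is the contrast with handcrafted merging rules: here the relative influence of the face and audio cues is determined entirely by the entries of `weights`, which a learning procedure can adjust.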