LG CV SD AS IV

Sep 1, 2021

FaVoA: Face-Voice Association Favours Ambiguous Speaker Detection

Hugo Carneiro, Cornelius Weber, Stefan Wermter

arXiv:2109.00577v1

5.59 citations

Originality: Incremental advance

AI Analysis

This work addresses ambiguous speaker detection in multi-person scenes for video analysis systems, representing an incremental improvement.

The paper tackles ambiguous active speaker detection by incorporating a face-voice association neural network into an existing model. The resulting model, FaVoA, correctly classifies challenging scenarios and rules out non-matching face-voice associations.

The strong relation between face and voice can aid active speaker detection systems when faces are visible, even in difficult settings, when the face of a speaker is not clear or when there are several people in the same scene. By being capable of estimating the frontal facial representation of a person from his/her speech, it becomes easier to determine whether he/she is a potential candidate for being classified as an active speaker, even in challenging cases in which no mouth movement is detected from any person in that same scene. By incorporating a face-voice association neural network into an existing state-of-the-art active speaker detection model, we introduce FaVoA (Face-Voice Association Ambiguous Speaker Detector), a neural network model that can correctly classify particularly ambiguous scenarios. FaVoA not only finds positive associations, but helps to rule out non-matching face-voice associations, where a face does not match a voice. Its use of a gated-bimodal-unit architecture for the fusion of those models offers a way to quantitatively determine how much each modality contributes to the classification.
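The abstract's point about measuring each modality's contribution can be illustrated with a generic gated unit for two modalities. The sketch below is a minimal version under stated assumptions, not FaVoA's actual fusion layer: the function and variable names, the feature sizes, and the random untrained weights are all hypothetical. Each branch is squashed with tanh, and a sigmoid gate `z`, computed from both inputs, mixes them per dimension as `z * h_face + (1 - z) * h_voice`. Averaging `z` then gives a rough share for the face branch.

```python
import math
import random


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gated_bimodal_unit(x_face, x_voice, w_face, w_voice, w_gate):
    """Fuse two feature vectors with a per-dimension gate.

    Returns the fused vector and the gate values z, where z[i] is the
    weight given to the face branch and 1 - z[i] to the voice branch.
    """
    h_face = [math.tanh(v) for v in matvec(w_face, x_face)]
    h_voice = [math.tanh(v) for v in matvec(w_voice, x_voice)]
    # The gate sees the concatenation of both raw inputs.
    z = [sigmoid(v) for v in matvec(w_gate, x_face + x_voice)]
    fused = [zi * hf + (1.0 - zi) * hv for zi, hf, hv in zip(z, h_face, h_voice)]
    return fused, z


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for trained weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
hidden, d_face, d_voice = 4, 3, 2  # arbitrary illustrative sizes
x_face = [0.5, -1.0, 0.25]
x_voice = [1.5, -0.5]
fused, z = gated_bimodal_unit(
    x_face,
    x_voice,
    random_matrix(hidden, d_face, rng),
    random_matrix(hidden, d_voice, rng),
    random_matrix(hidden, d_face + d_voice, rng),
)
face_share = sum(z) / len(z)
print(f"face contribution: {face_share:.2f}, voice contribution: {1 - face_share:.2f}")
```

Because the gate values lie strictly between 0 and 1, the fused vector is a convex combination of the two branches, which is what makes the mean gate value readable as a per-modality weight. In a trained model the weights would be learned and the shares would be averaged over many samples.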
