CV AS Sep 2, 2020

Seeing wake words: Audio-visual Keyword Spotting

Liliane Momeni, Triantafyllos Afouras, Themos Stafylakis, Samuel Albanie, Andrew Zisserman

arXiv:2009.01225v1 14.047 citations

Originality: Highly original

AI Analysis

This work addresses robust keyword spotting when the audio is noisy or absent, with applications such as assistive technologies and surveillance.

The paper tackles the problem of detecting whether and when a specific word is spoken by a talking face, with or without audio, using a zero-shot method for in-the-wild videos. It introduces KWS-Net, which outperforms both the previous state-of-the-art visual keyword spotting architecture and a state-of-the-art lip reading method, and it generalizes to French and German with less language-specific data by fine-tuning a network pre-trained on English.

The goal of this work is to automatically determine whether and when a word of interest is spoken by a talking face, with or without the audio. We propose a zero-shot method suitable for in-the-wild videos. Our key contributions are: (1) a novel convolutional architecture, KWS-Net, that uses a similarity map intermediate representation to separate the task into (i) sequence matching, and (ii) pattern detection, to decide whether the word is there and when; (2) we demonstrate that if audio is available, visual keyword spotting improves performance for both clean and noisy audio signals. Finally, (3) we show that our method generalises to other languages, specifically French and German, and achieves performance comparable to English with less language-specific data, by fine-tuning the network pre-trained on English. The method exceeds the performance of the previous state-of-the-art visual keyword spotting architecture when trained and tested on the same benchmark, and also that of a state-of-the-art lip reading method.
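
The two-stage idea in contribution (1) can be illustrated with a toy sketch. This is not the authors' KWS-Net: the function names, the cosine similarity measure, the fixed diagonal pattern, and the threshold are all assumptions made for illustration, and the embeddings here are random vectors rather than learned features. In the paper the pattern detector is a learned convolutional network; here it is replaced by a hand-written diagonal score, under the simplifying assumption that each keyword unit lasts exactly one video frame.

```python
import numpy as np


def similarity_map(keyword_emb, video_emb):
    """Sequence matching: cosine similarity of every keyword unit with every frame.

    keyword_emb: (P, D) array, one row per keyword unit (hypothetical embedding).
    video_emb:   (T, D) array, one row per video frame (hypothetical embedding).
    Returns a (P, T) similarity map.
    """
    k = keyword_emb / np.linalg.norm(keyword_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    return k @ v.T


def detect(sim, threshold=0.8):
    """Pattern detection (toy version): look for a strong diagonal in the map.

    Scores each candidate start frame by the mean similarity along the diagonal
    that begins there. Assumes one frame per keyword unit and T >= P.
    Returns (found, best_start_frame, scores).
    """
    P, T = sim.shape
    if T < P:
        raise ValueError("video must have at least as many frames as keyword units")
    rows = np.arange(P)
    scores = np.array([sim[rows, t + rows].mean() for t in range(T - P + 1)])
    best = int(scores.argmax())
    return bool(scores[best] >= threshold), best, scores


# Synthetic example: plant a 4-unit keyword at frames 5..8 of a 12-frame video.
rng = np.random.default_rng(0)
keyword = rng.normal(size=(4, 16))
video = rng.normal(size=(12, 16))
video[5:9] = keyword

sim = similarity_map(keyword, video)
found, start, scores = detect(sim)
print(found, start)
```

The similarity map answers "which frames resemble which parts of the keyword", and the detector answers "is the word there, and when". Separating the two is what the abstract means by splitting the task into sequence matching and pattern detection.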
