CV CL | Oct 29, 2021

Visual Keyword Spotting with Attention

K R Prajwal, Liliane Momeni, Triantafyllos Afouras, Andrew Zisserman

arXiv:2110.15957v16.516 citations | Has Code

Originality: Incremental advance

AI Analysis

This work addresses visual keyword spotting, with applications such as lip reading and sign language analysis. It is an incremental improvement built on a novel architecture.

The paper tackles the problem of spotting spoken keywords in silent video sequences by proposing a Transformer-based model called Transpotter, which uses cross-modal attention between visual and phonetic streams and achieves state-of-the-art performance on LRW, LRS2, and LRS3 datasets.

In this paper, we consider the task of spotting spoken keywords in silent video sequences -- also known as visual keyword spotting. To this end, we investigate Transformer-based models that ingest two streams, a visual encoding of the video and a phonetic encoding of the keyword, and output the temporal location of the keyword if present. Our contributions are as follows: (1) We propose a novel architecture, the Transpotter, that uses full cross-modal attention between the visual and phonetic streams; (2) We show through extensive evaluations that our model outperforms the prior state-of-the-art visual keyword spotting and lip reading methods on the challenging LRW, LRS2, LRS3 datasets by a large margin; (3) We demonstrate the ability of our model to spot words under the extreme conditions of isolated mouthings in sign language videos.
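The two-stream idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: it assumes that "full cross-modal attention" can be approximated by one single-head self-attention layer over the concatenated phoneme and video tokens, and it uses untrained random weights. The `[CLS]`-style token, the two sigmoid heads, and all names (`spot_keyword`, `joint_attention`, `params`) are hypothetical choices made for illustration only.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matmul(a, b):
    """Multiply matrix a (n x d) by matrix b (d x m), both as lists of rows."""
    cols = list(zip(*b))
    return [[dot(row, col) for col in cols] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def random_matrix(rows, cols, rng):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def joint_attention(tokens, wq, wk, wv):
    """Single-head self-attention over all tokens, so every video frame can
    attend to every phoneme and vice versa."""
    q, k, v = matmul(tokens, wq), matmul(tokens, wk), matmul(tokens, wv)
    scale = math.sqrt(len(q[0]))
    weights = [softmax([dot(qi, kj) / scale for kj in k]) for qi in q]
    return matmul(weights, v)


def spot_keyword(video_feats, phoneme_embs, params):
    """Return (presence probability, per-frame localisation probabilities)."""
    # Assumed layout: [CLS] + phoneme tokens + video frame tokens.
    tokens = [params["cls"]] + phoneme_embs + video_feats
    out = joint_attention(tokens, params["wq"], params["wk"], params["wv"])
    cls_out = out[0]
    frame_out = out[1 + len(phoneme_embs):]
    presence = sigmoid(dot(params["w_cls"], cls_out))
    frame_probs = [sigmoid(dot(params["w_loc"], f)) for f in frame_out]
    return presence, frame_probs


rng = random.Random(0)
d = 8  # hypothetical feature dimension
params = {
    "cls": [rng.gauss(0.0, 0.1) for _ in range(d)],
    "wq": random_matrix(d, d, rng),
    "wk": random_matrix(d, d, rng),
    "wv": random_matrix(d, d, rng),
    "w_cls": [rng.gauss(0.0, 0.1) for _ in range(d)],
    "w_loc": [rng.gauss(0.0, 0.1) for _ in range(d)],
}
video = random_matrix(20, d, rng)    # 20 frames of visual features
phonemes = random_matrix(5, d, rng)  # 5 phoneme embeddings for the keyword
presence, frame_probs = spot_keyword(video, phonemes, params)
print(len(frame_probs))
```

With trained weights, `presence` would indicate whether the keyword occurs in the clip, and the peak of `frame_probs` would give its temporal location. Here both are meaningless numbers that only demonstrate the data flow.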
