CV · CL · Oct 29, 2021

Visual Keyword Spotting with Attention

arXiv:2110.15957v1 · 16 citations
Originality: Incremental advance
AI Analysis

This work addresses visual keyword spotting, with applications such as lip reading and sign language analysis. It is an incremental advance built around a novel architecture.

The paper tackles the problem of spotting spoken keywords in silent video sequences by proposing a Transformer-based model called Transpotter, which uses cross-modal attention between visual and phonetic streams and achieves state-of-the-art performance on LRW, LRS2, and LRS3 datasets.

In this paper, we consider the task of spotting spoken keywords in silent video sequences -- also known as visual keyword spotting. To this end, we investigate Transformer-based models that ingest two streams, a visual encoding of the video and a phonetic encoding of the keyword, and output the temporal location of the keyword if present. Our contributions are as follows: (1) We propose a novel architecture, the Transpotter, that uses full cross-modal attention between the visual and phonetic streams; (2) We show through extensive evaluations that our model outperforms the prior state-of-the-art visual keyword spotting and lip reading methods on the challenging LRW, LRS2, LRS3 datasets by a large margin; (3) We demonstrate the ability of our model to spot words under the extreme conditions of isolated mouthings in sign language videos.
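The two-stream design described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that "full cross-modal attention" can be approximated by a single self-attention layer over the concatenation of video-frame features and keyword phoneme embeddings, followed by a per-frame sigmoid score as a stand-in for temporal localisation. All names (`joint_self_attention`, `spot_keyword`), the feature dimension, and the random, untrained weights are hypothetical and chosen only for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Random (untrained) projection matrix; a placeholder for learned weights."""
    scale = 1.0 / math.sqrt(rows)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def joint_self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over the joint token sequence.

    Because video and phoneme tokens share one sequence, every token can
    attend to every other token in either stream.
    """
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    dim = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    return matmul(weights, v), weights


def spot_keyword(video_feats, phoneme_embs, seed=0):
    """Return one keyword score per video frame, plus the attention weights.

    Assumption: a linear layer plus sigmoid on each attended video token
    stands in for the model's temporal-location output.
    """
    rng = random.Random(seed)
    dim = len(video_feats[0])
    tokens = video_feats + phoneme_embs  # concatenate the two streams
    w_q, w_k, w_v = (random_matrix(dim, dim, rng) for _ in range(3))
    attended, weights = joint_self_attention(tokens, w_q, w_k, w_v)
    w_out = [rng.gauss(0.0, 1.0 / math.sqrt(dim)) for _ in range(dim)]
    frame_logits = [
        sum(x * w for x, w in zip(vec, w_out))
        for vec in attended[: len(video_feats)]  # keep only the video positions
    ]
    frame_probs = [1.0 / (1.0 + math.exp(-z)) for z in frame_logits]
    return frame_probs, weights


# Toy inputs: 6 video frames and 3 phonemes, each an 8-dimensional vector.
_rng = random.Random(1)
video = [[_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(6)]
phonemes = [[_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(3)]
frame_probs, attn = spot_keyword(video, phonemes)
```

With untrained weights the scores carry no meaning. The sketch only shows the data flow: both streams enter one attention operation, and the output is read off at the video positions, one score per frame.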

Code Implementations: 1 repo
