CV · IV · Dec 28, 2020

Lip-reading with Hierarchical Pyramidal Convolution and Self-Attention

arXiv:2012.14360v1 · 6 citations
AI Analysis

This work provides an incremental improvement in word-level lip-reading accuracy for general applications.

This paper addresses word-level lip-reading by proposing a new deep learning architecture. The model achieved 86.83% accuracy on the LRW dataset, representing a 1.53% absolute improvement over the previous state-of-the-art.

In this paper, we propose a novel deep learning architecture to improve word-level lip-reading. On the one hand, we introduce multi-scale processing into the spatial feature extraction for lip-reading. Specifically, we propose hierarchical pyramidal convolution (HPConv) to replace the standard convolution in the original module, improving the model's ability to discover fine-grained lip movements. On the other hand, we merge information across all time steps of the sequence using self-attention, so that the model pays more attention to the relevant frames. These two advantages are combined to further enhance the model's classification power. Experiments on the Lip Reading in the Wild (LRW) dataset show that our proposed model achieves 86.83% accuracy, a 1.53% absolute improvement over the current state-of-the-art. We also conduct extensive experiments to better understand the behavior of the proposed model.
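
To make the temporal part of the abstract concrete, here is a minimal sketch of scaled dot-product self-attention over a sequence of per-frame feature vectors. It is illustrative only, not the paper's implementation: the abstract does not give the attention formulation, so the identity query/key/value projections, the function names (`softmax`, `self_attention`), and the toy 2-dimensional features are all assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Scaled dot-product self-attention across time steps.

    Assumption: queries, keys and values are the frame features themselves
    (identity projections); a trained model would learn these projections.
    Returns the attended features and the attention weights per time step.
    """
    dim = len(frames[0])
    scale = math.sqrt(dim)
    attended, weights = [], []
    for query in frames:
        # Similarity of this frame to every frame in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in frames]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of all frames: information from every time step is merged.
        attended.append(
            [sum(w[t] * frames[t][j] for t in range(len(frames))) for j in range(dim)]
        )
    return attended, weights


# Toy sequence: frames 0 and 1 are similar, frame 2 is different.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
attended, weights = self_attention(frames)
```

Each output frame is a weighted mix of all frames, with larger weights on frames whose features resemble the query frame; in the toy input, frame 0 weights frame 1 more heavily than frame 2.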

