CV | Sep 29, 2020

Lip-reading with Densely Connected Temporal Convolutional Networks

arXiv:2009.14233v3 | 77 citations
Originality: Incremental advance
AI Analysis

This work aims to improve lip-reading accuracy for isolated words. The advance is incremental: it builds on existing Temporal Convolutional Networks, adding dense connections and Squeeze-and-Excitation blocks.

The paper tackles lip-reading of isolated words by introducing a Densely Connected Temporal Convolutional Network (DC-TCN) with Squeeze-and-Excitation blocks. It reports 88.36% accuracy on the LRW dataset and 43.65% on the LRW-1000 dataset, surpassing all baselines and setting a new state of the art on both.

In this work, we present the Densely Connected Temporal Convolutional Network (DC-TCN) for lip-reading of isolated words. Although Temporal Convolutional Networks (TCNs) have recently demonstrated great potential in many vision tasks, their receptive fields are not dense enough to model the complex temporal dynamics in lip-reading scenarios. To address this problem, we introduce dense connections into the network to capture more robust temporal features. Moreover, our approach utilises the Squeeze-and-Excitation block, a light-weight attention mechanism, to further enhance the model's classification power. Without bells and whistles, our DC-TCN method achieves 88.36% accuracy on the Lip Reading in the Wild (LRW) dataset and 43.65% on the LRW-1000 dataset, surpassing all the baseline methods and setting a new state of the art on both datasets.
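
The two ideas named in the abstract, dense connections between temporal convolution layers and Squeeze-and-Excitation channel gating, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the helper names (`dilated_conv1d`, `dense_block`, `squeeze_excite`, `const_kernels`), the constant weights, the growth rate of 2, the dilations `[1, 2]`, and the placement of the gating after the dense block are all assumptions chosen only to keep the example small.

```python
import math

def dilated_conv1d(features, kernels, dilation):
    """'Same'-padded dilated temporal convolution followed by ReLU.
    features: list of C_in channels, each a list of T floats.
    kernels:  C_out x C_in x k weights (k odd); values here are hypothetical."""
    T = len(features[0])
    half = (len(kernels[0][0]) - 1) // 2
    out = []
    for kern in kernels:
        row = []
        for t in range(T):
            acc = 0.0
            for c, taps in enumerate(kern):
                for j, w in enumerate(taps):
                    idx = t + (j - half) * dilation
                    if 0 <= idx < T:  # zero padding outside the sequence
                        acc += w * features[c][idx]
            row.append(max(0.0, acc))
        out.append(row)
    return out

def dense_block(features, layer_kernels, dilations):
    """Dense connectivity: each layer sees the concatenation of the block
    input and all earlier layer outputs, and appends its own output."""
    for kernels, d in zip(layer_kernels, dilations):
        new = dilated_conv1d(features, kernels, d)
        features = features + new  # channel-wise concatenation
    return features

def squeeze_excite(features, w1, w2):
    """Squeeze-and-Excitation over a (channels x time) feature map.
    w1: (C/r) x C and w2: C x (C/r) weights (hypothetical values)."""
    # Squeeze: global average pooling over the time axis.
    squeezed = [sum(ch) / len(ch) for ch in features]
    # Excitation: bottleneck FC -> ReLU -> FC -> sigmoid gives one gate per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: reweight every channel by its gate.
    return [[g * x for x in ch] for g, ch in zip(gates, features)]

def const_kernels(out_ch, in_ch, k=3, value=0.1):
    """Illustrative constant weights; a real model would learn these."""
    return [[[value] * k for _ in range(in_ch)] for _ in range(out_ch)]

# Toy input: 2 channels, 5 time steps.
x = [[1.0, 2.0, 3.0, 4.0, 5.0], [0.5, 0.5, 0.5, 0.5, 0.5]]
growth = 2  # channels added per layer (assumed value)
layers = [const_kernels(growth, 2), const_kernels(growth, 2 + growth)]
dense_out = dense_block(x, layers, dilations=[1, 2])

C = len(dense_out)
w1 = [[0.1] * C for _ in range(C // 2)]   # reduction ratio r = 2 (assumed)
w2 = [[0.1] * (C // 2) for _ in range(C)]
y = squeeze_excite(dense_out, w1, w2)
print(len(dense_out), len(y), len(y[0]))
```

In this sketch the channel count grows with every layer because earlier features are carried forward unchanged, so later layers combine inputs seen at several dilations at once. The gating step then rescales each channel by a value between 0 and 1 without changing the feature map's shape.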

Code Implementations: 1 repo
