IV, CV, AS · May 29, 2018

Can DNNs Learn to Lipread Full Sentences?

arXiv:1805.11685v1 · 9 citations
Originality: Incremental advance
AI Analysis

This paper addresses automated lipreading of complex, unconstrained speech, offering a significant, though incremental, advance in visual speech recognition.

The paper tackles lipreading of full sentences by exploring state-of-the-art Deep Neural Network architectures, and reports a major improvement over a Hidden Markov Model framework on the TCD-TIMIT dataset, which has 59 speakers and a vocabulary of over 6000 words.

Finding visual features and suitable models for lipreading tasks that are more complex than a well-constrained vocabulary has proven challenging. This paper explores state-of-the-art Deep Neural Network architectures for lipreading based on a Sequence to Sequence Recurrent Neural Network. We report results for both hand-crafted and 2D/3D Convolutional Neural Network visual front-ends, online monotonic attention, and a joint Connectionist Temporal Classification-Sequence-to-Sequence loss. The system is evaluated on the publicly available TCD-TIMIT dataset, with 59 speakers and a vocabulary of over 6000 words. Results show a major improvement on a Hidden Markov Model framework. A fuller analysis of performance across visemes demonstrates that the network is not only learning the language model, but actually learning to lipread.
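The abstract names a joint Connectionist Temporal Classification-Sequence-to-Sequence loss without giving its form. A minimal sketch follows, assuming the commonly used linear interpolation between the two terms. The weight `lam` and all function names are hypothetical and not taken from the paper, and the inputs are plain lists of log-probabilities rather than real network outputs.

```python
import math

NEG_INF = float("-inf")

def _logsumexp(values):
    # Numerically stable log(sum(exp(v))) over a list of log-values.
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))

def ctc_nll(frame_log_probs, target, blank=0):
    # Negative log-likelihood of `target` under CTC, via the forward algorithm.
    # frame_log_probs: T x V list of per-frame log-probabilities.
    ext = [blank]
    for label in target:
        ext += [label, blank]          # interleave blanks: b, y1, b, y2, b, ...
    size = len(ext)
    alpha = [NEG_INF] * size
    alpha[0] = frame_log_probs[0][ext[0]]
    if size > 1:
        alpha[1] = frame_log_probs[0][ext[1]]
    for t in range(1, len(frame_log_probs)):
        new = [NEG_INF] * size
        for s in range(size):
            cands = [alpha[s]]
            if s >= 1:
                cands.append(alpha[s - 1])
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])
            new[s] = _logsumexp(cands) + frame_log_probs[t][ext[s]]
        alpha = new
    tail = [alpha[-1]] + ([alpha[-2]] if size > 1 else [])
    return -_logsumexp(tail)

def seq2seq_nll(decoder_log_probs, target):
    # Cross-entropy of the decoder: one log-probability row per target token.
    return -sum(row[tok] for row, tok in zip(decoder_log_probs, target))

def joint_loss(frame_log_probs, decoder_log_probs, target, lam=0.5):
    # Assumed form: lam * L_CTC + (1 - lam) * L_seq2seq.
    return (lam * ctc_nll(frame_log_probs, target)
            + (1.0 - lam) * seq2seq_nll(decoder_log_probs, target))
```

With `lam=1.0` this reduces to a pure CTC objective and with `lam=0.0` to a pure Sequence-to-Sequence objective; the value the authors used, if they used this form at all, is not stated in the abstract.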
