CV · Mar 12, 2017

Combining Residual Networks with LSTMs for Lipreading

arXiv:1703.04105v4 · 346 citations
Originality: Incremental advance
AI Analysis

This work addresses lipreading for improved speech recognition in noisy environments, representing a strong incremental advance in a domain-specific area.

The paper tackles word-level visual speech recognition by proposing an end-to-end deep learning architecture combining spatiotemporal convolutional, residual, and bidirectional LSTM networks, achieving 83.0% word accuracy on the Lipreading In-The-Wild benchmark, a 6.8% absolute improvement over the state-of-the-art without using word boundary information.
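The "6.8% absolute improvement" is a difference in percentage points, not a ratio. As a small worked check using only the two figures quoted above, the snippet below derives the baseline accuracy they imply and restates the gain as a relative reduction in word error. The variable names are illustrative, and the error-rate framing is an added perspective rather than a figure reported in the paper.

```python
# Figures quoted in the summary (percent word accuracy).
proposed_accuracy = 83.0
absolute_gain = 6.8  # percentage points

# Accuracy of the previous best system implied by the two numbers above.
baseline_accuracy = proposed_accuracy - absolute_gain

# The same gain expressed as a relative reduction in word error.
baseline_error = 100.0 - baseline_accuracy
proposed_error = 100.0 - proposed_accuracy
relative_error_reduction = (baseline_error - proposed_error) / baseline_error

print(f"implied baseline accuracy: {baseline_accuracy:.1f}%")
print(f"relative error reduction:  {relative_error_reduction:.1%}")
```

Under these numbers the implied baseline is about 76.2% and the word error falls by roughly 29% relative to that baseline.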

We propose an end-to-end deep learning architecture for word-level visual speech recognition. The system combines spatiotemporal convolutional, residual, and bidirectional Long Short-Term Memory networks. We train and evaluate it on the Lipreading In-The-Wild benchmark, a challenging database with 500 target words, consisting of 1.28-second video excerpts from BBC TV broadcasts. The proposed network attains a word accuracy of 83.0%, a 6.8% absolute improvement over the current state of the art, without using information about word boundaries during training or testing.
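To make the three-stage design concrete, here is a minimal PyTorch sketch of a spatiotemporal convolutional front-end, a per-frame residual network, and a bidirectional LSTM back-end feeding a 500-way word classifier. Only the ordering of the stages and the 500-word output come from the abstract. The class names (`ResidualBlock`, `LipreadingSketch`), layer widths, kernel sizes, network depth, grayscale input, and the averaging of LSTM outputs over time are all assumptions chosen to keep the example small; they are not the authors' configuration.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Basic 2D residual block: two 3x3 convs plus an identity or projected skip."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Project the skip path only when the shape changes.
        self.skip = nn.Identity()
        if stride != 1 or in_ch != out_ch:
            self.skip = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        return torch.relu(self.body(x) + self.skip(x))


class LipreadingSketch(nn.Module):
    """Hypothetical, heavily shrunk stand-in for the described architecture."""

    def __init__(self, num_words=500, feat_dim=64, hidden=64):
        super().__init__()
        # Stage 1: 3D convolution over (time, height, width).
        # Temporal stride 1 keeps one feature map per input frame.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=(5, 7, 7), stride=(1, 2, 2),
                      padding=(2, 3, 3), bias=False),
            nn.BatchNorm3d(16),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # Stage 2: residual network applied to each frame independently.
        self.resnet = nn.Sequential(
            ResidualBlock(16, 32, stride=2),
            ResidualBlock(32, feat_dim, stride=2),
            nn.AdaptiveAvgPool2d(1),
        )
        # Stage 3: bidirectional LSTM over the per-frame feature sequence.
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_words)

    def forward(self, clips):
        # clips: (batch, 1, frames, height, width), grayscale mouth crops (assumed).
        x = self.frontend(clips)                       # (B, C, T, H', W')
        b, c, t, h, w = x.shape
        x = x.transpose(1, 2).reshape(b * t, c, h, w)  # fold time into the batch
        x = self.resnet(x).flatten(1)                  # (B*T, feat_dim)
        x = x.view(b, t, -1)                           # (B, T, feat_dim)
        out, _ = self.lstm(x)                          # (B, T, 2*hidden)
        # Assumption: average over time to get one prediction per clip.
        return self.classifier(out.mean(dim=1))        # (B, num_words)


model = LipreadingSketch()
logits = model(torch.randn(2, 1, 8, 32, 32))  # two random 8-frame 32x32 clips
```

Folding the time axis into the batch lets an ordinary 2D residual network process every frame with shared weights, while the front-end and the LSTM are the only parts that mix information across frames.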

Code Implementations: 4 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
