CV · CL · Dec 21, 2018

An Empirical Analysis of Deep Audio-Visual Models for Speech Recognition

arXiv:1812.09336v1
Originality: Synthesis-oriented
AI Analysis

This is an incremental study on audio-visual speech recognition for improving robustness in noisy environments.

The authors tackled speech recognition by predicting words from video and audio, re-implementing and modifying a state-of-the-art model to test attention mechanisms, residual networks, and noise sensitivity.

In this project, we worked on speech recognition, specifically predicting individual words from both video frames and audio. Powered by convolutional neural networks, recent speech recognition and lip-reading models achieve performance comparable to that of humans. We re-implemented a state-of-the-art model and built several variants of it. We then ran experiments on three questions: the effectiveness of an attention mechanism, the use of a more accurate residual network with pre-trained weights as the backbone, and the sensitivity of our model to audio input with and without noise.
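The abstract does not say how the noisy audio was produced for the sensitivity experiment. As a minimal sketch of one common approach, the snippet below mixes white Gaussian noise into a clean waveform at a chosen signal-to-noise ratio (SNR). The function names, the choice of white noise, the 440 Hz test tone, and the 10 dB level are all illustrative assumptions, not details taken from the paper.

```python
import math
import random


def mean_power(samples):
    """Average power of a waveform given as a list of floats."""
    return sum(x * x for x in samples) / len(samples)


def add_noise_at_snr(clean, snr_db, seed=0):
    """Return `clean` plus white Gaussian noise scaled to `snr_db` decibels.

    Hypothetical helper: the paper's actual noise source and SNR levels
    are not stated in the abstract.
    """
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    # SNR(dB) = 10 * log10(P_signal / P_noise), so solve for P_noise.
    target_noise_power = mean_power(clean) / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / mean_power(noise))
    return [c + scale * n for c, n in zip(clean, noise)]


def measured_snr_db(clean, noisy):
    """Recover the SNR in dB from a clean/noisy pair of equal length."""
    noise = [y - c for c, y in zip(clean, noisy)]
    return 10.0 * math.log10(mean_power(clean) / mean_power(noise))


# Stand-in for a speech clip: one second of a 440 Hz tone sampled at 16 kHz.
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noisy = add_noise_at_snr(clean, snr_db=10.0)
print(measured_snr_db(clean, noisy))
```

Evaluating the same trained model on `clean` and on `noisy` at several SNR levels would give one simple view of how much the audio stream's contribution degrades as noise increases.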

