Deep Lip Reading: a comparison of models and an online application
This work addresses lip reading (visual speech recognition), comparing three model architectures on accuracy and training time and evaluating one of them for online, real-time use.
The paper develops and compares three lip reading models (LSTM, fully convolutional, and transformer); the best model improves the state-of-the-art word error rate on the LRS2 benchmark by over 20%.
The goal of this paper is to develop state-of-the-art models for lip reading, that is, visual speech recognition. We develop three architectures and compare their accuracy and training times: (i) a recurrent model using LSTMs; (ii) a fully convolutional model; and (iii) the recently proposed transformer model. The recurrent and fully convolutional models are trained with a Connectionist Temporal Classification (CTC) loss and use an explicit language model for decoding, whereas the transformer is a sequence-to-sequence model. Our best performing model improves the state-of-the-art word error rate on the challenging BBC-Oxford Lip Reading Sentences 2 (LRS2) benchmark dataset by over 20 percent. As a further contribution, we investigate the fully convolutional model when used for online (real-time) lip reading of continuous speech, and show that it achieves high performance with low latency.
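
Since the headline result is reported as a word error rate (WER), a minimal sketch of how that metric is commonly computed may help: the word-level edit distance (substitutions, deletions, insertions) between the reference transcript and the predicted one, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the abstract does not describe the paper's own evaluation code or text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative sketch only: tokenization is plain whitespace splitting,
    which is an assumption and not taken from the paper.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.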