AS SD Jul 28, 2020

Multimodal Integration for Large-Vocabulary Audio-Visual Speech Recognition

Wentao Yu, Steffen Zeiler, Dorothea Kolossa

arXiv:2007.14223v13.32 citations

Originality: Incremental advance

AI Analysis

This work addresses the problem of robust speech recognition in noisy environments for applications like assistive technologies, though it is incremental in refining existing multimodal integration methods.

The paper tackled the challenge of effectively integrating audio and visual information for large-vocabulary speech recognition, finding that dynamic stream reliability indicators significantly improve accuracy when audio is distorted, with specific gains demonstrated on the LRS2 database.

For many small- and medium-vocabulary tasks, audio-visual speech recognition can significantly improve the recognition rates compared to audio-only systems. However, there is still an ongoing debate regarding the best combination strategy for multi-modal information, which should allow for the translation of these gains to large-vocabulary recognition. While an integration at the level of state-posterior probabilities, using dynamic stream weighting, is almost universally helpful for small-vocabulary systems, in large-vocabulary speech recognition, the recognition accuracy remains difficult to improve. In the following, we specifically consider the large-vocabulary task of the LRS2 database, and we investigate a broad range of integration strategies, comparing early integration and end-to-end learning with many versions of hybrid recognition and dynamic stream weighting. One aspect, which is shown to provide much benefit here, is the use of dynamic stream reliability indicators, which allow for hybrid architectures to strongly profit from the inclusion of visual information whenever the audio channel is distorted even slightly.
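As an illustration of what integration at the level of state-posterior probabilities with dynamic stream weighting can look like, here is a minimal sketch of a log-linear combination of per-frame audio and video state log-posteriors. It is not the authors' implementation: the function name `fuse_stream_posteriors`, the two-stream setup, and the assumption that a per-frame audio weight between 0 and 1 is already available (for example, estimated from reliability indicators) are all hypothetical choices made for this example.

```python
import math


def fuse_stream_posteriors(log_post_audio, log_post_video, audio_weights):
    """Combine two streams of state log-posteriors frame by frame.

    log_post_audio, log_post_video: lists of frames, each frame a list of
        log-posteriors over the same set of states (hypothetical inputs).
    audio_weights: one weight per frame; values are clipped to [0, 1].
        The video stream receives the complementary weight (1 - weight).
    Returns renormalized fused log-posteriors with the same shape.
    """
    fused = []
    for frame_a, frame_v, weight in zip(log_post_audio, log_post_video, audio_weights):
        lam = min(max(weight, 0.0), 1.0)
        # Log-linear combination: a weighted sum in the log domain.
        scores = [lam * a + (1.0 - lam) * v for a, v in zip(frame_a, frame_v)]
        # Renormalize with a numerically stable log-sum-exp.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        fused.append([s - log_norm for s in scores])
    return fused


# Toy example: two frames, three states. The second frame is treated as
# having unreliable audio, so it gets a low audio weight.
audio = [[math.log(p) for p in frame] for frame in [[0.7, 0.2, 0.1], [0.4, 0.4, 0.2]]]
video = [[math.log(p) for p in frame] for frame in [[0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]]
fused = fuse_stream_posteriors(audio, video, audio_weights=[0.9, 0.1])
```

With a weight near 1 the fused posterior follows the audio stream, and with a weight near 0 it follows the video stream, which is the behavior a time-varying weight is meant to exploit when the audio channel is distorted.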
