End2End Acoustic to Semantic Transduction
This work addresses spoken language understanding for applications such as voice assistants; its contribution is incremental, building on existing attention-based methods with targeted improvements.
The paper proposes an end-to-end sequence-to-sequence model for spoken language understanding that directly transduces acoustic features into semantic content. With a shallow fusion language model, it achieves a 13.6 concept error rate (CER) and an 18.5 concept value error rate (CVER) on the French MEDIA corpus, an absolute reduction of 2.8 points compared to the state of the art.
In this paper, we propose a novel end-to-end sequence-to-sequence spoken language understanding model using an attention mechanism. It reliably selects contextual acoustic features in order to hypothesize semantic contents. An initial architecture capable of extracting all pronounced words and concepts from acoustic spans is designed and tested. With a shallow fusion language model, this system reaches a 13.6 concept error rate (CER) and an 18.5 concept value error rate (CVER) on the French MEDIA corpus, an absolute reduction of 2.8 points compared to the state of the art. Then, an original model is proposed for hypothesizing concepts and their values. This transduction reaches a 15.4 CER and a 21.6 CVER without any new type of context.
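
To illustrate the shallow fusion mentioned above, the following is a minimal sketch of the usual formulation, in which the score of each candidate token at a decoding step is the sequence-to-sequence model's log-probability plus a weighted log-probability from an external language model. The abstract does not give the fusion weight, the vocabulary, or the decoding procedure, so the function names (`fuse_scores`, `pick_token`), the tokens, the probabilities, and the weight of 0.5 are all assumptions made for illustration only.

```python
import math


def fuse_scores(model_log_probs, lm_log_probs, lm_weight):
    """Return fused scores: log p_model(token) + lm_weight * log p_lm(token)."""
    return {
        token: model_lp + lm_weight * lm_log_probs[token]
        for token, model_lp in model_log_probs.items()
    }


def pick_token(model_log_probs, lm_log_probs, lm_weight):
    """Pick the token with the highest fused score at one decoding step."""
    fused = fuse_scores(model_log_probs, lm_log_probs, lm_weight)
    return max(fused, key=fused.get)


# Hypothetical next-token distributions over a tiny vocabulary that mixes
# words and concept tags (token names and probabilities are made up).
model_lp = {"<hotel>": math.log(0.5), "<city>": math.log(0.4), "paris": math.log(0.1)}
lm_lp = {"<hotel>": math.log(0.1), "<city>": math.log(0.7), "paris": math.log(0.2)}

print(pick_token(model_lp, lm_lp, lm_weight=0.0))  # sequence-to-sequence model alone
print(pick_token(model_lp, lm_lp, lm_weight=0.5))  # with the language model contribution
```

With these made-up numbers, the selected token changes from `<hotel>` to `<city>` once the language model is weighted in. In a real system the same combination would typically be applied to every hypothesis inside beam search, with the weight tuned on development data.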