AS | CL | SD | Oct 17, 2024

On the Use of Audio to Improve Dialogue Policies

arXiv:2410.13385v1 | h-index: 2 | IberSPEECH
Originality: Incremental advance
AI Analysis

This work addresses a limitation of spoken dialogue systems: they ignore the extralinguistic information carried in the user's audio. Using that information is particularly beneficial in noisy transcription scenarios.

The paper tackles the problem of dialogue policies relying solely on text transcriptions by proposing architectures that combine audio and text embeddings, resulting in a 9.8% relative improvement in User Request Score on the DSTC2 dataset.
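To make the term "relative improvement" concrete, the following minimal sketch computes it as the gain divided by the baseline score. The two scores in the example are hypothetical placeholders chosen only so the arithmetic yields 9.8%; they are not the paper's reported absolute values, and `relative_improvement` is an illustrative helper name.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the gain over the baseline as a fraction of the baseline."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return (improved - baseline) / baseline


# Hypothetical scores (NOT taken from the paper): a text-only policy at 0.500
# and an audio-aware policy at 0.549 on the same metric.
text_only_score = 0.500
audio_aware_score = 0.549

gain = relative_improvement(text_only_score, audio_aware_score)
print(f"relative improvement: {gain:.1%}")
```

A relative figure depends on the baseline: the same 9.8% corresponds to a smaller absolute gain when the baseline score is low and a larger one when it is high.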

With the significant progress of speech technologies, spoken goal-oriented dialogue systems are becoming increasingly popular. One of the main modules of a dialogue system is typically the dialogue policy, which is responsible for determining system actions. This component usually relies only on audio transcriptions, so it depends strongly on their quality and ignores important extralinguistic information embedded in the user's speech. In this paper, we propose new architectures that add audio information by combining speech and text embeddings using a Double Multi-Head Attention component. Our experiments show that audio embedding-aware dialogue policies outperform text-based ones, particularly in noisy transcription scenarios, and that the way text and audio embeddings are combined is crucial to performance. We obtained a 9.8% relative improvement in the User Request Score over a text-only dialogue system on the DSTC2 dataset.
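The idea of fusing a text embedding and an audio embedding with two stacked attention steps can be sketched as follows. This is a generic two-stage attention pooling written in plain Python under stated assumptions, not the paper's architecture: the embedding size, the number of heads, the random query vectors (which would be learned parameters in a real model), and the names `attention_pool` and `double_attention_fuse` are all illustrative.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, query):
    """Weighted average of `vectors`; weights are a softmax of scaled dot products with `query`."""
    scale = math.sqrt(len(query))
    scores = [sum(x * q for x, q in zip(v, query)) / scale for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
    return pooled, weights


def double_attention_fuse(embeddings, head_queries, second_query):
    """Stage 1: split each embedding into heads and pool across inputs per head.
    Stage 2: pool across the resulting per-head context vectors."""
    n_heads = len(head_queries)
    head_dim = len(embeddings[0]) // n_heads
    contexts = []
    for h, query in enumerate(head_queries):
        chunks = [e[h * head_dim:(h + 1) * head_dim] for e in embeddings]
        context, _ = attention_pool(chunks, query)
        contexts.append(context)
    return attention_pool(contexts, second_query)


random.seed(0)
dim, n_heads = 8, 2  # assumed sizes, for illustration only
text_emb = [random.gauss(0, 1) for _ in range(dim)]   # stand-in for a text embedding
audio_emb = [random.gauss(0, 1) for _ in range(dim)]  # stand-in for an audio embedding
head_queries = [[random.gauss(0, 1) for _ in range(dim // n_heads)] for _ in range(n_heads)]
second_query = [random.gauss(0, 1) for _ in range(dim // n_heads)]

fused, head_weights = double_attention_fuse([text_emb, audio_emb], head_queries, second_query)
print(len(fused), head_weights)
```

The fused vector has the per-head dimension and would feed the dialogue policy in place of a text-only representation. In this sketch the first stage decides, per head, how much to trust text versus audio, and the second stage decides how much each head contributes.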

Code Implementations: 1 repo