Multimodal Contextual Dialogue Breakdown Detection for Conversational AI Models
This work addresses the need for high-precision breakdown detection in industry settings such as healthcare, where real-time corrective action is essential for task completion. The contribution appears incremental, however, because it builds on multimodal approaches.
The paper tackles dialogue breakdown detection in conversational AI systems by introducing a multimodal contextual model that processes both audio and transcribed text. The model achieves an F1 score of 69.27, outperforming the other best-known models.
Detecting dialogue breakdown in real time is critical for conversational AI systems because it enables corrective action so that a task can be completed successfully. In spoken dialogue systems, breakdown can be caused by a variety of unexpected situations, including high levels of background noise that cause speech-to-text (STT) mistranscriptions, or unexpected user flows. Industry settings like healthcare, in particular, require high precision and high flexibility to navigate differently based on the conversation history and dialogue states. This makes accurate detection of dialogue breakdown both more challenging and more critical. We found that accurate breakdown detection requires processing audio inputs along with downstream NLP model inferences on transcribed text in real time. In this paper, we introduce a Multimodal Contextual Dialogue Breakdown (MultConDB) model. This model significantly outperforms other known best models, achieving an F1 of 69.27.
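For readers less familiar with the reported metric, the following is a minimal sketch of how an F1 score on a 0-100 scale can be computed from per-turn breakdown labels. The abstract does not say how F1 is averaged or at what granularity it is measured, so the binary per-turn setup, the function name `f1_score`, and the toy labels below are all illustrative assumptions rather than the authors' evaluation protocol.

```python
def f1_score(gold, pred, positive=1):
    """Binary F1 for the `positive` class (here: 1 = breakdown)."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined),
        # so F1 is reported as 0 by convention.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-turn labels (not data from the paper):
# 1 = breakdown, 0 = no breakdown.
gold = [0, 1, 0, 0, 1, 1, 0, 1]
pred = [0, 1, 1, 0, 0, 1, 0, 1]

score = f1_score(gold, pred)
print(f"F1 = {100 * score:.2f}")
```

In this toy case there are three true positives, one false positive, and one false negative, so precision and recall are both 0.75. Because F1 balances the two, a detector cannot score well by flagging every turn as a breakdown or by flagging almost none.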