Is one brick enough to break the wall of spoken dialogue state tracking?
The work addresses the cascading errors and separate optimization of traditional cascade systems, with the aim of smoother dialogue interactions, though the contribution appears incremental because it builds on existing end-to-end methods.
The paper tackles spoken dialogue state tracking in task-oriented dialogue systems by proposing a novel, completely neural approach and showing that it is competitive with state-of-the-art cascade methods, especially in audio-native settings.
In Task-Oriented Dialogue (TOD) systems, correctly updating the system's understanding of the user's requests (\textit{a.k.a.} Dialogue State Tracking, DST) is key to a smooth interaction. Traditionally, TOD systems perform this update in three steps: transcription of the user's utterance, semantic extraction of the key concepts, and contextualization with the previously identified concepts. Such cascade approaches suffer from cascading errors and separate optimization. End-to-end approaches have proven helpful up to the turn-level semantic extraction step. This paper goes one step further and provides (1) a novel approach for completely neural spoken DST, (2) an in-depth comparison with a state-of-the-art cascade approach, and (3) avenues towards better context propagation. Our study highlights that jointly optimized approaches are also competitive for contextually dependent tasks, such as DST, especially in audio-native settings. Context propagation in DST systems could benefit from training procedures that account for the inherent uncertainty of the previous context.