Exploring Textual and Speech information in Dialogue Act Classification with Speaker Domain Adaptation
This work addresses a practical issue for spoken dialog systems by improving classification under noisy conditions, though it is incremental in nature.
The paper tackles the problem of Dialogue Act classification in spoken dialog systems by exploring multimodal models that combine acoustic and textual signals, including noisy ASR transcriptions, and investigates speaker domain adaptation. The results show that the multimodal model outperforms unimodal models, especially when oracle transcriptions are unavailable, and the proposed adaptation method achieves competitive performance.
Despite the recent success of Dialogue Act (DA) classification, most prior work focuses on text-based classification with oracle transcriptions, i.e., human transcriptions, rather than Automatic Speech Recognition (ASR) transcriptions. In spoken dialog systems, however, the agent has access only to noisy ASR transcriptions, and performance may degrade further under domain shift. In this paper, we explore the effectiveness of using both acoustic and textual signals, with either oracle or ASR transcriptions, and investigate speaker domain adaptation for DA classification. Our multimodal model proves superior to the unimodal models, particularly when oracle transcriptions are not available. We also propose an effective method for speaker domain adaptation, which achieves competitive results.