Dialogue Quality and Emotion Annotations for Customer Support Conversations
This work addresses the uncertain generalisability of LLMs to multilingual and domain-specific dialogue applications, a concern for researchers and developers in conversational AI.
The paper tackles the lack of benchmarking datasets for evaluating large language models in bilingual customer support conversations by presenting a holistic annotation approach for emotion and conversational quality. It provides benchmarks for emotion recognition and dialogue quality estimation, concluding that further research is needed for production use.
Task-oriented conversational datasets often lack topic variability and linguistic diversity. With the advent of Large Language Models (LLMs) pretrained on extensive, multilingual and diverse text data, these limitations appear to have been overcome. Nevertheless, the generalisability of LLMs to different languages and domains in dialogue applications remains uncertain without benchmarking datasets. This paper presents a holistic annotation approach for emotion and conversational quality in the context of bilingual customer support conversations. By performing annotations that take into account all the instances that compose a conversation, one can form a broader perspective of the dialogue as a whole. Furthermore, this work provides a unique and valuable resource for the development of text classification models. To this end, we present benchmarks for Emotion Recognition and Dialogue Quality Estimation and show that further research is needed to leverage these models in a production setting.