CLAS | Dec 23, 2023

Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue

arXiv:2312.15316v2 | 36 citations | ICASSP
Originality: Incremental advance
AI Analysis

This work targets natural, human-like spoken conversation by enhancing LLMs with paralinguistic cues. It is rated incremental because it builds on existing multimodal and multitasking approaches.

The paper addresses the tendency of standard LLMs to ignore paralinguistic information, such as sentiment, in spoken dialogue. It proposes ParalinGPT, a multimodal LLM that integrates text and speech to model these attributes, and reports relative improvements of 6.7% in current sentiment accuracy, 12.0% in response sentiment accuracy, and 3.5% in response text BLEU score.

Large Language Models (LLMs) have demonstrated superior abilities in tasks such as chatting, reasoning, and question-answering. However, standard LLMs may ignore crucial paralinguistic information, such as sentiment, emotion, and speaking style, which are essential for achieving natural, human-like spoken conversation, especially when such information is conveyed by acoustic cues. We therefore propose Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT), an LLM that utilizes text and speech modalities to better model the linguistic content and paralinguistic attributes of spoken dialogue. The model takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking multimodal framework. Specifically, our framework serializes tasks in the order of current paralinguistic attribute prediction, response paralinguistic attribute prediction, and response text generation with autoregressive conditioning. We utilize the Switchboard-1 corpus, including its sentiment labels as the paralinguistic attribute, as our spoken dialogue dataset. Experimental results indicate the proposed serialized multitasking method outperforms typical sequence classification techniques on current and response sentiment classification. Furthermore, leveraging conversational context and speech embeddings significantly improves both response text generation and sentiment prediction. Our proposed framework achieves relative improvements of 6.7%, 12.0%, and 3.5% in current sentiment accuracy, response sentiment accuracy, and response text BLEU score, respectively.
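
The serialized task order described in the abstract (current sentiment, then response sentiment, then response text) can be illustrated with a minimal text-only sketch. This is not the authors' implementation: the marker strings such as `<cur_sent>` and `<resp_text>`, the `Turn` class, and the `serialize_example` function are hypothetical names chosen for illustration, and speech embeddings are omitted entirely.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Turn:
    """One dialogue turn with its sentiment label (hypothetical structure)."""
    text: str
    sentiment: str  # e.g. "positive", "neutral", "negative"


def serialize_example(
    history: List[Turn],
    current_text: str,
    current_sentiment: str,
    response: Turn,
) -> Tuple[str, str]:
    """Build a (prompt, target) pair for autoregressive training.

    The prompt holds the conversational context and the current turn.
    The target lists the three tasks in a fixed order, so that each
    later prediction is conditioned on the earlier ones.
    All marker strings are assumptions, not taken from the paper.
    """
    # Past turns carry both their text and their sentiment label.
    context = " ".join(f"<turn> {t.text} <sent> {t.sentiment}" for t in history)
    prompt = f"{context} <turn> {current_text}".strip()
    # Task order: current sentiment -> response sentiment -> response text.
    target = (
        f"<cur_sent> {current_sentiment} "
        f"<resp_sent> {response.sentiment} "
        f"<resp_text> {response.text}"
    )
    return prompt, target


history = [Turn("how was the trip", "neutral")]
prompt, target = serialize_example(
    history, "it was wonderful", "positive", Turn("glad to hear that", "positive")
)
print(prompt)
print(target)
```

Because the three outputs sit in one sequence, a decoder trained on `prompt + target` generates the response text only after it has committed to both sentiment labels, which is the autoregressive conditioning the abstract refers to.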
