CL, AI, LG, SD, AS | Jan 26, 2024

Turn-taking and Backchannel Prediction with Acoustic and Large Language Model Fusion

arXiv:2401.14717v1 | 24 citations | ICASSP
Originality: Incremental advance
AI Analysis

This work addresses the challenge of making human-AI conversational interaction more natural. The advance appears incremental, since it builds on existing models and combines them through a fusion approach.

The paper tackles the problem of predicting turn-taking and backchannel locations in spoken dialogue by fusing neural acoustic models with large language models, achieving consistent improvements over single-modality baselines on the Switchboard dataset.

We propose an approach for continuous prediction of turn-taking and backchanneling locations in spoken dialogue by fusing a neural acoustic model with a large language model (LLM). Experiments on the Switchboard human-human conversation dataset demonstrate that our approach consistently outperforms the baseline models with single modality. We also develop a novel multi-task instruction fine-tuning strategy to further benefit from LLM-encoded knowledge for understanding the tasks and conversational contexts, leading to additional improvements. Our approach demonstrates the potential of combined LLMs and acoustic models for a more natural and conversational interaction between humans and speech-enabled AI agents.
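The fusion idea in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' architecture: the fusion is shown as simple concatenation of an acoustic embedding and an LLM text embedding followed by one linear layer, the third label "continue" is a hypothetical no-event class, the weights are random stand-ins for a trained layer, and the helper names (`make_linear`, `fuse_and_predict`) are invented.

```python
import math
import random

# Hypothetical label set: the abstract names turn-taking and backchannel;
# "continue" (no event at this point) is an assumed third class.
LABELS = ["continue", "backchannel", "turn_taking"]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_linear(in_dim, out_dim, seed=0):
    """Random weights standing in for a trained fusion layer (assumption)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def fuse_and_predict(acoustic_emb, text_emb, weights, bias):
    """Concatenate both embeddings and score each label with one linear layer."""
    fused = list(acoustic_emb) + list(text_emb)
    if any(len(row) != len(fused) for row in weights):
        raise ValueError("weight rows must match the fused embedding size")
    scores = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    return dict(zip(LABELS, softmax(scores)))


# Toy embeddings; real ones would come from the acoustic model and the LLM.
acoustic_emb = [0.2, -0.1, 0.4, 0.0]
text_emb = [0.3, 0.1, -0.2]
weights, bias = make_linear(len(acoustic_emb) + len(text_emb), len(LABELS))
prediction = fuse_and_predict(acoustic_emb, text_emb, weights, bias)
print(max(prediction, key=prediction.get))
```

For continuous prediction, a function like this would be called at each candidate point in the audio stream, with the text embedding computed from the transcript so far. Concatenation is only one of several plausible fusion choices.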
