CL AI LG SD AS

Jan 26, 2024

Turn-taking and Backchannel Prediction with Acoustic and Large Language Model Fusion

Jinhan Wang, Long Chen, Aparna Khare, Anirudh Raju, Pranav Dheram, Di He, Minhua Wu, Andreas Stolcke, Venkatesh Ravichandran

arXiv:2401.14717v19.624 citations

ICASSP

Originality: Incremental advance

AI Analysis

This work addresses the challenge of enabling more natural conversational interaction between humans and AI. The contribution appears incremental, since it builds on existing acoustic and language models by fusing them.

The paper tackles the problem of predicting turn-taking and backchannel locations in spoken dialogue by fusing neural acoustic models with large language models, achieving consistent improvements over single-modality baselines on the Switchboard dataset.

We propose an approach for continuous prediction of turn-taking and backchanneling locations in spoken dialogue by fusing a neural acoustic model with a large language model (LLM). Experiments on the Switchboard human-human conversation dataset demonstrate that our approach consistently outperforms the baseline models with single modality. We also develop a novel multi-task instruction fine-tuning strategy to further benefit from LLM-encoded knowledge for understanding the tasks and conversational contexts, leading to additional improvements. Our approach demonstrates the potential of combined LLMs and acoustic models for a more natural and conversational interaction between humans and speech-enabled AI agents.
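To make the fusion idea in the abstract concrete, here is a minimal sketch of one way two modalities could be combined for a single prediction step. It is not the authors' implementation. The late-fusion-by-concatenation design, the three-way label set (including a "continue" class), the embedding sizes, and the random weights are all assumptions made for illustration; in a real system the embeddings would come from a trained acoustic model and an LLM.

```python
import math
import random

# Assumed label set: the abstract names turn-taking and backchanneling;
# the "continue" (neither) class is added here for illustration.
CLASSES = ["continue", "backchannel", "turn-taking"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(acoustic_emb, text_emb, weights, bias):
    """Hypothetical late fusion: concatenate both embeddings,
    apply one linear layer, and return class probabilities."""
    fused = acoustic_emb + text_emb  # list concatenation
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Stand-in values: random vectors replace real model outputs.
rng = random.Random(0)
acoustic_emb = [rng.uniform(-1, 1) for _ in range(4)]  # assumed size 4
text_emb = [rng.uniform(-1, 1) for _ in range(6)]      # assumed size 6
weights = [[rng.uniform(-0.5, 0.5) for _ in range(10)] for _ in CLASSES]
bias = [0.0] * len(CLASSES)

probs = fuse_and_classify(acoustic_emb, text_emb, weights, bias)
prediction = CLASSES[probs.index(max(probs))]
print(dict(zip(CLASSES, probs)), prediction)
```

For continuous prediction, a function like `fuse_and_classify` would be called repeatedly as the dialogue unfolds, each time with embeddings computed from the audio and transcript observed so far.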
