Turn-taking and Backchannel Prediction with Acoustic and Large Language Model Fusion
This work aims to make human-AI conversational interaction more natural, though the contribution appears incremental: it builds on existing acoustic and language models and combines them through fusion.
The paper tackles the problem of predicting turn-taking and backchannel locations in spoken dialogue by fusing neural acoustic models with large language models, achieving consistent improvements over single-modality baselines on the Switchboard dataset.
We propose an approach for continuous prediction of turn-taking and backchanneling locations in spoken dialogue by fusing a neural acoustic model with a large language model (LLM). Experiments on the Switchboard human-human conversation dataset demonstrate that our approach consistently outperforms single-modality baseline models. We also develop a novel multi-task instruction fine-tuning strategy that draws further on LLM-encoded knowledge to understand the tasks and conversational contexts, leading to additional improvements. Our approach demonstrates the potential of combining LLMs and acoustic models for more natural, conversational interaction between humans and speech-enabled AI agents.
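The abstract does not specify how the two models are fused, so the following is only a minimal sketch of one common option, late fusion by concatenation, and should not be read as the authors' architecture. It assumes each model has already produced a fixed-size embedding for the current point in the dialogue. The function `fuse_and_classify`, the three-way label set (including an assumed "continue" class for no event), the toy dimensions, and the random weights are all hypothetical stand-ins chosen for illustration.

```python
import math
import random

# Hypothetical label set: the source names turn-taking and backchanneling;
# "continue" is an assumed class meaning "no event at this point".
LABELS = ["continue", "backchannel", "turn_taking"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(acoustic_vec, text_vec, weights, bias):
    """Concatenate both embeddings, then apply one linear layer and softmax."""
    fused = list(acoustic_vec) + list(text_vec)
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Toy dimensions and random parameters, for illustration only (assumptions).
rng = random.Random(0)
ACOUSTIC_DIM, TEXT_DIM = 4, 6
weights = [
    [rng.uniform(-1, 1) for _ in range(ACOUSTIC_DIM + TEXT_DIM)]
    for _ in LABELS
]
bias = [0.0] * len(LABELS)

# Stand-ins for the outputs of an acoustic model and an LLM encoder.
acoustic_vec = [rng.uniform(-1, 1) for _ in range(ACOUSTIC_DIM)]
text_vec = [rng.uniform(-1, 1) for _ in range(TEXT_DIM)]

probs = fuse_and_classify(acoustic_vec, text_vec, weights, bias)
prediction = LABELS[probs.index(max(probs))]
```

In a continuous-prediction setting, a function like this would be called repeatedly as the dialogue unfolds, with the two embeddings refreshed at each step. A trained system would learn `weights` and `bias` from labeled conversations rather than sampling them at random.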