Lla-VAP: LSTM Ensemble of Llama and VAP for Turn-Taking Prediction
This work addresses turn-taking prediction for conversational AI systems. It appears incremental: it builds on existing strategies and does not claim a major breakthrough.
The paper tackles turn-taking prediction in conversation by integrating large language models and voice activity projection models in a multi-modal ensemble. It aims to improve accuracy and efficiency on the ICC and CCPE datasets, but no concrete results or numbers are provided.
Turn-taking prediction is the task of anticipating when the current speaker in a conversation will yield the turn so that another speaker can begin. This project expands on existing strategies for turn-taking prediction by employing a multi-modal ensemble approach that integrates large language models (LLMs) and voice activity projection (VAP) models. By combining the linguistic capabilities of LLMs with the temporal precision of VAP models, we aim to improve the accuracy and efficiency of identifying transition relevance places (TRPs) in both scripted and unscripted conversational scenarios. Our methods are evaluated on the In-Conversation Corpus (ICC) and Coached Conversational Preference Elicitation (CCPE) datasets, highlighting the strengths and limitations of current models while proposing a potentially more robust framework for enhanced prediction.
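To make the ensemble idea concrete, the sketch below shows one way an LSTM could fuse two per-step score streams, one from an LLM and one from a VAP model, into a single turn-yield probability. It is an illustration under stated assumptions, not the architecture used in this work: the class name TinyLSTMFusion, the two-scalar input per step, the hidden size, and the randomly initialized (untrained) weights are all hypothetical. A real system would learn the weights from labeled conversations.

```python
import math
import random


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTMFusion:
    """Hypothetical single-layer LSTM that fuses two score streams.

    At each step the input is [p_llm, p_vap]: the LLM's and the VAP
    model's estimate that the current speaker is about to yield the turn.
    Weights are random and untrained; they only demonstrate the data flow.
    """

    GATES = ("i", "f", "o", "g")  # input, forget, output, candidate

    def __init__(self, hidden_size=4, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        n_in = 2 + hidden_size  # two scores plus the previous hidden state
        self.W = {
            g: [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                for _ in range(hidden_size)]
            for g in self.GATES
        }
        self.b = {g: [0.0] * hidden_size for g in self.GATES}
        # Linear readout from the hidden state to one fused logit.
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
        self.b_out = 0.0

    def _gate(self, name, x, activation):
        """Apply one gate: activation(W[name] @ x + b[name])."""
        return [
            activation(sum(w * v for w, v in zip(row, x)) + bias)
            for row, bias in zip(self.W[name], self.b[name])
        ]

    def forward(self, llm_probs, vap_probs):
        """Return one fused probability per time step."""
        if len(llm_probs) != len(vap_probs):
            raise ValueError("score streams must have the same length")
        h = [0.0] * self.hidden_size  # hidden state
        c = [0.0] * self.hidden_size  # cell state
        fused = []
        for p_llm, p_vap in zip(llm_probs, vap_probs):
            x = [p_llm, p_vap] + h
            i = self._gate("i", x, sigmoid)
            f = self._gate("f", x, sigmoid)
            o = self._gate("o", x, sigmoid)
            g = self._gate("g", x, math.tanh)
            # Standard LSTM update: keep part of the old cell, add new content.
            c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
            h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
            logit = sum(w * hk for w, hk in zip(self.w_out, h)) + self.b_out
            fused.append(sigmoid(logit))
        return fused


if __name__ == "__main__":
    # Made-up scores for five consecutive steps of one conversation.
    llm_scores = [0.05, 0.10, 0.20, 0.70, 0.90]
    vap_scores = [0.10, 0.10, 0.30, 0.60, 0.85]
    model = TinyLSTMFusion(hidden_size=4, seed=0)
    for step, p in enumerate(model.forward(llm_scores, vap_scores)):
        print(f"step {step}: fused turn-yield probability = {p:.3f}")
```

Because the weights are untrained, the printed probabilities carry no meaning beyond showing the shape of the computation. The recurrent state is what distinguishes this from simple score averaging: the fused estimate at each step can depend on how both streams evolved over earlier steps.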