Real-time and Continuous Turn-taking Prediction Using Voice Activity Projection
This work addresses the problem of enabling smooth conversational interaction in applications such as virtual assistants. The contribution appears incremental, however, since it builds on existing methods such as contrastive predictive coding and transformers.
The paper tackles real-time turn-taking prediction in dialogue with a system based on a voice activity projection (VAP) model that maps stereo audio to future voice activity, and demonstrates that the system can run in real time on a CPU with minimal performance degradation.
A demonstration of a real-time and continuous turn-taking prediction system is presented. The system is based on a voice activity projection (VAP) model, which directly maps dialogue stereo audio to future voice activities. The VAP model includes contrastive predictive coding (CPC) and self-attention transformers, followed by a cross-attention transformer. We examine the effect of the input context audio length and demonstrate that the proposed system can operate in real time on a CPU, with minimal performance degradation.
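
To make the role of the input context length concrete, the following is a minimal sketch of a streaming loop that keeps only the most recent seconds of stereo audio and calls a predictor on that window after every incoming chunk. It is not the authors' implementation: `ContextBuffer`, `placeholder_predictor`, and `stream` are hypothetical names, the predictor is a trivial energy-based stand-in for the VAP model, and the real-time factor (processing time divided by chunk duration) is an assumed way of checking whether inference keeps up with the audio.

```python
import time
from collections import deque
from typing import Callable, Deque, Iterable, List, Sequence, Tuple

Frame = Tuple[float, float]  # one stereo sample: (speaker A, speaker B)


class ContextBuffer:
    """Hypothetical helper: retains only the last `context_sec` seconds of audio."""

    def __init__(self, sample_rate: int, context_sec: float) -> None:
        self.sample_rate = sample_rate
        self.max_samples = int(sample_rate * context_sec)
        # A bounded deque silently drops the oldest samples once it is full.
        self._buf: Deque[Frame] = deque(maxlen=self.max_samples)

    def push(self, chunk: Sequence[Frame]) -> None:
        self._buf.extend(chunk)

    def window(self) -> List[Frame]:
        return list(self._buf)


def placeholder_predictor(window: Sequence[Frame]) -> Tuple[float, float]:
    """Stand-in for the VAP model (assumption): each channel's share of energy."""
    a = sum(abs(s[0]) for s in window)
    b = sum(abs(s[1]) for s in window)
    total = a + b
    if total == 0.0:
        return (0.5, 0.5)  # no evidence for either speaker
    return (a / total, b / total)


def stream(
    chunks: Iterable[Sequence[Frame]],
    buffer: ContextBuffer,
    predict: Callable[[Sequence[Frame]], Tuple[float, float]],
) -> List[Tuple[Tuple[float, float], float]]:
    """Run `predict` after each chunk; return (prediction, real-time factor) pairs."""
    results = []
    for chunk in chunks:
        buffer.push(chunk)
        start = time.perf_counter()
        prediction = predict(buffer.window())
        elapsed = time.perf_counter() - start
        chunk_sec = len(chunk) / buffer.sample_rate
        # A real-time factor below 1.0 means inference finished before
        # the next chunk of the same length would arrive.
        results.append((prediction, elapsed / chunk_sec))
    return results
```

A shorter context means less audio to process per step, which lowers the cost of each prediction but gives the model less history to work with. This is the trade-off that the context-length experiment examines.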