CL · HC · SD · AS · Jan 10, 2024

Real-time and Continuous Turn-taking Prediction Using Voice Activity Projection

arXiv:2401.04868v1 · 24 citations · h-index: 19 · Has Code
Originality: Synthesis-oriented
AI Analysis

This paper addresses the problem of enabling smooth conversational interaction in applications such as virtual assistants, though it appears incremental because it builds on existing methods such as contrastive predictive coding and transformers.

The paper tackles real-time turn-taking prediction in dialogue with a system based on a voice activity projection (VAP) model that maps stereo audio to future voice activities, and demonstrates that the system can run in real time on a CPU with minimal performance degradation.

A demonstration of a real-time and continuous turn-taking prediction system is presented. The system is based on a voice activity projection (VAP) model, which directly maps dialogue stereo audio to future voice activities. The VAP model includes contrastive predictive coding (CPC) and self-attention transformers, followed by a cross-attention transformer. We examine the effect of the input context audio length and demonstrate that the proposed system can operate in real-time with CPU settings, with minimal performance degradation.
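The abstract's two operational ideas, a bounded input context and continuous prediction over streaming stereo audio, can be illustrated with a small sketch. This is not the authors' implementation: `SlidingContext`, `run_stream`, and `energy_stub` are hypothetical names, the 16 kHz default sample rate is an assumption, and `energy_stub` is only a stand-in for the VAP model (it returns each channel's share of signal energy, not a turn-taking prediction).

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Sequence, Tuple

# One stereo sample: (speaker A channel, speaker B channel).
Frame = Tuple[float, float]


class SlidingContext:
    """Keeps only the most recent `context_sec` seconds of stereo audio."""

    def __init__(self, sample_rate: int, context_sec: float) -> None:
        self.max_samples = int(sample_rate * context_sec)
        self.buffer: Deque[Frame] = deque(maxlen=self.max_samples)

    def push(self, chunk: Sequence[Frame]) -> None:
        # A deque with maxlen drops the oldest samples once it is full,
        # so the context length stays bounded as audio keeps arriving.
        self.buffer.extend(chunk)

    def window(self) -> List[Frame]:
        return list(self.buffer)


def energy_stub(window: Sequence[Frame]) -> Tuple[float, float]:
    """Hypothetical placeholder for the model (NOT the VAP model).

    Returns each channel's share of the total energy in the window.
    """
    e_a = sum(a * a for a, _ in window)
    e_b = sum(b * b for _, b in window)
    total = e_a + e_b
    if total == 0.0:
        return (0.5, 0.5)
    return (e_a / total, e_b / total)


def run_stream(
    chunks: Iterable[Sequence[Frame]],
    predictor: Callable[[Sequence[Frame]], Tuple[float, float]],
    sample_rate: int = 16000,  # assumed value, not taken from the source
    context_sec: float = 1.0,
) -> List[Tuple[float, float]]:
    """Feeds audio chunk by chunk and emits one prediction per chunk."""
    ctx = SlidingContext(sample_rate, context_sec)
    outputs = []
    for chunk in chunks:
        ctx.push(chunk)
        outputs.append(predictor(ctx.window()))
    return outputs
```

In a sketch like this, shortening `context_sec` reduces the amount of audio the predictor must process at each step, which is the kind of trade-off between context length and real-time cost that the abstract says the authors examine.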
