Triadic Multi-party Voice Activity Projection for Turn-taking in Spoken Dialogue Systems
This work addresses turn-taking for spoken dialogue systems in multi-party scenarios, but the contribution is incremental: it extends an existing method to a new data type.
This work tackled the problem of predicting turn-taking in triadic multi-party conversations by extending voice activity projection (VAP) from dyadic to triadic settings. It found that VAP trained on triadic data outperformed baselines, with accuracy varying by conversation type.
Turn-taking is a fundamental component of spoken dialogue; however, conventional studies mostly involve dyadic settings. This work focuses on applying voice activity projection (VAP) to predict upcoming turn-taking in triadic multi-party scenarios. The goal of VAP models is to predict the future voice activity of each speaker using only acoustic data. This is the first study to extend VAP to triadic conversation. We trained multiple models on a Japanese triadic dataset in which participants discussed a variety of topics. We found that VAP models trained on triadic conversation outperformed the baseline in all cases, but that the type of conversation affected the accuracy. This study establishes that VAP can be used for turn-taking in triadic dialogue scenarios. Future work will incorporate this triadic VAP turn-taking model into spoken dialogue systems.
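
To make the VAP objective concrete, the following minimal sketch shows one way a future voice activity window for three speakers could be turned into a single discrete prediction target. It is an illustration under stated assumptions, not the authors' implementation: the 50 Hz frame rate, the four bin lengths (0.2, 0.4, 0.6, 0.8 s), the "active for at least half the bin" rule, and the names `bin_edges` and `projection_label` are all hypothetical choices, loosely modeled on how dyadic VAP targets are commonly described. The abstract does not specify any of these details.

```python
from typing import List, Sequence

# Hypothetical settings (assumptions, not taken from the source).
FRAME_HZ = 50                        # voice activity frames per second
BIN_SECONDS = (0.2, 0.4, 0.6, 0.8)   # bin lengths covering a 2 s future window


def bin_edges(frame_hz: int = FRAME_HZ,
              bin_seconds: Sequence[float] = BIN_SECONDS) -> List[int]:
    """Return cumulative frame offsets that delimit the future bins."""
    edges = [0]
    for sec in bin_seconds:
        edges.append(edges[-1] + round(sec * frame_hz))
    return edges


def projection_label(va: Sequence[Sequence[int]], t: int) -> int:
    """Encode the voice activity following frame t as one integer state.

    va[s][f] is 1 if speaker s is speaking at frame f, else 0.
    Each (speaker, bin) pair contributes one bit, set to 1 when the
    speaker is active for at least half of the frames in that bin.
    """
    edges = bin_edges()
    if t + 1 + edges[-1] > len(va[0]):
        raise ValueError("not enough future frames after t")
    state = 0
    for speaker in va:
        for start, end in zip(edges[:-1], edges[1:]):
            window = speaker[t + 1 + start: t + 1 + end]
            bit = 1 if 2 * sum(window) >= len(window) else 0
            state = (state << 1) | bit
    return state


if __name__ == "__main__":
    n_frames = 101
    speaker_a = [1] * n_frames              # speaks throughout
    speaker_b = [0] * n_frames              # silent throughout
    speaker_c = [0] * 61 + [1] * 40         # starts speaking in the last bin
    label = projection_label([speaker_a, speaker_b, speaker_c], t=0)
    print(label, format(label, "012b"))
```

Under these assumptions, three speakers with four bins each give 12 bits, so a model trained this way would classify over 2^12 = 4096 possible states, compared with 2^8 = 256 for two speakers. This growth in the label space is one reason the move from dyadic to triadic prediction is not purely mechanical.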