CL · Jun 4, 2025

Voice Activity Projection Model with Multimodal Encoders

arXiv:2506.03980v12.72 citations · h-index: 49 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of modeling turn-taking in social interactions for improved human-machine interfaces, representing an incremental improvement over existing multimodal VAP models.

The paper tackles turn-taking prediction in human-machine interaction by proposing a multimodal voice activity projection (VAP) model enhanced with pre-trained audio and face encoders that capture subtle expressions. The model performs competitively with, and in some cases better than, state-of-the-art models on turn-taking metrics.

Turn-taking management is crucial for any social interaction. Still, it is challenging to model human-machine interaction due to the complexity of the social context and its multimodal nature. Unlike conventional systems based on silence duration, existing voice activity projection (VAP) models successfully utilized a unified representation of turn-taking behaviors as prediction targets, which improved turn-taking prediction performance. Recently, a multimodal VAP model outperformed the previous state-of-the-art model by a significant margin. In this paper, we propose a multimodal model enhanced with pre-trained audio and face encoders to improve performance by capturing subtle expressions. Our model performed competitively with, and in some cases even better than, state-of-the-art models on turn-taking metrics. All source code and pretrained models are available at https://github.com/sagatake/VAPwithAudioFaceEncoders.

