Visual Cues Support Robust Turn-taking Prediction in Noise
This work addresses the challenge of robust turn-taking prediction in noisy environments for human-robot interaction. The contribution is incremental, since it builds on existing predictive turn-taking models (PTTMs) by adding visual cues.
The study tackles the noise sensitivity of predictive turn-taking models (PTTMs) in human-robot interaction. It shows that hold/shift accuracy drops from 84% in clean speech to 52% in 10 dB music noise, whereas a multimodal PTTM with visual features, trained with noisy data, reaches 72% in the same noise condition.
Accurate predictive turn-taking models (PTTMs) are essential for naturalistic human-robot interaction. However, little is known about their performance in noise. This study therefore explores PTTM performance in the types of noise likely to be encountered once deployed. Our analyses reveal that PTTMs are highly sensitive to noise: hold/shift accuracy drops from 84% in clean speech to just 52% in 10 dB music noise. Training with noisy data enables a multimodal PTTM, which includes visual features, to better exploit visual cues, reaching 72% accuracy in 10 dB music noise. The multimodal PTTM outperforms the audio-only PTTM across all noise types and SNRs, highlighting its ability to exploit visual cues; however, this does not always generalise to new types of noise. Analysis also reveals that successful training relies on accurate transcription, limiting the use of ASR-derived transcriptions to clean conditions. We make code publicly available for future research.
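To make a condition such as "10 dB music noise" concrete, the sketch below mixes a noise signal into speech at a target signal-to-noise ratio, using the standard definition SNR_dB = 10 log10(P_speech / P_noise). It is an illustration only: the abstract does not describe the study's actual mixing procedure, and the function name `mix_at_snr`, the choice to loop the noise to the speech length, and the synthetic test signals are all assumptions made for this example.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR in dB.

    Hypothetical helper for illustration; not taken from the study's code.
    """
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    if speech.size == 0 or noise.size == 0:
        raise ValueError("speech and noise must be non-empty")

    # Assumption: loop the noise if it is shorter than the speech, then trim.
    reps = int(np.ceil(speech.size / noise.size))
    noise = np.tile(noise, reps)[: speech.size]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if noise_power == 0.0:
        raise ValueError("noise has zero power and cannot be scaled")

    # Solve 10*log10(speech_power / (gain**2 * noise_power)) = snr_db for gain.
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise


# Synthetic stand-ins for real recordings: a tone as "speech", white noise as "music".
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
speech = np.sin(2.0 * np.pi * 220.0 * t)
noise = rng.standard_normal(4000)

noisy = mix_at_snr(speech, noise, snr_db=10.0)

# Recover the SNR actually achieved from the added component.
added = noisy - speech
measured_snr_db = 10.0 * np.log10(np.mean(speech ** 2) / np.mean(added ** 2))
```

Lower `snr_db` values give a louder noise component relative to the speech, so a 10 dB condition leaves the speech power ten times the noise power.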