Representation of perceived prosodic similarity of conversational feedback
This work aims to improve conversational systems by modeling the prosodic cues in vocal feedback more accurately. The contribution is incremental in that it builds on existing representation methods.
The study asks how well existing speech representations capture perceived prosodic similarity in conversational feedback. It finds that spectral and self-supervised representations outperform extracted pitch features, especially for feedback from the same speaker, and that contrastive learning can further align these representations with human perception.
Vocal feedback (e.g., `mhm', `yeah', `okay') is an important component of spoken dialogue and is crucial to ensuring common ground in conversational systems. The exact meaning of such feedback is conveyed through both lexical and prosodic form. In this work, we investigate the perceived prosodic similarity of vocal feedback with the same lexical form, and to what extent existing speech representations reflect such similarities. A triadic comparison task with recruited participants is used to measure perceived similarity of feedback responses taken from two different datasets. We find that spectral and self-supervised speech representations encode prosody better than extracted pitch features, especially in the case of feedback from the same speaker. We also find that it is possible to further condense and align the representations to human perception through contrastive learning.
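To make the link between triadic comparisons and contrastive alignment concrete, the following is a minimal sketch and not the method used in this work. It assumes each listener judgment can be written as a triplet (anchor, more similar item, less similar item), that similarity between representations is measured with cosine distance, and that a triplet margin loss stands in for the contrastive objective. The function names, the margin value, and the toy embeddings are all hypothetical.

```python
import numpy as np


def cosine_distance(a, b):
    """Cosine distance between matching rows of a and b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return 1.0 - np.sum(a * b, axis=-1)


def triplet_distances(emb, triplets):
    """Return (anchor-to-similar, anchor-to-dissimilar) distances per triplet."""
    t = np.asarray(triplets)
    d_pos = cosine_distance(emb[t[:, 0]], emb[t[:, 1]])
    d_neg = cosine_distance(emb[t[:, 0]], emb[t[:, 2]])
    return d_pos, d_neg


def triplet_agreement(emb, triplets):
    """Fraction of triplets where the representation ranks the item judged
    more similar closer to the anchor than the other item."""
    d_pos, d_neg = triplet_distances(emb, triplets)
    return float(np.mean(d_pos < d_neg))


def triplet_margin_loss(emb, triplets, margin=0.2):
    """Hinge-style loss that is zero once every judged-similar item is closer
    to its anchor than the other item by at least `margin` (assumed value)."""
    d_pos, d_neg = triplet_distances(emb, triplets)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))


# Toy embeddings for four feedback tokens (hypothetical values).
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])

# Hypothetical judgments: (anchor, judged more similar, judged less similar).
triplets = [(0, 1, 2), (2, 3, 0), (1, 0, 3)]

agreement = triplet_agreement(emb, triplets)
loss = triplet_margin_loss(emb, triplets)
print(agreement, loss)
```

Under these assumptions, the agreement score gives a way to compare representations (for example pitch features against self-supervised embeddings) on the same set of judgments, while the loss is the kind of quantity that a learned projection of the embeddings could be trained to minimize.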