Oh, Jeez! or Uh-huh? A Listener-aware Backchannel Predictor on ASR Transcriptions
This work addresses conversational AI by improving backchannel prediction. The contribution is incremental: it builds on existing methods and adds a new listener embedding component.
The paper tackles the problem of predicting backchannels in conversations by developing a proactive listener system that uses lexical cues, acoustic cues, and novel listener embeddings to mimic different listeners' backchanneling behaviors. Results on the Switchboard dataset show that acoustic cues are more important than lexical ones, and that combining both with listener embeddings yields the best performance on both manual and automatic transcriptions.
This paper presents our latest investigation into modeling backchannels in conversations. Motivated by a proactive backchanneling theory, we aim to develop a system that acts as a proactive listener by inserting backchannels, such as continuers and assessments, to influence speakers. Our model not only takes lexical and acoustic cues into account, but also introduces the simple and novel idea of using listener embeddings to mimic different backchanneling behaviours. Our experimental results on the Switchboard benchmark dataset reveal that acoustic cues are more important than lexical cues in this task, and that their combination with listener embeddings works best on both manual transcriptions and automatically generated transcriptions.
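To make the listener-embedding idea concrete, the following is a minimal sketch and not the architecture used in the paper. It assumes that each listener ID indexes a vector in a lookup table, that this vector is concatenated with fixed-size lexical and acoustic feature vectors, and that a single linear layer with a softmax scores the result. The class name `ListenerAwarePredictor`, the label set (including a "none" class), all dimensions, and the random weights are hypothetical stand-ins for learned parameters.

```python
import math
import random

# Hypothetical label set: the abstract names continuers and assessments;
# the "none" class (no backchannel) is an assumption for illustration.
LABELS = ["none", "continuer", "assessment"]


def make_matrix(rows, cols, rng):
    """Small random matrix as a list of rows (stand-in for learned weights)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class ListenerAwarePredictor:
    """Illustrative fusion of lexical, acoustic, and listener features."""

    def __init__(self, n_listeners, lex_dim, ac_dim, listener_dim, seed=0):
        rng = random.Random(seed)
        # Embedding table: one vector per listener ID.
        self.listener_table = make_matrix(n_listeners, listener_dim, rng)
        in_dim = lex_dim + ac_dim + listener_dim
        # One weight row and one bias per output label.
        self.weights = make_matrix(len(LABELS), in_dim, rng)
        self.bias = [0.0] * len(LABELS)

    def predict_proba(self, lexical, acoustic, listener_id):
        # Concatenate the three feature sources into one input vector.
        fused = list(lexical) + list(acoustic) + self.listener_table[listener_id]
        scores = [
            sum(w * x for w, x in zip(row, fused)) + b
            for row, b in zip(self.weights, self.bias)
        ]
        return softmax(scores)


# Same speech context, two different listeners.
model = ListenerAwarePredictor(n_listeners=4, lex_dim=3, ac_dim=2, listener_dim=2)
lexical = [0.2, -0.1, 0.4]   # made-up lexical features
acoustic = [0.7, 0.3]        # made-up acoustic features
probs_a = model.predict_proba(lexical, acoustic, listener_id=0)
probs_b = model.predict_proba(lexical, acoustic, listener_id=1)
```

Because the listener vector is part of the input, the same lexical and acoustic context can yield different backchannel distributions for different listeners, which is one plausible way a model could mimic listener-specific behaviour.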