CL, Apr 7, 2020

Interview: A Large-Scale Open-Source Corpus of Media Dialog

Bodhisattwa Prasad Majumder, Shuyang Li, Jianmo Ni, Julian McAuley

arXiv:2004.03090v11.18 citations

Originality: Incremental advance

AI Analysis

The corpus is a valuable resource for developing more engaging and responsive dialog systems, though the advance is incremental because it builds on existing conversational data efforts.

The authors address the lack of large-scale natural speech dialog datasets by introducing 'Interview', a corpus of 105K conversations drawn from news interview transcripts. Language models trained on it show better zero-shot out-of-domain performance on existing spoken dialog datasets than models trained on existing large-scale conversational proxies.

Existing conversational datasets consist either of written proxies for dialog or small-scale transcriptions of natural speech. We introduce 'Interview': a large-scale (105K conversations) media dialog dataset collected from news interview transcripts. Compared to existing large-scale proxies for conversational data, language models trained on our dataset exhibit better zero-shot out-of-domain performance on existing spoken dialog datasets, demonstrating its usefulness in modeling real-world conversations. 'Interview' contains speaker role annotations for each turn, facilitating the development of engaging, responsive dialog systems. In fact, experiments on two dialog tasks show that leveraging such labels improves performance over strong speaker-agnostic baselines and enables models to generate more specific and inquisitive responses in interview-style conversations.

