CL · May 25, 2022

Generating Information-Seeking Conversations from Unlabeled Documents

arXiv:2205.12609v2299 citations
h-index: 45
Originality: Incremental advance
AI Analysis

This work addresses the shortage of training data for conversational AI systems by providing a scalable synthetic resource for researchers and developers. The advance is incremental, however, because it builds on existing CQA methods.

The paper introduces SIMSEEK, a framework for generating information-seeking conversations from unlabeled documents, and shows that its asymmetric variant improves performance on conversational question answering (CQA) tasks. The authors also release a dataset of 2 million CQA pairs, with which their CQA model achieves state-of-the-art performance on the QuAC benchmark.

In this paper, we introduce a novel framework, SIMSEEK (Simulating information-Seeking conversation from unlabeled documents), and compare its two variants. In our baseline, SIMSEEK-SYM, a questioner generates follow-up questions conditioned on an answer predetermined by an answerer. In contrast, SIMSEEK-ASYM first generates the question and then finds its corresponding answer under the conversational context. Our experiments show that both variants can synthesize effective training resources for CQA and conversational search tasks. Conversations from SIMSEEK-ASYM not only yield larger improvements in our experiments but are also favorably reviewed in a human evaluation. Finally, we release a large-scale resource of synthetic conversations, WIKI-SIMSEEK, containing 2 million CQA pairs built upon Wikipedia documents. With this dataset, our CQA model achieves state-of-the-art performance on a recent CQA benchmark, QuAC.
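The difference between the two variants is the order in which each turn is produced, which a minimal sketch can make concrete. This is an illustration only, not the authors' implementation: the function names, the callable signatures, and the toy questioner and answerer below are all hypothetical, and the abstract does not specify which inputs each component actually sees.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (question, answer)
History = List[Turn]


def simulate_sym(document: str,
                 select_answer: Callable[[str, History], str],
                 ask_about: Callable[[str, History, str], str],
                 num_turns: int) -> History:
    """Answer-first loop (SYM-style): the answer is fixed before the question."""
    history: History = []
    for _ in range(num_turns):
        answer = select_answer(document, history)        # answerer goes first
        question = ask_about(document, history, answer)  # question targets that answer
        history.append((question, answer))
    return history


def simulate_asym(document: str,
                  ask: Callable[[str, History], str],
                  find_answer: Callable[[str, History, str], str],
                  num_turns: int) -> History:
    """Question-first loop (ASYM-style): the answer is found after the question."""
    history: History = []
    for _ in range(num_turns):
        question = ask(document, history)                  # questioner goes first
        answer = find_answer(document, history, question)  # answer found in context
        history.append((question, answer))
    return history


# Toy stand-ins (assumptions): a real system would use trained models here.
def _sentences(document: str) -> List[str]:
    return [s.strip() for s in document.split(".") if s.strip()]


def toy_select_answer(document: str, history: History) -> str:
    sents = _sentences(document)
    return sents[len(history) % len(sents)]


def toy_ask_about(document: str, history: History, answer: str) -> str:
    return f"What is said about {answer.split()[0]}?"


def toy_ask(document: str, history: History) -> str:
    return f"Follow-up question {len(history) + 1}?"


def toy_find_answer(document: str, history: History, question: str) -> str:
    sents = _sentences(document)
    return sents[len(history) % len(sents)]


doc = "Seoul is the capital of South Korea. It lies on the Han River."
sym = simulate_sym(doc, toy_select_answer, toy_ask_about, num_turns=2)
asym = simulate_asym(doc, toy_ask, toy_find_answer, num_turns=2)
```

Both loops return a list of (question, answer) pairs. In the answer-first loop, each question can depend on the answer it is meant to elicit, whereas in the question-first loop the question is written without knowing the answer.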

Code Implementations: 1 repo
