CL | AI | Sep 17, 2024

Multi-Document Grounded Multi-Turn Synthetic Dialog Generation

arXiv:2409.11500v1 | 5 citations | h-index: 9
Originality: Incremental advance
AI Analysis

This work addresses the need for scalable, high-quality training data for multi-document grounded dialog systems. It is an incremental advance because it builds on existing methods such as Chain-of-Thought prompting and retrieval-augmented generation.

The paper tackles the problem of generating synthetic multi-turn dialogs grounded in multiple documents, using taxonomy-driven user queries and grounding documents that are re-retrieved after every user turn. It finds that models fine-tuned on this synthetic data outperform those fine-tuned on human-generated data across four benchmark test sets.

We introduce a technique for multi-document grounded multi-turn synthetic dialog generation that incorporates three main ideas. First, we control the overall dialog flow using taxonomy-driven user queries that are generated with Chain-of-Thought (CoT) prompting. Second, we support the generation of multi-document grounded dialogs by mimicking real-world use of retrievers to update the grounding documents after every user turn in the dialog. Third, we apply LLM-as-a-Judge to filter out queries with incorrect answers. Human evaluation of the synthetic dialog data suggests that the data is diverse, coherent, and includes mostly correct answers. Both human and automatic evaluations of answerable queries indicate that models fine-tuned on synthetic dialogs consistently outperform those fine-tuned on existing human-generated training data across four publicly available multi-turn document-grounded benchmark test sets.
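The three ideas in the abstract can be pictured as one generation loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `Turn`, `retrieve`, and `generate_dialog` are hypothetical, the word-overlap retriever is a toy stand-in for a real retriever, and the `ask`, `answer`, and `judge` callables stand in for LLM calls (CoT query generation, response generation, and LLM-as-a-Judge) whose prompts the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    query_type: str        # taxonomy label that steered this user query
    query: str
    documents: List[str]   # grounding passages retrieved for this turn
    answer: str


def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy retriever (assumption): rank passages by word overlap with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def generate_dialog(
    corpus: List[str],
    taxonomy: List[str],
    ask: Callable[[str, List[Turn]], str],
    answer: Callable[[str, List[str], List[Turn]], str],
    judge: Callable[[str, List[str], str], bool],
    k: int = 2,
) -> List[Turn]:
    """Build one synthetic dialog; all three callables are hypothetical LLM stand-ins."""
    history: List[Turn] = []
    for query_type in taxonomy:
        # 1) Taxonomy-driven user query (CoT prompting in the paper).
        query = ask(query_type, history)
        # 2) Refresh the grounding documents after every user turn.
        docs = retrieve(query, corpus, k)
        response = answer(query, docs, history)
        # 3) Keep the turn only if the judge accepts the answer.
        if judge(query, docs, response):
            history.append(Turn(query_type, query, docs, response))
    return history


# Usage with canned stand-ins instead of real LLM calls.
corpus = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "The Seine flows through Paris.",
]
canned = {
    "factoid": "What is the capital of France?",
    "follow_up": "Which river flows through Paris?",
}
dialog = generate_dialog(
    corpus,
    taxonomy=["factoid", "follow_up"],
    ask=lambda query_type, history: canned[query_type],
    answer=lambda query, docs, history: docs[0],
    judge=lambda query, docs, response: response in docs,
)
```

Because retrieval runs inside the loop, the grounding passages can change from one turn to the next, which is what lets a single dialog draw on multiple documents. Dropping a rejected turn entirely is one possible reading of the filtering step; the abstract says only that queries with incorrect answers are filtered out.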

Code Implementations: 1 repo
