CL, AI | Jul 19, 2023

DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI

Jianguo Zhang, Kun Qian, Zhiwei Liu, Shelby Heinecke, Rui Meng, Ye Liu, Zhou Yu, Huan Wang, Silvio Savarese, Caiming Xiong

Salesforce

arXiv:2307.10172v3 | 20.6113 citations | h-index: 112 | Has Code

Originality: Synthesis-oriented

AI Analysis

DialogStudio gives researchers and practitioners in conversational AI a more comprehensive resource, though the contribution is incremental because it aggregates and reformats existing datasets rather than collecting new data.

The authors address the lack of diversity and comprehensiveness in existing dialogue dataset collections by introducing DialogStudio, a unified collection of datasets spanning multiple conversational tasks. They report that models trained on the collection demonstrate its superiority in zero-shot and few-shot learning experiments.

Despite advancements in conversational AI, language models struggle to handle diverse conversational tasks, and existing dialogue dataset collections often lack diversity and comprehensiveness. To tackle these issues, we introduce DialogStudio: the largest and most diverse collection of dialogue datasets, unified under a consistent format while preserving their original information. Our collection encompasses data from open-domain dialogues, task-oriented dialogues, natural language understanding, conversational recommendation, dialogue summarization, and knowledge-grounded dialogues, making it a rich and diverse resource for dialogue research and model training. To further enhance the utility of DialogStudio, we identify the license for each dataset and design external knowledge and domain-aware prompts for selected dialogues to facilitate instruction-aware fine-tuning. Furthermore, we develop conversational AI models using the dataset collection, and our experiments in both zero-shot and few-shot learning scenarios demonstrate the superiority of DialogStudio. To improve transparency and support dataset- and task-based research, as well as language model pre-training, all datasets, licenses, code, and models associated with DialogStudio are publicly accessible at https://github.com/salesforce/DialogStudio.
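To make the idea of "a consistent format while preserving original information" concrete, the following is a minimal sketch in Python. It is not the DialogStudio schema: the class names, field names, the `unify` helper, and the shape of the raw input record are all hypothetical and chosen only for illustration. The sketch assumes a raw dialogue is a list of alternating user and system utterances, each a dictionary with a `text` key plus arbitrary dataset-specific annotations.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class UnifiedTurn:
    # Hypothetical fields, not the actual DialogStudio schema.
    user: str
    system: str
    # Untouched copy of the source annotations, so nothing is lost.
    original: dict[str, Any] = field(default_factory=dict)


@dataclass
class UnifiedDialogue:
    dialogue_id: str
    source_dataset: str
    license: str                      # license identified per dataset
    turns: list[UnifiedTurn]
    prompts: list[str] = field(default_factory=list)  # optional task prompts


def unify(raw: dict[str, Any], source: str, license_name: str) -> UnifiedDialogue:
    """Convert one toy raw record into the hypothetical unified form.

    Assumes raw["utterances"] alternates user, system, user, ...
    """
    utterances = raw["utterances"]
    turns = []
    # Pair each user utterance with the system reply that follows it.
    for i in range(0, len(utterances), 2):
        user = utterances[i]
        # A trailing user utterance without a reply gets an empty response.
        system = utterances[i + 1] if i + 1 < len(utterances) else {"text": ""}
        turns.append(
            UnifiedTurn(
                user=user["text"],
                system=system["text"],
                original={"user": user, "system": system},
            )
        )
    return UnifiedDialogue(
        dialogue_id=str(raw["id"]),
        source_dataset=source,
        license=license_name,
        turns=turns,
    )
```

Calling `unify(record, "toy_dataset", "MIT")` on such a record yields one `UnifiedDialogue` whose turns share a common shape across datasets, while each turn's `original` field keeps the source annotations (intents, slots, and so on) for tasks that need them.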
