CL · Dec 20, 2022

SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization

Hyunwoo Kim, Jack Hessel, Liwei Jiang, Peter West, Ximing Lu, Youngjae Yu, Pei Zhou, Ronan Le Bras, Malihe Alikhani, Gunhee Kim, Maarten Sap, Yejin Choi

AI2, CMU, NVIDIA, UW

arXiv:2212.10465v3 · 24.2209 citations · h-index: 111 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses data scarcity for researchers and practitioners in open-domain social dialogue by providing a large-scale dataset and a more natural conversation model, though the advance is incremental because it builds on existing knowledge graph and LLM methods.

The authors tackled data scarcity in open-domain social dialogue by creating SODA, a million-scale, high-quality dataset distilled from a large language model using social commonsense contextualization. Human evaluation found its conversations more consistent, specific, and natural than those in prior human-authored datasets. They used SODA to train COSMO, a conversation model that outperforms state-of-the-art models such as GODEL and BlenderBot-1 in naturalness and consistency on unseen datasets, and whose responses are sometimes preferred even over human-written ones.

Data scarcity has been a long-standing issue in the field of open-domain social dialogue. To quench this thirst, we present SODA: the first publicly available, million-scale high-quality social dialogue dataset. By contextualizing social commonsense knowledge from a knowledge graph, we are able to distill an exceptionally broad spectrum of social interactions from a large language model. Human evaluation shows that conversations in SODA are more consistent, specific, and (surprisingly) natural than those in prior human-authored datasets. Using SODA, we train COSMO: a generalizable conversation model that is significantly more natural and consistent on unseen datasets than best-performing conversation models (e.g., GODEL, BlenderBot-1, Koala, Vicuna). Experiments reveal COSMO is sometimes even preferred to the original human-written gold responses. Additionally, our results shed light on the distinction between knowledge-enriched conversations and natural social chitchats. We plan to make our data, model, and code public.

