SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization
SODA addresses data scarcity for researchers and practitioners in open-domain social dialogue by providing a large-scale dataset and a conversation model trained on it that responds more naturally. The contribution is incremental in that it builds on existing knowledge graph and LLM methods.
The authors tackle data scarcity in open-domain social dialogue by creating SODA, a million-scale, high-quality dataset distilled from a large language model using social commonsense contextualization. Human evaluation found its conversations more consistent, specific, and natural than those in prior human-authored datasets. They use SODA to train COSMO, a conversation model that outperforms strong conversation models such as GODEL and BlenderBot-1 in naturalness and consistency on unseen datasets, and whose responses are sometimes preferred even over human-written ones.
Data scarcity has been a long-standing issue in the field of open-domain social dialogue. To quench this thirst, we present SODA: the first publicly available, million-scale high-quality social dialogue dataset. By contextualizing social commonsense knowledge from a knowledge graph, we are able to distill an exceptionally broad spectrum of social interactions from a large language model. Human evaluation shows that conversations in SODA are more consistent, specific, and (surprisingly) natural than those in prior human-authored datasets. Using SODA, we train COSMO: a generalizable conversation model that is significantly more natural and consistent on unseen datasets than best-performing conversation models (e.g., GODEL, BlenderBot-1, Koala, Vicuna). Experiments reveal COSMO is sometimes even preferred to the original human-written gold responses. Additionally, our results shed light on the distinction between knowledge-enriched conversations and natural social chitchats. We plan to make our data, model, and code public.
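To make the idea of contextualizing knowledge-graph entries concrete, the following is a minimal sketch of how a commonsense triple might be turned into a dialogue-generation prompt for a large language model. It is an illustration under stated assumptions, not the authors' pipeline: the `Triple` type, the relation labels (`wants`, `feels`), the sentence templates, the `PersonX`/`PersonY` placeholders, and the prompt wording are all hypothetical, and the call to a language model is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """A (head, relation, tail) entry from a commonsense knowledge graph."""
    head: str
    relation: str
    tail: str


# Hypothetical templates that verbalize a relation as a short sentence.
# The relation labels and wording are assumptions made for illustration.
RELATION_TEMPLATES = {
    "wants": "{head}. As a result, PersonX wants {tail}.",
    "feels": "{head}. As a result, PersonX feels {tail}.",
}


def triple_to_sentence(triple: Triple) -> str:
    """Verbalize a triple using the template registered for its relation."""
    template = RELATION_TEMPLATES[triple.relation]
    return template.format(head=triple.head, tail=triple.tail)


def fill_names(text: str, names: dict[str, str]) -> str:
    """Replace placeholder participants (e.g. PersonX) with concrete names."""
    for placeholder, name in names.items():
        text = text.replace(placeholder, name)
    return text


def build_dialogue_prompt(triple: Triple, names: dict[str, str]) -> str:
    """Compose a prompt asking a language model to continue with a dialogue."""
    context = fill_names(triple_to_sentence(triple), names)
    speakers = " and ".join(names.values())
    return f"{context}\nThe following is a conversation between {speakers}:\n"


example = Triple(
    head="PersonX asks PersonY for advice",
    relation="wants",
    tail="to improve",
)
prompt = build_dialogue_prompt(example, {"PersonX": "Madeleine", "PersonY": "Coach"})
print(prompt)
```

The resulting string would be passed to a language model, and the generated continuation kept as one candidate conversation. Grounding each prompt in a different triple is one plausible way to obtain a broad range of social situations rather than repeated generic small talk.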