Mar 17, 2023

CHAMPAGNE: Learning Real-world Conversation from Large-Scale Web Videos

AI2, Stanford
arXiv:2303.09713v2 | 22 citations | h-index: 111
Originality: Incremental advance
AI Analysis

This work aims to enable more realistic, visually grounded conversations for AI systems. It is an incremental advance: it builds on existing vision-language methods and contributes a new dataset and model.

The authors address the limitation that most neural conversational models handle only text. They introduce CHAMPAGNE, a generative conversation model that incorporates visual context, and collect YTD-18M, a large-scale corpus of 18M video-based dialogues. Human evaluation found YTD-18M to be more sensible and specific than prior resources, and CHAMPAGNE, when fine-tuned, achieved state-of-the-art results on four vision-language tasks.

Visual information is central to conversation: body gestures and physical behaviour, for example, contribute to meaning that transcends words alone. To date, however, most neural conversational models are limited to just text. We introduce CHAMPAGNE, a generative model of conversations that can account for visual contexts. To train CHAMPAGNE, we collect and release YTD-18M, a large-scale corpus of 18M video-based dialogues. YTD-18M is constructed from web videos: crucial to our data collection pipeline is a pretrained language model that converts error-prone automatic transcripts to a cleaner dialogue format while maintaining meaning. Human evaluation reveals that YTD-18M is more sensible and specific than prior resources (MMDialog, 1M dialogues), while maintaining visual-groundedness. Experiments demonstrate that 1) CHAMPAGNE learns to conduct conversation from YTD-18M; and 2) when fine-tuned, it achieves state-of-the-art results on four vision-language tasks focused on real-world conversations. We release data, models, and code.

Code Implementations: 1 repo
