CV | CL | Aug 31, 2023

Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models

Cambridge | DeepMind
arXiv:2308.16463v3 | 32 citations | h-index: 31 | Has Code
Originality: Incremental advance
AI Analysis

The work addresses a bottleneck for applications that require coherent multi-image dialogue, but the advance is incremental because it builds on existing models such as MiniGPT-4 and LLaVA.

The paper tackles the difficulty that multimodal instruction-following models have in keeping a dialogue coherent across multiple images. It introduces SparklesDialogue, a dataset of interleaved multi-image and text interactions, and SparklesChat, a model trained on that dataset. Training on SparklesDialogue improves multi-image comprehension without harming single-image capabilities.

Large language models exhibit enhanced zero-shot performance on various tasks when fine-tuned with instruction-following data. Multimodal instruction-following models extend these capabilities by integrating both text and images. However, existing models such as MiniGPT-4 and LLaVA face challenges in maintaining dialogue coherence in scenarios involving multiple images. A primary reason is the lack of a specialized dataset for this critical application. To bridge these gaps, we introduce SparklesDialogue, the first machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions. Furthermore, we construct SparklesEval, a GPT-assisted benchmark for quantitatively assessing a model's conversational competence across multiple images and dialogue turns. We then present SparklesChat, a multimodal instruction-following model for open-ended dialogues across multiple images. Our experiments validate the effectiveness of training SparklesChat with SparklesDialogue based on MiniGPT-4 and LLaVA-v1.5, which enhances comprehension across multiple images and dialogue turns, and does not compromise single-image understanding capabilities. Qualitative evaluations further demonstrate SparklesChat's generality in handling real-world applications. All resources related to this study are publicly available at https://github.com/HYPJUDY/Sparkles.

Code Implementations: 1 repo
