CL · AI · Jun 26, 2024

S3: A Simple Strong Sample-effective Multimodal Dialog System

arXiv:2406.18305v1 · Has Code
Originality: Incremental advance
AI Analysis

S3 provides a strong baseline for multimodal dialog tasks, though it appears incremental because it builds on existing pre-trained models and methods.

The authors address multimodal dialog by proposing S3, a simple baseline model that achieves near state-of-the-art results on the MMMU and AI Journey Contest 2023 leaderboards while being trained on only a small amount of multimodal data.

In this work, we present a conceptually simple yet powerful baseline for the multimodal dialog task, an S3 model, that achieves near state-of-the-art results on two compelling leaderboards: MMMU and AI Journey Contest 2023. The system is based on a pre-trained large language model, pre-trained modality encoders for image and audio, and a trainable modality projector. The proposed effective data mixture for training such an architecture demonstrates that a multimodal model based on a strong language model and trained on a small amount of multimodal data can perform efficiently in the task of multimodal dialog.
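
The architecture described above (frozen pre-trained encoders, a trainable projector, and a pre-trained language model) can be illustrated with a minimal sketch. This is not the authors' implementation: the single linear layer, the dimensions, the function names (`make_projector`, `project`, `build_llm_input`), and the choice to prepend modality tokens to the text tokens are all assumptions made for illustration, since the abstract does not specify them.

```python
import random

def make_projector(in_dim, out_dim, seed=0):
    """Create a hypothetical trainable projector: one linear layer (assumption)."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.02) for _ in range(out_dim)] for _ in range(in_dim)]
    bias = [0.0] * out_dim
    return weight, bias

def project(features, projector):
    """Map encoder features (n_tokens x in_dim) into the LLM embedding space."""
    weight, bias = projector
    return [
        [sum(x * w_row[j] for x, w_row in zip(vec, weight)) + bias[j]
         for j in range(len(bias))]
        for vec in features
    ]

def build_llm_input(text_embeddings, modality_features, projector):
    """Prepend projected modality tokens to the text token embeddings.

    The ordering is an assumption; the encoder and the language model
    would stay frozen while only the projector is trained.
    """
    return project(modality_features, projector) + text_embeddings

# Toy example: 2 image tokens of size 4, projected to an LLM width of 3.
projector = make_projector(in_dim=4, out_dim=3)
image_features = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
text_embeddings = [[0.0, 0.0, 0.0] for _ in range(5)]
sequence = build_llm_input(text_embeddings, image_features, projector)
print(len(sequence), len(sequence[0]))
```

In this toy setup the language model would receive a sequence of seven embeddings (two projected image tokens followed by five text tokens), each of width three.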

Code Implementations: 1 repo
