CL, AI, CV | Oct 25, 2023

Apollo: Zero-shot MultiModal Reasoning with Multiple Experts

arXiv:2310.18369v1 | h-index: 30 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the cost of multimodal training by enabling zero-shot, decentralized reasoning across modalities. The advance is incremental, since it builds on existing foundation models rather than introducing a new one.

The authors tackle complex multimodal tasks without task-specific training by proposing a modular framework that combines multiple foundation models across modalities. The framework outperforms semi-supervised state-of-the-art models on stylized image captioning and is also demonstrated on a novel task, audio-aware image captioning.

We propose a modular framework that leverages the expertise of different foundation models over different modalities and domains in order to perform a single, complex, multi-modal task, without relying on prompt engineering or otherwise tailor-made multi-modal training. Our approach enables decentralized command execution and allows each model to both contribute and benefit from the expertise of the other models. Our method can be extended to a variety of foundation models (including audio and vision), above and beyond only language models, as it does not depend on prompts. We demonstrate our approach on two tasks. On the well-known task of stylized image captioning, our experiments show that our approach outperforms semi-supervised state-of-the-art models, while being zero-shot and avoiding costly training, data collection, and prompt engineering. We further demonstrate this method on a novel task, audio-aware image captioning, in which an image and audio are given and the task is to generate text that describes the image within the context of the provided audio. Our code is available on GitHub.
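
The idea of several experts jointly steering one caption can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how the experts interact, so the sketch substitutes a simple weighted sum of per-expert log-probabilities (a product-of-experts re-ranking) over a tiny vocabulary. All names here (`language_expert`, `vision_expert`, `style_expert`, `greedy_caption`) and all scores are hypothetical stand-ins for real foundation models.

```python
import math
from typing import Callable, Dict, List

# Hypothetical interface: an expert scores a candidate word given the caption so far.
Expert = Callable[[List[str], str], float]

# Toy bigram table standing in for a language model's notion of fluency.
FOLLOWERS = {
    None: {"a"},
    "a": {"lovely", "gloomy", "dog", "cat"},
    "lovely": {"dog", "cat"},
    "gloomy": {"dog", "cat"},
    "dog": {"<eos>"},
    "cat": {"<eos>"},
}

def language_expert(prefix: List[str], word: str) -> float:
    last = prefix[-1] if prefix else None
    return 2.0 if word in FOLLOWERS.get(last, set()) else -2.0

def vision_expert(prefix: List[str], word: str) -> float:
    # Pretend the image shows a dog (assumption for the demo).
    return {"dog": 1.5, "cat": -1.5}.get(word, 0.0)

def style_expert(prefix: List[str], word: str) -> float:
    # Pretend the requested style is positive (assumption for the demo).
    return {"lovely": 2.0, "gloomy": -2.0}.get(word, 0.0)

def log_softmax(scores: Dict[str, float]) -> Dict[str, float]:
    m = max(scores.values())
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return {w: s - log_z for w, s in scores.items()}

def combine(experts: Dict[str, Expert], weights: Dict[str, float],
            prefix: List[str], vocab: List[str]) -> Dict[str, float]:
    """Weighted sum of each expert's log-probabilities over the vocabulary."""
    total = {w: 0.0 for w in vocab}
    for name, expert in experts.items():
        log_probs = log_softmax({w: expert(prefix, w) for w in vocab})
        for w in vocab:
            total[w] += weights[name] * log_probs[w]
    return total

def greedy_caption(experts: Dict[str, Expert], weights: Dict[str, float],
                   vocab: List[str], max_len: int = 6) -> List[str]:
    """Pick the jointly best word at each step until <eos> or max_len."""
    prefix: List[str] = []
    for _ in range(max_len):
        scores = combine(experts, weights, prefix, vocab)
        best = max(vocab, key=lambda w: scores[w])
        if best == "<eos>":
            break
        prefix.append(best)
    return prefix

VOCAB = ["a", "lovely", "gloomy", "dog", "cat", "<eos>"]
ALL_EXPERTS = {"language": language_expert, "vision": vision_expert, "style": style_expert}
ALL_WEIGHTS = {"language": 1.0, "vision": 1.0, "style": 1.0}

styled = greedy_caption(ALL_EXPERTS, ALL_WEIGHTS, VOCAB)
plain = greedy_caption(
    {k: v for k, v in ALL_EXPERTS.items() if k != "style"},
    {"language": 1.0, "vision": 1.0},
    VOCAB,
)
print(" ".join(styled))
print(" ".join(plain))
```

Dropping the style expert changes the caption without retraining anything, which is the modularity the abstract emphasizes. In a real system each toy function would be replaced by a pretrained model's score, and the combination rule could differ substantially from the weighted sum assumed here.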

Code Implementations: 1 repo
