CV · AI · CL · LG · Jan 30, 2025

LLMs can see and hear without any training

Meta AI
arXiv:2501.18096v1 · 11 citations · h-index: 30 · ICML
Originality: Highly original
AI Analysis

The approach offers a training-free route to multimodal tasks, which could reduce the data and compute costs of building captioning and media-generation applications.

The paper tackles the problem of enabling multimodal capabilities in LLMs without training, achieving a new state-of-the-art in zero-shot image, video, and audio captioning.

We present MILS: Multimodal Iterative LLM Solver, a surprisingly simple, training-free approach to imbue multimodal capabilities into your favorite LLM. Leveraging the LLM's innate ability to perform multi-step reasoning, MILS prompts it to generate candidate outputs, each of which is scored and fed back iteratively, eventually generating a solution to the task. This enables various applications that typically require training specialized models on task-specific data. In particular, we establish a new state-of-the-art on emergent zero-shot image, video and audio captioning. MILS seamlessly applies to media generation as well, discovering prompt rewrites to improve text-to-image generation, and even editing prompts for style transfer! Finally, being a gradient-free optimization approach, MILS can invert multimodal embeddings into text, enabling applications like cross-modal arithmetic.
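The generate, score, and feed-back loop described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. Every name here (`iterative_solver`, `toy_generate`, `toy_score`, `VOCAB`, `HIDDEN`) is hypothetical, and the generator and scorer are toy stand-ins. In a real setting the generator would presumably be an LLM prompted with the scored candidates, and the scorer a model able to rate a candidate against the media input. Here the "LLM" randomly rewrites one word, and the "scorer" measures word overlap with a hidden description.

```python
import random
from typing import Callable, List, Tuple

Scored = Tuple[float, str]  # (score, candidate text)


def iterative_solver(
    generate: Callable[[List[Scored]], List[str]],
    score: Callable[[str], float],
    steps: int = 20,
    keep: int = 3,
) -> Tuple[List[Scored], List[float]]:
    """Gradient-free loop: propose candidates, score them, feed the best back."""
    best: List[Scored] = []
    history: List[float] = []
    for _ in range(steps):
        candidates = generate(best)  # the generator sees the scored feedback
        scored = [(score(text), text) for text in candidates]
        # Merge new candidates with the previous best and keep the top few.
        pool = {text: s for s, text in best + scored}
        best = sorted(((s, t) for t, s in pool.items()), reverse=True)[:keep]
        history.append(best[0][0])
    return best, history


# --- Toy stand-ins (assumptions for illustration only) ---
VOCAB = ["a", "dog", "cat", "runs", "sleeps", "on", "the", "beach", "sofa", "ball"]
HIDDEN = {"dog", "runs", "beach"}  # what the pretend "image" contains
rng = random.Random(0)


def toy_generate(feedback: List[Scored]) -> List[str]:
    """Stand-in for the LLM: random captions first, then one-word rewrites."""
    if not feedback:
        return [" ".join(rng.sample(VOCAB, 3)) for _ in range(8)]
    proposals = []
    for _, text in feedback:
        for _ in range(4):
            words = text.split()
            words[rng.randrange(len(words))] = rng.choice(VOCAB)
            proposals.append(" ".join(words))
    return proposals


def toy_score(text: str) -> float:
    """Stand-in for the scorer: Jaccard overlap with the hidden description."""
    words = set(text.split())
    return len(words & HIDDEN) / len(words | HIDDEN)


best, history = iterative_solver(toy_generate, toy_score)
print("best candidate:", best[0][1], "score:", round(best[0][0], 3))
```

Because the previous best candidates are carried into each round, the top score never decreases from one step to the next. The loop needs only scores, never gradients, which is what lets the same pattern apply whether the scorer rates a caption against an image, a prompt against a generated image, or text against a target embedding.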

Code Implementations: 1 repo