CV AI CL LG · Jan 30, 2025

LLMs can see and hear without any training

Kumar Ashutosh, Yossi Gandelsman, Xinlei Chen, Ishan Misra, Rohit Girdhar

Meta AI

arXiv:2501.18096v116.411 citations · h-index: 30 · Has Code · ICML

Originality: Highly original

AI Analysis

The paper offers a training-free approach to multimodal tasks, which could reduce data and compute costs for users in AI and media applications.

The paper tackles the problem of enabling multimodal capabilities in LLMs without training, achieving a new state-of-the-art in zero-shot image, video, and audio captioning.

We present MILS: Multimodal Iterative LLM Solver, a surprisingly simple, training-free approach to imbue multimodal capabilities into your favorite LLM. Leveraging the LLM's innate ability to perform multi-step reasoning, MILS prompts it to generate candidate outputs, each of which is scored and fed back iteratively, eventually yielding a solution to the task. This enables various applications that typically require training specialized models on task-specific data. In particular, we establish a new state-of-the-art on emergent zero-shot image, video and audio captioning. MILS seamlessly applies to media generation as well, discovering prompt rewrites to improve text-to-image generation, and even editing prompts for style transfer! Finally, being a gradient-free optimization approach, MILS can invert multimodal embeddings into text, enabling applications like cross-modal arithmetic.
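
The generate, score, and feed-back loop described in the abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `mils_style_loop`, `make_toy_generator`, `toy_score`, `VOCAB`, and `TARGET` are hypothetical names, the "generator" is a random word mutator standing in for an LLM, and the "scorer" is a word-overlap measure standing in for a multimodal similarity model.

```python
import random
from typing import Callable, List, Tuple

Scored = List[Tuple[float, str]]  # (score, candidate text), best first

def mils_style_loop(
    generate: Callable[[Scored], List[str]],
    score: Callable[[str], float],
    n_steps: int = 10,
    keep_top: int = 5,
) -> Tuple[float, str]:
    """Gradient-free loop: propose candidates, score them, feed the best back."""
    feedback: Scored = []
    for _ in range(n_steps):
        candidates = generate(feedback)
        # Keep earlier winners so the best score never decreases.
        pool = {text: s for s, text in feedback}
        for text in candidates:
            if text not in pool:
                pool[text] = score(text)
        ranked = sorted(((s, t) for t, s in pool.items()), reverse=True)
        feedback = ranked[:keep_top]
    return feedback[0]

# --- Toy stand-ins (assumptions, not part of the paper) ---
VOCAB = ["a", "dog", "cat", "runs", "sleeps", "on", "grass", "sofa", "red", "brown"]
TARGET = {"brown", "dog", "runs", "grass"}  # pretend "content" of the media

def toy_score(text: str) -> float:
    """Jaccard overlap with TARGET; a real scorer would compare text to media."""
    words = set(text.split())
    return len(words & TARGET) / len(words | TARGET)

def make_toy_generator(seed: int = 0, n_candidates: int = 8):
    """Returns a generator that mutates past winners; a real one would prompt an LLM."""
    rng = random.Random(seed)

    def generate(feedback: Scored) -> List[str]:
        if not feedback:  # first step: no feedback yet, propose from scratch
            return [" ".join(rng.sample(VOCAB, 3)) for _ in range(n_candidates)]
        proposals = []
        for _ in range(n_candidates):
            _, parent = rng.choice(feedback)
            words = parent.split()
            words[rng.randrange(len(words))] = rng.choice(VOCAB)
            proposals.append(" ".join(words))
        return proposals

    return generate

if __name__ == "__main__":
    best_score, best_text = mils_style_loop(make_toy_generator(), toy_score)
    print(f"{best_score:.2f}  {best_text}")
```

In this sketch the only task-specific piece is the scorer: swapping `toy_score` for a different scoring function changes the task without touching the loop, which is one way to read the abstract's claim that the same procedure covers captioning, prompt rewriting, and embedding inversion.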
