CV AI LG PF

Apr 21, 2025

Zero-Shot, But at What Cost? Unveiling the Hidden Overhead of MILS's LLM-CLIP Framework for Image Captioning

Yassir Benhammou, Alessandro Tiberio, Gabriel Trautmann, Suman Kalyan

arXiv:2504.15199v1

h-index: 6

Originality: Synthesis-oriented

AI Analysis

This work exposes trade-offs between quality and cost in multimodal models, providing insights for more efficient design, but it is incremental as it critiques an existing method without proposing a new solution.

The paper investigates the MILS framework for zero-shot image captioning, revealing that its good performance comes with a substantial computational cost due to an expensive multi-step refinement process, while alternative models like BLIP-2 and GPT-4V achieve competitive results more efficiently.

MILS (Multimodal Iterative LLM Solver) is a recently published framework that claims "LLMs can see and hear without any training" by leveraging an iterative, LLM-CLIP-based approach for zero-shot image captioning. While MILS demonstrates good performance, our investigation reveals that this success carries a substantial hidden computational cost, which stems from its expensive multi-step refinement process. In contrast, alternative models such as BLIP-2 and GPT-4V achieve competitive results through a streamlined, single-pass approach. We hypothesize that the significant overhead inherent in MILS's iterative process may undermine its practical benefits, thereby challenging the narrative that zero-shot performance can be attained without incurring heavy resource demands. This work is the first to expose and quantify the trade-offs between output quality and computational cost in MILS, providing critical insights for the design of more efficient multimodal models.
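To make the cost argument concrete, here is a minimal sketch of how model calls accumulate in an iterative generate-and-score loop compared with a single pass. It is not the MILS implementation: the loop structure (propose candidates, score each one, feed the top few back), the stub functions `toy_generate` and `toy_score`, and every parameter value are assumptions chosen only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class CallCounter:
    """Tallies model invocations so the two strategies can be compared."""
    generator_calls: int = 0
    scorer_calls: int = 0


def toy_generate(feedback, n_candidates, rng):
    """Hypothetical stand-in for an LLM proposing candidate captions."""
    base = feedback[0] if feedback else "a photo"
    return [f"{base} v{rng.randint(0, 9999)}" for _ in range(n_candidates)]


def toy_score(caption, rng):
    """Hypothetical stand-in for an image-text similarity scorer."""
    return rng.random()


def iterative_caption(steps, n_candidates, top_k, seed=0):
    """Assumed refinement loop: generate, score every candidate, keep top_k."""
    rng = random.Random(seed)
    counter = CallCounter()
    feedback = []
    for _ in range(steps):
        candidates = toy_generate(feedback, n_candidates, rng)
        counter.generator_calls += 1
        scored = []
        for caption in candidates:
            scored.append((toy_score(caption, rng), caption))
            counter.scorer_calls += 1
        scored.sort(reverse=True)  # highest score first
        feedback = [caption for _, caption in scored[:top_k]]
    return feedback[0], counter


def single_pass_caption(seed=0):
    """One generator call and no external scoring, as in a single-pass captioner."""
    rng = random.Random(seed)
    counter = CallCounter()
    caption = toy_generate([], 1, rng)[0]
    counter.generator_calls += 1
    return caption, counter


if __name__ == "__main__":
    _, iterative = iterative_caption(steps=10, n_candidates=50, top_k=5)
    _, single = single_pass_caption()
    print("iterative:", iterative)
    print("single pass:", single)
```

With the illustrative settings above, the iterative loop makes `steps` generator calls and `steps * n_candidates` scorer calls, while the single pass makes one generator call. The real ratio depends on the actual step and candidate counts, which this listing does not state.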
