How (not) to ensemble LVLMs for VQA
This work addresses the challenge of effectively combining diverse LVLMs for visual question answering, but it is incremental: it builds on classical ensembling techniques without introducing new paradigms.
The paper investigates ensembling methods for Large Vision-Language Models (LVLMs) on the Encyclopedic-VQA task, finding that while oracle experiments suggest potential accuracy gains from 48.8% to 67%, practical ensembling yields limited real improvements.
This paper studies ensembling in the era of Large Vision-Language Models (LVLMs). Ensembling is a classical method for combining different models to improve performance. In the recent work on Encyclopedic-VQA, the authors examine a wide variety of models to solve their task: from vanilla LVLMs, to models including the caption as extra context, to models augmented with Lens-based retrieval of Wikipedia pages. Intuitively, these models are highly complementary, which should make them ideal for ensembling. Indeed, an oracle experiment shows potential gains from 48.8% accuracy (the best single model) all the way up to 67% (the best possible ensemble). So it should be a trivial exercise to create an ensemble with substantial real gains. Or is it?
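To make the gap between the oracle number and a practical ensemble concrete, here is a minimal sketch of how such an oracle upper bound is typically computed. It assumes the oracle counts a question as solved whenever at least one ensemble member answers it correctly; the abstract does not spell out the exact protocol. The model names, the toy correctness flags, and the `majority_vote` baseline are hypothetical illustrations, not the paper's data or method.

```python
from collections import Counter


def best_single_accuracy(correct):
    """correct: dict mapping model name -> list of bools, one per question."""
    return max(sum(flags) / len(flags) for flags in correct.values())


def oracle_accuracy(correct):
    """Upper bound: a question is solved if ANY model answers it correctly.

    This assumes a perfect selector that always picks a correct model
    when one exists, which no real ensemble can be expected to match.
    """
    per_question = zip(*correct.values())
    solved = [any(flags) for flags in per_question]
    return sum(solved) / len(solved)


def majority_vote(answers):
    """A simple practical baseline: return the most frequent answer.

    Ties are broken in favour of the answer listed first, so callers
    can order models by trust.
    """
    counts = Counter(answers)
    top = max(counts.values())
    for answer in answers:
        if counts[answer] == top:
            return answer


# Hypothetical toy data: three models, five questions.
toy = {
    "vanilla": [True, False, False, True, False],
    "caption": [True, True, False, False, False],
    "retrieval": [False, True, True, True, False],
}

print(best_single_accuracy(toy))  # 0.6
print(oracle_accuracy(toy))       # 0.8
print(majority_vote(["Paris", "Lyon", "Lyon"]))  # Lyon
```

The oracle selects the right model per question with knowledge of the ground truth, whereas a practical scheme such as voting only sees the models' outputs. That difference is why a large oracle gap does not by itself guarantee real gains.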