CV Oct 28, 2024

Improving Generalization in Visual Reasoning via Self-Ensemble

Tien-Huy Nguyen, Quang-Khai Tran, Anh-Tuan Quang-Hoang

arXiv:2410.20883v2 13.513 citations h-index: 3 ECCV Workshops

Originality: Incremental advance

AI Analysis

This work addresses the resource-intensive training of large vision-language models for visual reasoning by offering a more efficient, training-free approach. The advance is incremental, since it builds on existing ensemble methods.

The paper tackles the problem of improving generalization in visual reasoning without costly training by proposing a training-free self-ensemble method that leverages a single large vision-language model's internal capabilities, achieving state-of-the-art performance on benchmarks like SketchyVQA, Outside Knowledge VQA, and out-of-distribution VQA tasks.

Visual reasoning requires integrating multimodal perceptual processing with commonsense and external knowledge of the world. In recent years, many large vision-language models (LVLMs) have been proposed, demonstrating strong commonsense reasoning across diverse domains and tasks. Nevertheless, training such LVLMs is resource-intensive. Rather than training LVLMs from scratch on large datasets, recent approaches explore ways to combine the capabilities of several different LVLMs, for example through ensemble methods. In this work, we propose self-ensemble, a novel training-free method that improves the generalization and visual reasoning of a model without updating any parameters. Our key insight is that an LVLM can ensemble with itself, without the need for any other LVLMs, which helps unlock its internal capabilities. Extensive experiments on various benchmarks demonstrate the effectiveness of our method, which achieves state-of-the-art (SOTA) performance on SketchyVQA, Outside Knowledge VQA, and out-of-distribution VQA tasks.
