CV, AI, CL, LG | May 24, 2024

Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement

arXiv:2405.15973v4 | 73 citations | h-index: 36 | NAACL
Originality: Highly original
AI Analysis

The paper addresses unstable alignment in LVLMs on tasks such as visual question answering, offering a more controllable method that requires no external models or data.

The paper tackles visual-language modality alignment in large vision-language models (LVLMs) by proposing SIMA, a self-improvement framework with no external dependencies. SIMA builds preference pairs from self-generated responses using an in-context self-critic mechanism, and the authors report significant performance improvements across 14 benchmarks.

Large vision-language models (LVLMs) have achieved impressive results in visual question-answering and reasoning tasks through vision instruction tuning on specific datasets. However, there remains significant room for improvement in aligning visual and language modalities. Existing methods often depend on external models or data, leading to uncontrollable and unstable alignment results. In this paper, we propose SIMA, a self-improvement framework that enhances visual and language modality alignment without external dependencies. SIMA leverages existing vision instruction tuning datasets to self-generate responses, incorporating an in-context self-critic mechanism that constructs preference pairs for tuning. Crucially, our approach allows LVLMs to act as critics by designing effective critic prompts, eliminating the need for additional fine-tuning with external instruction data. We introduce three novel visual metrics within the self-critic process to guide judgment, significantly improving the accuracy of the self-critic. Through extensive experiments across 14 hallucination and comprehensive benchmarks, we demonstrate that SIMA significantly improves LVLM performance and outperforms previous approaches, achieving superior modality alignment.
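The preference-pair step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `PreferencePair`, `build_preference_pairs`, `generate`, and `critic_prefers_first` are hypothetical, images are omitted for brevity, and the two callables are assumed to wrap the same LVLM (once as a responder, once as a critic driven by a critic prompt).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PreferencePair:
    """One training example for preference tuning (hypothetical schema)."""
    prompt: str
    chosen: str
    rejected: str


def build_preference_pairs(
    prompts: List[str],
    generate: Callable[[str], Tuple[str, str]],
    critic_prefers_first: Callable[[str, str, str], bool],
) -> List[PreferencePair]:
    """Turn self-generated response candidates into preference pairs.

    `generate` returns two candidate responses for a prompt, and
    `critic_prefers_first` returns True when the critic judges the first
    candidate better. Both are assumed to be backed by the same model,
    so no external model or data is involved.
    """
    pairs = []
    for prompt in prompts:
        first, second = generate(prompt)
        if first == second:
            # Identical candidates carry no preference signal; skip them.
            continue
        if critic_prefers_first(prompt, first, second):
            pairs.append(PreferencePair(prompt, first, second))
        else:
            pairs.append(PreferencePair(prompt, second, first))
    return pairs


# Toy stand-ins so the sketch runs without a model (assumptions only).
def toy_generate(prompt: str) -> Tuple[str, str]:
    return (prompt + " a red bus", prompt + " a bus")


def toy_critic(prompt: str, first: str, second: str) -> bool:
    # Placeholder rule: prefer the more detailed (longer) response.
    return len(first) >= len(second)


pairs = build_preference_pairs(["What is shown?"], toy_generate, toy_critic)
```

In a real setup the critic callable would format a critic prompt, such as one that encodes the visual metrics mentioned in the abstract, and parse the model's judgment; the resulting pairs would then feed a preference-tuning step.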

Code Implementations: 2 repos