CL AI CV LG
Feb 19, 2025

Transferring Textual Preferences to Vision-Language Understanding through Model Merging

arXiv:2502.13487v2
3 citations
h-index: 6
ACL
Originality: Incremental advance
AI Analysis

The method offers an efficient way to improve multimodal content evaluation in AI systems. It is an incremental advance, since it builds on existing models and requires no new training.

The paper tackles two problems: large vision-language models (LVLMs) are limited in their ability to evaluate generated content, and training vision-language reward models (VLRMs) is computationally expensive. It proposes a training-free method that merges text-based reward models with LVLMs to create VLRMs, which outperform both LVLMs' own scoring and text-based reward models.

Large vision-language models (LVLMs) perform outstandingly across various multimodal tasks. However, their ability to evaluate generated content remains limited, and training vision-language reward models (VLRMs) with preference data is computationally expensive. This paper explores a training-free alternative by merging text-based reward models (RMs) with LVLMs to create VLRMs. Our approach shows that integrating these models leads to improved performance over LVLMs' scoring and text-based RMs, offering an efficient method for incorporating textual preferences into LVLMs.
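The merging idea in the abstract can be illustrated with a minimal sketch. It assumes the LVLM and the text-based RM were both fine-tuned from the same base language model, and it uses task arithmetic, one common merging strategy; the summary above does not state which merging rule the paper uses, so this is an illustration rather than the authors' recipe. All parameter names (`layer.0.weight`, `vision_encoder.weight`, `reward_head.weight`) and the scaling factor `lam` are hypothetical.

```python
from typing import Dict, List

# A toy "state dict": parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def merge_task_arithmetic(
    base: Params, lvlm: Params, text_rm: Params, lam: float = 1.0
) -> Params:
    """Merge parameters present in all three models.

    merged = base + lam * ((lvlm - base) + (text_rm - base))
    """
    merged: Params = {}
    for name, base_w in base.items():
        if name in lvlm and name in text_rm:
            merged[name] = [
                b + lam * ((v - b) + (r - b))
                for b, v, r in zip(base_w, lvlm[name], text_rm[name])
            ]
    return merged


def build_vlrm(
    base: Params, lvlm: Params, text_rm: Params, lam: float = 1.0
) -> Params:
    """Assemble a reward model without any training (hypothetical layout)."""
    vlrm = merge_task_arithmetic(base, lvlm, text_rm, lam)
    # Modules that exist in only one model are copied unchanged,
    # e.g. the LVLM's vision encoder and the RM's reward head.
    for source in (lvlm, text_rm):
        for name, weights in source.items():
            if name not in vlrm:
                vlrm[name] = list(weights)
    return vlrm


# Toy example with made-up weights.
base = {"layer.0.weight": [1.0, 2.0]}
lvlm = {"layer.0.weight": [1.5, 2.0], "vision_encoder.weight": [0.25]}
text_rm = {"layer.0.weight": [1.0, 3.0], "reward_head.weight": [0.5]}

vlrm = build_vlrm(base, lvlm, text_rm)
print(vlrm)
```

In this toy setup the shared layer picks up both fine-tuning deltas, while the vision encoder and the reward head are carried over as they are. With real checkpoints the same loop would run over tensors instead of lists.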

