Transferring Textual Preferences to Vision-Language Understanding through Model Merging
Merging text-based reward models into LVLMs offers an efficient way to improve multimodal content evaluation in AI systems, though the contribution is incremental: it combines existing models rather than training new ones.
The paper addresses two problems: large vision-language models (LVLMs) are limited in their ability to evaluate generated content, and training vision-language reward models (VLRMs) is computationally expensive. It proposes a training-free method that merges text-based reward models with LVLMs to create VLRMs, which perform better than existing scoring methods.
Large vision-language models (LVLMs) perform strongly across a range of multimodal tasks. However, their ability to evaluate generated content remains limited, and training vision-language reward models (VLRMs) with preference data is computationally expensive. This paper explores a training-free alternative: merging text-based reward models (RMs) with LVLMs to create VLRMs. The merged models outperform both the LVLMs' own scoring and the text-based RMs, offering an efficient way to incorporate textual preferences into LVLMs.
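To make the idea of training-free merging concrete, here is a minimal sketch. The abstract does not say which merging strategy is used, so this example assumes the simplest one, linear interpolation of weights, and assumes the LVLM and the text-based RM share a language backbone with identically named parameters. The function `merge_weights`, the coefficient `lam`, and all parameter names are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def merge_weights(lvlm: Params, text_rm: Params, lam: float = 0.5) -> Params:
    """Merge two checkpoints without any training.

    Parameters present in both models are linearly interpolated
    (assumed strategy, not necessarily the paper's). Parameters found
    in only one model, such as a vision encoder or a reward head,
    are copied over unchanged.
    """
    merged: Params = {}
    for name in sorted(lvlm.keys() | text_rm.keys()):
        if name in lvlm and name in text_rm:
            a, b = lvlm[name], text_rm[name]
            if len(a) != len(b):
                raise ValueError(f"shape mismatch for parameter {name!r}")
            merged[name] = [(1 - lam) * x + lam * y for x, y in zip(a, b)]
        elif name in lvlm:
            merged[name] = list(lvlm[name])
        else:
            merged[name] = list(text_rm[name])
    return merged


# Hypothetical checkpoints: the decoder layer is shared, the rest is not.
lvlm = {"vision_encoder.w": [1.0, 2.0], "decoder.layer0.w": [0.0, 4.0]}
text_rm = {"decoder.layer0.w": [2.0, 0.0], "reward_head.w": [0.5]}

merged = merge_weights(lvlm, text_rm, lam=0.5)
print(merged["decoder.layer0.w"])
```

In this sketch the merged model keeps the LVLM's vision encoder and the RM's reward head, while the shared decoder weights sit between the two sources. The value of `lam` controls how strongly the textual preferences are pulled in and would have to be chosen on held-out data.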