CV · Aug 19, 2024

CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs

Yassine Ouali, Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos

arXiv:2408.10433v1 · 26.954 citations · h-index: 32

Originality: Incremental advance

AI Analysis

The work addresses robustness issues that limit real-world deployment of LVLMs, though it is incremental, building on existing DPO and CLIP methods.

The paper tackles hallucinations in Large Vision-Language Models (LVLMs) by introducing CLIP-DPO, a preference optimization method that uses CLIP to rank predictions and fine-tune models, resulting in significant reductions in hallucinations and improved zero-shot classification while preserving benchmark performance.

Despite recent successes, Large Vision-Language Models (LVLMs) are prone to hallucinating details such as objects and their properties or relations, which limits their real-world deployment. To address this and improve their robustness, we present CLIP-DPO, a preference optimization method that leverages contrastively pre-trained Vision-Language (VL) embedding models, such as CLIP, for DPO-based optimization of LVLMs. Unlike prior works tackling LVLM hallucinations, our method does not rely on paid-for APIs, and does not require additional training data or the deployment of other external LVLMs. Instead, starting from the initial pool of supervised fine-tuning data, we generate a diverse set of predictions, which are ranked based on their CLIP image-text similarities and then filtered using a robust rule-based approach to obtain a set of positive and negative pairs for DPO-based training. We applied CLIP-DPO fine-tuning to the MobileVLM-v2 family of models and to LLaVA-1.5, in all cases observing significant improvements in hallucination reduction over the baseline models. We also observe better performance for zero-shot classification, suggesting improved grounding capabilities, and verify that the original performance on standard LVLM benchmarks is overall preserved.
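
The ranking-and-pairing step described in the abstract can be illustrated with a minimal sketch. It assumes the image-text similarity scores have already been computed (for example, by a CLIP model) and are passed in as plain floats. The names `build_pair`, `PreferencePair`, and the `min_margin` threshold are hypothetical; the margin check is only a stand-in for the paper's rule-based filter, whose rules the abstract does not specify. The `dpo_loss` function is the standard per-example DPO objective, not necessarily the exact variant the authors train with.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class PreferencePair:
    chosen: str      # candidate with the highest image-text similarity
    rejected: str    # candidate with the lowest image-text similarity
    margin: float    # score gap between chosen and rejected


def build_pair(
    candidates: Sequence[str],
    scores: Sequence[float],
    min_margin: float = 0.05,  # hypothetical threshold, not from the paper
) -> Optional[PreferencePair]:
    """Rank candidates by similarity score and return a (chosen, rejected) pair.

    Returns None when there are fewer than two candidates or when the score
    gap is too small to be a trustworthy preference signal (assumed rule).
    """
    if len(candidates) != len(scores):
        raise ValueError("candidates and scores must have the same length")
    if len(candidates) < 2:
        return None
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    best_score, best = ranked[0]
    worst_score, worst = ranked[-1]
    margin = best_score - worst_score
    if best == worst or margin < min_margin:
        return None
    return PreferencePair(chosen=best, rejected=worst, margin=margin)


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * log-ratio gap))."""
    x = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(x)) == softplus(-x), written in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    # Toy scores standing in for CLIP similarities of three generated captions.
    pair = build_pair(
        ["a cat on a sofa", "a dog on a sofa", "a cat"],
        [0.31, 0.18, 0.27],
    )
    print(pair)
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

In this sketch, only the top- and bottom-ranked candidates form a pair; a fuller pipeline might build several pairs per image or apply additional filters before DPO training.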

