CV · Aug 19, 2024

CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs

arXiv:2408.10433v1 · 54 citations · h-index: 32
Originality: Incremental advance
AI Analysis

The paper addresses robustness issues that limit real-world deployment of LVLMs, though the advance is incremental because it builds on existing DPO and CLIP methods.

The paper tackles hallucinations in Large Vision-Language Models (LVLMs) by introducing CLIP-DPO, a preference optimization method that uses CLIP to rank predictions and fine-tune models, resulting in significant reductions in hallucinations and improved zero-shot classification while preserving benchmark performance.

Despite recent successes, Large Vision-Language Models (LVLMs) are prone to hallucinating details such as objects and their properties or relations, which limits their real-world deployment. To address this and improve their robustness, we present CLIP-DPO, a preference optimization method that leverages contrastively pre-trained Vision-Language (VL) embedding models, such as CLIP, for DPO-based optimization of LVLMs. Unlike prior works tackling LVLM hallucinations, our method does not rely on paid-for APIs, and does not require additional training data or the deployment of other external LVLMs. Instead, starting from the initial pool of supervised fine-tuning data, we generate a diverse set of predictions, which are ranked based on their CLIP image-text similarities, and then filtered using a robust rule-based approach to obtain a set of positive and negative pairs for DPO-based training. We applied CLIP-DPO fine-tuning to the MobileVLM-v2 family of models and to LLaVA-1.5, in all cases observing significant improvements in terms of hallucination reduction over baseline models. We also observe better performance for zero-shot classification, suggesting improved grounding capabilities, and verify that the original performance on standard LVLM benchmarks is overall preserved.
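The rank-then-filter step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the filtering rules, so the best-with-worst pairing scheme, the `min_margin` threshold, and the names `Candidate`, `build_preference_pairs`, and `dpo_loss` are all assumptions made for illustration. CLIP similarities are taken as precomputed numbers, and the loss is the standard per-pair DPO objective rather than anything specific to this paper.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    text: str          # one generated prediction for a given image and prompt
    clip_score: float  # precomputed CLIP image-text similarity (assumed given)


def build_preference_pairs(
    candidates: List[Candidate],
    min_margin: float = 0.05,  # hypothetical threshold, not taken from the paper
) -> List[Tuple[Candidate, Candidate]]:
    """Rank candidates by CLIP score, pair best with worst (then second best
    with second worst, and so on), and keep only pairs whose score gap is at
    least min_margin. Returns (chosen, rejected) tuples."""
    ranked = sorted(candidates, key=lambda c: c.clip_score, reverse=True)
    pairs = []
    lo, hi = 0, len(ranked) - 1
    while lo < hi:
        chosen, rejected = ranked[lo], ranked[hi]
        # Illustrative rule-based filter: drop pairs that CLIP barely separates.
        if chosen.clip_score - rejected.clip_score >= min_margin:
            pairs.append((chosen, rejected))
        lo += 1
        hi -= 1
    return pairs


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * (log-ratio of the
    chosen response minus log-ratio of the rejected response)))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable evaluation of -log(sigmoid(margin)).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    pool = [
        Candidate("a dog on a sofa", 0.31),
        Candidate("a cat on a sofa next to a lamp", 0.22),
        Candidate("a dog lying on a couch", 0.29),
        Candidate("a dog on a sofa with a ball", 0.27),
    ]
    for chosen, rejected in build_preference_pairs(pool):
        print("chosen:", chosen.text, "| rejected:", rejected.text)
```

In a real pipeline the scores would come from a CLIP model and the log-probabilities from the policy and a frozen reference LVLM; here they are plain floats so the ranking and filtering logic can be read in isolation.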

