Silkie: Preference Distillation for Large Visual Language Models
This work targets response quality and hallucination in large vision-language models for multimodal AI applications. It represents a strong, specific gain rather than a foundational advance.
The paper uses preference distillation to improve the ability of large vision-language models to generate helpful and faithful responses. The resulting model, Silkie, achieves 6.9% and 9.5% relative improvements on the perception and cognition portions of the MME benchmark, and it sets a new state-of-the-art score of 3.02 on MMHal-Bench, indicating reduced hallucination.
This paper explores preference distillation for large vision-language models (LVLMs), improving their ability to generate helpful and faithful responses anchored in the visual context. We first build a vision-language feedback (VLFeedback) dataset using AI annotation. Specifically, responses are generated by models sampled from 12 LVLMs, conditioned on multi-modal instructions sourced from various datasets. We adopt GPT-4V to assess the generated outputs for helpfulness, visual faithfulness, and ethical considerations. The preference supervision is then distilled into Qwen-VL-Chat through the direct preference optimization (DPO) method. The resulting model, Silkie, achieves 6.9% and 9.5% relative improvements on the MME benchmark in perception and cognition capabilities, respectively. Silkie also demonstrates reduced hallucination by setting a new state-of-the-art score of 3.02 on the MMHal-Bench benchmark. Further analysis shows that DPO with our VLFeedback dataset mainly boosts the fine-grained perception and complex cognition abilities of LVLMs, leading to more comprehensive improvements than human-annotated preference datasets.
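
To make the DPO step concrete, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not the authors' training code. The function name `dpo_loss`, its parameter names, and the value `beta=0.1` are illustrative assumptions. Each log-probability is assumed to be the summed token log-likelihood of a response given the image and instruction, under either the model being trained (the policy) or the frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    All names and the default beta are illustrative assumptions,
    not values taken from the paper.
    """
    # How much more (in log space) the policy favors each response
    # than the frozen reference model does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled margin between the preferred and dispreferred response.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) computed as a numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities: the policy already favors the chosen
# response more than the reference does, so the loss falls below log(2).
example = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

When the policy and reference agree exactly, the margin is zero and the loss equals log 2. Training lowers the loss by raising the policy's relative likelihood of the preferred response, here the one GPT-4V rated higher, over the dispreferred one.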