CV · CL · Oct 12, 2024

VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment

Peking U
arXiv:2410.09421v2 · 59 citations · h-index: 29 · EMNLP
AI Analysis

The paper addresses the costly, time-intensive need for human supervision when aligning large vision-language models by offering a scalable AI-feedback alternative. Its contribution is incremental: it applies existing methods to a new domain.

The paper tackles the alignment of large vision-language models by introducing VLFeedback, a large-scale AI feedback dataset with over 82K multi-modal instructions. Training a model called Silkie on this data improves performance over its base model by 6.9% and 9.5% on perception and cognition tasks, respectively, while reducing hallucinations and enhancing safety.

As large vision-language models (LVLMs) evolve rapidly, the demand for high-quality and diverse data to align these models becomes increasingly crucial. However, the creation of such data with human supervision proves costly and time-intensive. In this paper, we investigate the efficacy of AI feedback to scale supervision for aligning LVLMs. We introduce VLFeedback, the first large-scale vision-language feedback dataset, comprising over 82K multi-modal instructions and comprehensive rationales generated by off-the-shelf models without human annotations. To evaluate the effectiveness of AI feedback for vision-language alignment, we train Silkie, an LVLM fine-tuned via direct preference optimization on VLFeedback. Silkie showcases exceptional performance regarding helpfulness, visual faithfulness, and safety metrics. It outperforms its base model by 6.9% and 9.5% in perception and cognition tasks, reduces hallucination issues on MMHal-Bench, and exhibits enhanced resilience against red-teaming attacks. Furthermore, our analysis underscores the advantage of AI feedback, particularly in fostering preference diversity to deliver more comprehensive improvements. Our dataset, training code and models are available at https://vlf-silkie.github.io.
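
For readers unfamiliar with direct preference optimization (DPO), the objective named in the abstract, the following is a minimal sketch of the standard per-example DPO loss. It is not the authors' training code: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions chosen for illustration. The inputs are summed log-probabilities of a preferred ("chosen") and a dispreferred ("rejected") response under the model being trained and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair (illustrative sketch).

    All four inputs are sequence log-probabilities; beta is an assumed
    default, not a value taken from the paper.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # Loss is -log(sigmoid(logits)), computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy and the reference agree, the loss equals log 2; it falls below that as the policy shifts probability toward the chosen response relative to the reference, and rises above it when the shift favors the rejected response.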

