LG | AI | CL | CV | Nov 22, 2024

Continual SFT Matches Multimodal RLHF with Negative Supervision

arXiv:2411.14797v1 | 6 citations | h-index: 7 | CVPR
Originality: Incremental advance
AI Analysis

This work addresses the challenge of efficiently aligning large vision-language models for improved comprehension, offering a more memory-efficient alternative to multimodal RLHF.

The paper proposes negative supervised finetuning (nSFT) for the preference alignment stage of vision-language models (VLMs). According to the authors, nSFT matches the performance of multimodal RLHF approaches while being more memory-efficient, as evaluated across various datasets, base VLMs, and evaluation metrics.

Multimodal RLHF usually happens after the supervised finetuning (SFT) stage to continually improve vision-language models' (VLMs) comprehension. Conventional wisdom holds that it is superior to continual SFT during this preference alignment stage. In this paper, we observe that the inherent value of multimodal RLHF lies in its negative supervision, the logit of the rejected responses. We thus propose a novel negative supervised finetuning (nSFT) approach that fully exploits the information this supervision contains. Our nSFT disentangles this negative supervision from the RLHF paradigm and continually aligns VLMs with a simple SFT loss. This is more memory-efficient than multimodal RLHF, where 2 (e.g., DPO) or 4 (e.g., PPO) large VLMs are strictly required. The effectiveness of nSFT is rigorously demonstrated by comparing it with various multimodal RLHF approaches across different dataset sources, base VLMs, and evaluation metrics. In addition, extensive ablations are provided to support our hypothesis. We hope this paper will stimulate further research on properly aligning large vision-language models.
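To make the "negative supervision" point concrete, the sketch below contrasts a plain SFT loss with the standard DPO objective on a single preference pair. It is an illustration of the general RLHF-style loss, not the paper's nSFT method; the function names, the scalar sequence log-probabilities, and the value `beta=0.1` are assumptions chosen for the example. It shows that the rejected response's log-probability enters the DPO loss but not the SFT loss, and that DPO needs log-probabilities from a second (reference) model.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sft_loss(logp_chosen: float) -> float:
    """Plain SFT: negative log-likelihood of the chosen response only.

    Only one model (the policy) is needed, and the rejected response
    plays no role.
    """
    return -logp_chosen


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value, for illustration only
) -> float:
    """Standard DPO loss for one preference pair.

    The ref_* arguments come from a frozen reference model, which is why
    DPO keeps two models in memory. The logp_rejected term is the
    negative supervision: raising it increases the loss.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)) == softplus(-margin)
    return softplus(-margin)


if __name__ == "__main__":
    # Hypothetical sequence log-probabilities.
    base = dpo_loss(-10.0, -12.0, -10.0, -12.0)
    worse = dpo_loss(-10.0, -8.0, -10.0, -12.0)  # policy likes the rejected answer more
    print("DPO loss, rejected log-prob unchanged:", base)
    print("DPO loss, rejected log-prob raised:   ", worse)
    print("SFT loss (ignores rejected response): ", sft_loss(-10.0))
```

Under these assumptions, the only path by which the rejected response influences training is the `logp_rejected` term, which is the signal the abstract says nSFT aims to recover with an SFT-style loss and a single model.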

