CV · CL · Jul 3, 2023

Visual Instruction Tuning with Polite Flamingo

NVIDIA
arXiv:2307.01003v2 · 54 citations · h-index: 25
AI Analysis

This paper addresses the reduced human preference for responses from multi-modal LLMs on vision-language tasks, though it appears incremental because it builds on existing fine-tuning approaches.

The paper tackles the 'multi-modal alignment tax' problem where vision-language models lose politeness during fine-tuning due to raw annotations, by introducing Polite Flamingo to rewrite responses into polite formats and creating the PF-1M dataset. The resulting Clever Flamingo model shows improved multi-modal understanding and response politeness in evaluations.

Recent research has demonstrated that multi-task fine-tuning of multi-modal Large Language Models (LLMs) on an assortment of annotated downstream vision-language datasets significantly enhances their performance. Yet, during this process, a side effect, which we term the "multi-modal alignment tax", surfaces. This side effect negatively impacts the model's ability to format responses appropriately (for instance, its "politeness") due to the overly succinct and unformatted nature of raw annotations, resulting in reduced human preference. In this paper, we introduce Polite Flamingo, a multi-modal response rewriter that transforms raw annotations into a more appealing, "polite" format. Polite Flamingo is trained to reconstruct high-quality responses from their automatically distorted counterparts and is subsequently applied to a vast array of vision-language datasets for response rewriting. After rigorous filtering, we generate the PF-1M dataset and further validate its value by fine-tuning a multi-modal LLM with it. Combined with novel methodologies including U-shaped multi-stage tuning and multi-turn augmentation, the resulting model, Clever Flamingo, demonstrates its advantages in both multi-modal understanding and response politeness according to automated and human evaluations.
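
The reconstruction objective described in the abstract (train a rewriter to recover a high-quality response from an automatically distorted version of it) can be illustrated with a minimal sketch. The distortion rules below (keep only the first sentence, lowercase it, drop some words) and the names `distort` and `make_pairs` are hypothetical stand-ins chosen for illustration; the abstract does not specify the paper's actual distortion procedure.

```python
import random
import re


def distort(response: str, rng: random.Random) -> str:
    """Hypothetical distortion that mimics a terse raw annotation.

    Keeps only the first sentence, lowercases it, strips its trailing
    punctuation, and randomly drops some of its words.
    """
    first_sentence = re.split(r"(?<=[.!?])\s+", response.strip())[0]
    words = first_sentence.rstrip(".!?").lower().split()
    # Drop roughly 20% of the words, but always keep at least one.
    kept = [w for w in words if rng.random() > 0.2] or words[:1]
    return " ".join(kept)


def make_pairs(responses, seed=0):
    """Build (distorted input, original target) pairs for a rewriter.

    A rewriter trained on such pairs learns to map terse text back to the
    well-formatted original; this is an assumed setup, not the paper's code.
    """
    rng = random.Random(seed)
    return [{"input": distort(r, rng), "target": r} for r in responses]


if __name__ == "__main__":
    examples = [
        "The image shows a brown dog running across a grassy field. "
        "It appears to be chasing a ball.",
    ]
    for pair in make_pairs(examples):
        print(pair["input"], "->", pair["target"])
```

Once trained on such pairs, a rewriter of this kind could be applied in the forward direction to raw dataset annotations, which is the role the abstract assigns to Polite Flamingo.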

Code Implementations: 2 repos