LG AI GR | Feb 4, 2025

Diffusion Instruction Tuning

Chen Jin, Ryutaro Tanno, Amrutha Saseendran, Tom Diethe, Philip Teare

arXiv:2502.06814v27.11 citations | h-index: 27 | Has Code | ICML

Originality: Incremental advance

AI Analysis

The method offers a scalable route to more accurate vision-language systems, though the advance is incremental because it builds on existing models and methods.

The paper tackles improving vision-language models by aligning their text-vision attention with Stable Diffusion during fine-tuning, resulting in up to 30% performance gains and a 68% boost on medical QA tasks with minimal training data.

We introduce Lavender, a simple supervised fine-tuning (SFT) method that boosts the performance of advanced vision-language models (VLMs) by leveraging state-of-the-art image generation models such as Stable Diffusion. Specifically, Lavender aligns the text-vision attention in the VLM transformer with the equivalent used by Stable Diffusion during SFT, instead of adapting separate encoders. This alignment enriches the model's visual understanding and significantly boosts performance across in- and out-of-distribution tasks. Lavender requires just 0.13 million training examples, 2.5% of typical large-scale SFT datasets, and fine-tunes on standard hardware (8 GPUs) in a single day. It consistently improves state-of-the-art open-source multimodal LLMs (e.g., Llama-3.2-11B, MiniCPM-Llama3-v2.5), achieving up to 30% gains and a 68% boost on challenging out-of-distribution medical QA tasks. By efficiently transferring the visual expertise of image generators with minimal supervision, Lavender offers a scalable solution for more accurate vision-language systems. All code, training data, and models will be shared at https://astrazeneca.github.io/vlm/.
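
The alignment idea in the abstract can be illustrated with a minimal sketch. It assumes, and the abstract does not state this, that the alignment is a mean-squared-error penalty between the VLM's text-to-image attention map and the corresponding Stable Diffusion attention map, added to the usual next-token loss with a weighting factor. The names `attention_alignment_loss`, `total_sft_loss`, and `align_weight` are hypothetical, and the attention maps are plain nested lists rather than tensors from either model.

```python
from typing import List

Matrix = List[List[float]]  # rows: text tokens, columns: image patches


def normalize_rows(attn: Matrix) -> Matrix:
    """Rescale each row so one token's attention over patches sums to 1."""
    out = []
    for row in attn:
        total = sum(row)
        out.append([v / total for v in row] if total > 0 else [0.0] * len(row))
    return out


def attention_alignment_loss(vlm_attn: Matrix, sd_attn: Matrix) -> float:
    """Mean squared error between two same-shaped attention maps (assumed loss form)."""
    same_rows = len(vlm_attn) == len(sd_attn)
    if not same_rows or any(len(a) != len(b) for a, b in zip(vlm_attn, sd_attn)):
        raise ValueError("attention maps must have the same shape")
    a, b = normalize_rows(vlm_attn), normalize_rows(sd_attn)
    count = sum(len(row) for row in a)
    if count == 0:
        raise ValueError("attention maps must not be empty")
    squared = sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return squared / count


def total_sft_loss(next_token_loss: float, vlm_attn: Matrix, sd_attn: Matrix,
                   align_weight: float = 1.0) -> float:
    """Usual SFT loss plus a weighted alignment term (hypothetical weighting)."""
    return next_token_loss + align_weight * attention_alignment_loss(vlm_attn, sd_attn)
```

In this sketch the Stable Diffusion map acts as a fixed target: when the two maps already agree, the penalty vanishes and only the next-token loss remains.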

