CL · Nov 15, 2024

MLAN: Language-Based Instruction Tuning Preserves and Transfers Knowledge in Multimodal Language Models

arXiv:2411.10557v3 · 4 citations · h-index: 15 · Proceedings of the 3rd Workshop on Towards Knowledgeable Foundation Models (KnowFM)
Originality: Incremental advance
AI Analysis

This work addresses efficiency and knowledge-transfer challenges in multimodal AI training, though it appears incremental because it builds on existing instruction tuning frameworks.

The paper aims to improve zero-shot task generalization in multimodal large language models. It proposes a text-heavy visual instruction tuning approach that performs on par with vision-heavy methods across 12 datasets while using as little as half the training tokens.

We present a novel visual instruction tuning strategy to improve the zero-shot task generalization of multimodal large language models by building a firm text-only knowledge base. Existing work lacks sufficient experimentation on the importance of each modality in the instruction tuning stage, often using a majority of vision-language data while keeping text-only data limited and fixing mixtures of modalities. By incorporating diverse text-only data in the visual instruction tuning stage, we vary vision-language data in various controlled experiments to investigate the importance of modality in visual instruction tuning. Our comprehensive evaluation shows that the text-heavy instruction tuning approach is able to perform on par with traditional vision-heavy mixtures on both modalities across 12 general datasets while using as few as half the total training tokens. We find that simply increasing sufficiently diverse text-only data enables transfer of instruction following ability and domain knowledge across modalities while being more efficient than the vision-language approach.
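As a rough illustration of what a "text-heavy" mixture under a fixed token budget could look like, the sketch below samples text-only and vision-language examples so that a chosen fraction of the training tokens comes from text. It is not the authors' procedure: the `Example` type, the `build_mixture` and `token_share` helpers, the precomputed `n_tokens` field, and the 0.75 ratio are all hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    modality: str   # "text" or "vision_language"
    n_tokens: int   # assumed to be precomputed per example (hypothetical field)


def build_mixture(text_pool, vl_pool, text_fraction, token_budget, seed=0):
    """Sample examples so that about `text_fraction` of the token budget is text-only."""
    if not 0.0 <= text_fraction <= 1.0:
        raise ValueError("text_fraction must be in [0, 1]")
    rng = random.Random(seed)
    budgets = {
        "text": text_fraction * token_budget,
        "vision_language": (1.0 - text_fraction) * token_budget,
    }
    mixture = []
    for modality, pool in (("text", text_pool), ("vision_language", vl_pool)):
        shuffled = list(pool)
        rng.shuffle(shuffled)
        used = 0
        for ex in shuffled:
            # Skip any example that would overshoot this modality's token budget.
            if used + ex.n_tokens > budgets[modality]:
                continue
            mixture.append(ex)
            used += ex.n_tokens
    rng.shuffle(mixture)  # interleave the two modalities
    return mixture


def token_share(mixture, modality):
    """Fraction of the mixture's tokens that belong to `modality`."""
    total = sum(ex.n_tokens for ex in mixture)
    if total == 0:
        return 0.0
    return sum(ex.n_tokens for ex in mixture if ex.modality == modality) / total


# Toy pools: every example is 10 tokens long.
text_pool = [Example("text", 10) for _ in range(100)]
vl_pool = [Example("vision_language", 10) for _ in range(100)]

# A text-heavy mixture: 75% of a 400-token budget goes to text-only data.
mix = build_mixture(text_pool, vl_pool, text_fraction=0.75, token_budget=400)
print(len(mix), token_share(mix, "text"))
```

Sweeping `text_fraction` and `token_budget` in a setup like this is one way to picture the controlled modality experiments the abstract describes. The real data sources and ratios would have to come from the paper itself.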

Code Implementations: 1 repo