CL · Dec 21, 2022

MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning

arXiv:2212.10773v3 · 282 citations · h-index: 20 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the lack of instruction tuning for multimodal tasks. The advance is incremental because it adapts an existing paradigm to a new domain.

The paper extends instruction tuning to vision and multimodal tasks by introducing MULTIINSTRUCT, the first multimodal instruction tuning benchmark dataset, and demonstrates strong zero-shot performance on unseen multimodal tasks with reduced sensitivity to instruction variations.

Instruction tuning, a new learning paradigm that fine-tunes pre-trained language models on tasks specified through instructions, has shown promising zero-shot performance on various natural language processing tasks. However, it has yet to be explored for vision and multimodal tasks. In this work, we introduce MULTIINSTRUCT, the first multimodal instruction tuning benchmark dataset that consists of 62 diverse multimodal tasks in a unified seq-to-seq format covering 10 broad categories. The tasks are derived from 21 existing open-source datasets and each task is equipped with 5 expert-written instructions. We take OFA as the base pre-trained model for multimodal instruction tuning, and to further improve its zero-shot performance, we explore multiple transfer learning strategies to leverage the large-scale NATURAL INSTRUCTIONS dataset. Experimental results demonstrate strong zero-shot performance on various unseen multimodal tasks and the benefit of transfer learning from a text-only instruction dataset. We also design a new evaluation metric, Sensitivity, to evaluate how sensitive the model is to the variety of instructions. Our results indicate that fine-tuning the model on a diverse set of tasks and instructions leads to a reduced sensitivity to variations in instructions for each task.
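
The abstract does not define the Sensitivity metric. As a rough illustration only, the sketch below assumes one plausible formalization: for each task, take the standard deviation of the model's score across that task's instructions, divide by the mean score, and average the ratios over tasks. The function name `sensitivity`, the task names, and every score in the usage example are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def sensitivity(scores_by_task):
    """Average, over tasks, of (std / mean) of per-instruction scores.

    ASSUMPTION: this coefficient-of-variation form is an illustrative
    guess at the metric, not a definition quoted from the abstract.
    scores_by_task maps a task name to a list of scores, one score per
    instruction wording used for that task.
    """
    ratios = []
    for task, scores in scores_by_task.items():
        mu = mean(scores)
        if mu == 0:
            raise ValueError(f"mean score for task {task!r} is zero")
        # Population standard deviation across the instruction variants.
        ratios.append(pstdev(scores) / mu)
    return mean(ratios)


if __name__ == "__main__":
    # Hypothetical scores for two made-up tasks, five instructions each.
    example = {
        "task_a": [50.0, 50.0, 50.0, 50.0, 50.0],  # identical across wordings
        "task_b": [40.0, 60.0, 40.0, 60.0, 50.0],  # varies with wording
    }
    print(sensitivity(example))
```

Under this assumed definition, a lower value means the model's performance depends less on how an instruction is worded, which matches the abstract's claim that diverse tasks and instructions reduce sensitivity.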

Code Implementations: 1 repo
