Parrot: Multilingual Visual Instruction Tuning
PARROT addresses language bias in multimodal AI systems, a problem that matters for global accessibility. The contribution is incremental, since it builds on existing alignment methods.
The paper tackles multilingual visual instruction tuning in multimodal large language models. Existing methods degrade performance on non-English languages because their training datasets are imbalanced. The paper proposes PARROT, which uses textual guidance and Mixture-of-Experts for alignment, and reports state-of-the-art performance on multilingual benchmarks and multimodal tasks.
The rapid development of Multimodal Large Language Models (MLLMs), such as GPT-4o, marks a significant step toward artificial general intelligence. Existing methods typically align vision encoders with LLMs via supervised fine-tuning (SFT), but this often deteriorates their ability to handle multiple languages as training progresses. We empirically observe that imbalanced SFT datasets, largely English-centric, degrade performance on non-English languages due to the failure in multilingual token alignment. To address this, we propose PARROT, a novel approach that leverages textual guidance for visual token alignment at the language level. PARROT conditions visual tokens on diverse language inputs and uses Mixture-of-Experts (MoE) to align multilingual tokens. By computing cross-attention between initial visual features and textual embeddings, we select the most relevant experts, converting visual tokens into language-specific representations. Additionally, we introduce the Massive Multilingual Multimodal Benchmark (MMMB), a new benchmark comprising 6 languages, 15 categories, and 12,000 questions, to assess multilingual capabilities. PARROT achieves state-of-the-art performance on both the multilingual benchmarks and a wide range of multimodal tasks. Code and dataset are available at: https://github.com/AIDC-AI/Parrot
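The alignment step described in the abstract (cross-attention between visual features and textual embeddings, followed by expert selection) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the mean pooling before routing, the soft gating over all experts, the linear experts, and the residual connection are all assumptions made for illustration, and the tensor sizes are arbitrary.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(visual, text):
    # visual: (Nv, d) used as queries; text: (Nt, d) used as keys and values.
    # Returns text-guided features of shape (Nv, d).
    d = visual.shape[-1]
    scores = visual @ text.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ text


def moe_align(visual, text, router_w, expert_ws):
    # Hypothetical helper, not taken from the paper.
    # router_w: (d, E) routing weights; expert_ws: (E, d, d) linear experts.
    guided = cross_attention(visual, text)   # (Nv, d)
    pooled = guided.mean(axis=0)             # (d,), assumption: mean pooling
    gate = softmax(pooled @ router_w)        # (E,), soft weights over experts
    # Apply every expert to the visual tokens, then mix by the gate weights.
    expert_out = np.einsum("nd,edk->enk", visual, expert_ws)  # (E, Nv, d)
    mixed = np.tensordot(gate, expert_out, axes=1)            # (Nv, d)
    # Assumption: residual connection keeps the original visual content.
    return visual + mixed, gate


rng = np.random.default_rng(0)
visual = rng.normal(size=(4, 8))             # 4 visual tokens, width 8
text = rng.normal(size=(3, 8))               # 3 text-embedding tokens
router_w = rng.normal(size=(8, 6))           # 6 experts (arbitrary choice)
expert_ws = rng.normal(size=(6, 8, 8)) * 0.1
aligned, gate = moe_align(visual, text, router_w, expert_ws)
print(aligned.shape, gate.shape)
```

Because the gate is computed from text-conditioned features, prompts in different languages yield different expert mixtures for the same image, which is the intuition behind converting visual tokens into language-specific representations.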