CV AI CL LG · Jun 4, 2024

Parrot: Multilingual Visual Instruction Tuning

Hai-Long Sun, Da-Wei Zhou, Yang Li, Shiyin Lu, Chao Yi, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye

arXiv:2406.02539v321.825 citations · Has Code

Originality Highly original

AI Analysis

This work addresses language bias in multimodal AI systems, which matters for global accessibility, though the contribution is incremental in that it builds on existing alignment methods.

The paper tackles multilingual visual instruction tuning in multimodal large language models, where existing methods lose performance on non-English languages because of imbalanced, English-centric datasets. It proposes PARROT, which uses textual guidance and a Mixture-of-Experts (MoE) module to align visual tokens at the language level, and reports state-of-the-art performance on multilingual benchmarks and multimodal tasks.

The rapid development of Multimodal Large Language Models (MLLMs), such as GPT-4o, marks a significant step toward artificial general intelligence. Existing methods typically align vision encoders with LLMs via supervised fine-tuning (SFT), but this often deteriorates their ability to handle multiple languages as training progresses. We empirically observe that imbalanced SFT datasets, largely English-centric, degrade performance on non-English languages due to the failure in multilingual token alignment. To address this, we propose PARROT, a novel approach that leverages textual guidance for visual token alignment at the language level. PARROT conditions visual tokens on diverse language inputs and uses Mixture-of-Experts (MoE) to align multilingual tokens. By computing cross-attention between initial visual features and textual embeddings, we select the most relevant experts, converting visual tokens into language-specific representations. Additionally, we introduce the Massive Multilingual Multimodal Benchmark (MMMB), a new benchmark comprising 6 languages, 15 categories, and 12,000 questions, to assess multilingual capabilities. PARROT achieves state-of-the-art performance on both the multilingual benchmarks and a wide range of multimodal tasks. Code and dataset are available at: https://github.com/AIDC-AI/Parrot
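The routing idea in the abstract (cross-attention between visual features and textual embeddings, followed by expert selection) can be illustrated with a minimal sketch. This is one possible reading of the abstract, not the authors' implementation. The mean pooling of visual tokens, the soft (softmax) gating in place of hard expert selection, the linear experts, the residual connection, and every name and dimension below are assumptions made for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(mat, vec):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in mat]


def cross_attention(query, keys):
    """Single-head attention of one query over text vectors.

    Assumption: the text embeddings serve as both keys and values.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    return [sum(w * key[i] for w, key in zip(weights, keys))
            for i in range(len(query))]


def route_visual_tokens(visual_tokens, text_embeddings, router, experts):
    """Condition visual tokens on the text via softly gated experts."""
    dim = len(visual_tokens[0])
    # Assumption: mean-pool the visual tokens to form the attention query.
    pooled = [sum(tok[i] for tok in visual_tokens) / len(visual_tokens)
              for i in range(dim)]
    guided = cross_attention(pooled, text_embeddings)
    # Router scores -> one gate weight per expert (soft selection).
    gates = softmax(matvec(router, guided))
    converted = []
    for tok in visual_tokens:
        expert_outs = [matvec(expert, tok) for expert in experts]
        mixed = [sum(g * out[i] for g, out in zip(gates, expert_outs))
                 for i in range(dim)]
        # Assumption: a residual connection keeps the original visual token.
        converted.append([t + m for t, m in zip(tok, mixed)])
    return converted, gates


def random_matrix(rng, rows, cols):
    """Hypothetical helper: small random weights for the demo."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim, num_experts = 8, 4  # arbitrary illustrative sizes
visual_tokens = random_matrix(rng, 5, dim)    # 5 visual tokens
text_embeddings = random_matrix(rng, 3, dim)  # 3 text token embeddings
router = random_matrix(rng, num_experts, dim)
experts = [random_matrix(rng, dim, dim) for _ in range(num_experts)]

converted, gates = route_visual_tokens(
    visual_tokens, text_embeddings, router, experts)
```

Here `gates` holds one weight per expert and sums to 1, and `converted` has the same shape as `visual_tokens`. A hard top-k selection could replace the soft gating by zeroing all but the largest gate weights and renormalizing.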

