CL | CV | Feb 22, 2024

PALO: A Polyglot Large Multimodal Model for 5B People

arXiv:2402.14818v2 | 31 citations | h-index: 55 | Has Code | WACV
AI Analysis

This paper addresses the underrepresentation of many widely spoken languages in vision-language AI, and represents an incremental advance in multilingual multimodal modeling.

The study tackled the lack of inclusivity in Vision-Language Models by introducing PALO, a polyglot large multimodal model that provides visual reasoning capabilities in 10 major languages covering about 5 billion people, achieving substantial improvements over strong baselines across three parameter scales.

In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces a Large Multilingual Multimodal Model called PALO. PALO offers visual reasoning capabilities in 10 major languages (English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese) that together span ~5B people (65% of the world population). Our approach uses a semi-automated translation pipeline to adapt the multimodal instruction dataset from English to the target languages using a fine-tuned Large Language Model, thereby ensuring high linguistic fidelity while remaining scalable thanks to minimal manual effort. The incorporation of diverse instruction sets helps us boost overall performance across multiple languages, especially underrepresented ones such as Hindi, Arabic, Bengali, and Urdu. The resulting models are trained at three scales (1.7B, 7B, and 13B parameters) to demonstrate generalization and scalability, and we observe substantial improvements over strong baselines. We also propose the first multilingual multimodal benchmark for future approaches to evaluate their vision-language reasoning capabilities across languages. Code: https://github.com/mbzuai-oryx/PALO.

Code Implementations: 1 repo
