CL | Jul 12, 2023

PolyLM: An Open Source Polyglot Large Language Model

arXiv:2307.06018v1 | 74 citations | h-index: 48 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the need for more inclusive AI by improving multilingual support, though the advance is incremental because it builds on existing LLM frameworks.

The paper tackles the problem of limited multilingual capabilities in large language models by introducing PolyLM, a polyglot model trained on 640B tokens, which outperforms open-source models like LLaMA and BLOOM on multilingual tasks while maintaining comparable English performance.

Large language models (LLMs) demonstrate a remarkable ability to comprehend, reason, and generate text following natural language instructions. However, the development of LLMs has been primarily focused on high-resource languages, such as English, thereby limiting their applicability and research in other languages. Consequently, we present PolyLM, a multilingual LLM trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B. To enhance its multilingual capabilities, we 1) integrate bilingual data into the training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training. Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning. To assess the model's performance, we collect several existing multilingual tasks, including multilingual understanding, question answering, generation, and translation. Extensive experiments show that PolyLM surpasses other open-source models such as LLaMA and BLOOM on multilingual tasks while maintaining comparable performance in English. Our models, along with the instruction data and multilingual benchmark, are available at: https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation.
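
The curriculum described in the abstract can be pictured as a data-mixing schedule that switches the non-English share from 30% to 60% between stages. The Python sketch below is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say where the stage boundary falls, so `stage_boundary=0.8` is a hypothetical value, and the function names `non_english_ratio` and `sample_language_bucket` are invented for this example.

```python
import random


def non_english_ratio(tokens_seen, total_tokens, stage_boundary=0.8,
                      first_ratio=0.30, final_ratio=0.60):
    """Return the target share of non-English data at this point in training.

    first_ratio and final_ratio come from the abstract (30% and 60%).
    stage_boundary is a HYPOTHETICAL fraction of training at which the
    final stage begins; the abstract does not state it.
    """
    if not 0 <= tokens_seen <= total_tokens:
        raise ValueError("tokens_seen must lie in [0, total_tokens]")
    progress = tokens_seen / total_tokens
    return first_ratio if progress < stage_boundary else final_ratio


def sample_language_bucket(tokens_seen, total_tokens, rng):
    """Pick which data bucket the next training example is drawn from."""
    ratio = non_english_ratio(tokens_seen, total_tokens)
    return "non_english" if rng.random() < ratio else "english"


if __name__ == "__main__":
    total = 640_000_000_000  # 640B training tokens, as stated in the abstract
    rng = random.Random(0)
    for seen in (0, total // 2, total):
        share = non_english_ratio(seen, total)
        bucket = sample_language_bucket(seen, total, rng)
        print(f"tokens seen: {seen:>15,}  non-English share: {share:.2f}  next bucket: {bucket}")
```

A step function is the simplest reading of "first stage" and "final stage"; a schedule with more stages or a gradual ramp would fit the same description, and the abstract alone does not settle which one was used.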

Code Implementations: 1 repo