CL · AI · Apr 4, 2024

Sailor: Open Language Models for South-East Asia

arXiv:2404.03608v1 · 32 citations · h-index: 15 · Has Code · EMNLP
Originality: Synthesis-oriented
AI Analysis

This work makes open language models accessible to South-East Asian communities, though the contribution is incremental because it builds on existing multilingual models.

The authors address the lack of open language models for South-East Asian languages by developing Sailor, a family of models from 0.5B to 7B parameters. After continual pre-training on 200B to 400B tokens, the models show strong performance on benchmarks such as commonsense reasoning and question answering.

We present Sailor, a family of open language models ranging from 0.5B to 7B parameters, tailored for South-East Asian (SEA) languages. These models are continually pre-trained from Qwen1.5, a strong language model for multilingual use cases. Starting from Qwen1.5, Sailor models are trained on a further 200B to 400B tokens, primarily covering English, Chinese, Vietnamese, Thai, Indonesian, Malay, and Lao. The training leverages several techniques, including BPE dropout to improve model robustness, aggressive data cleaning and deduplication, and small proxy models to optimize the data mixture. Experimental results on four typical tasks indicate that Sailor models perform strongly across different benchmarks, including commonsense reasoning, question answering, reading comprehension, and examination. Embracing the open-source spirit, we share our insights through this report to spark a wider interest in developing large language models for multilingual use cases.
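
The abstract names BPE dropout as one of the robustness techniques. The following is a minimal sketch of the general idea, not Sailor's actual implementation: while segmenting a word, each applicable merge is skipped with probability p, so the same word can be split differently across training passes. The function name `bpe_dropout_segment` and the toy merge table are hypothetical and chosen only for illustration.

```python
import random


def bpe_dropout_segment(word, merges, p=0.1, rng=None):
    """Segment `word` with BPE, dropping each candidate merge with probability p.

    `merges` is an ordered list of symbol pairs; earlier pairs have higher priority.
    This is an illustrative toy, not the tokenizer used for Sailor.
    """
    if rng is None:
        rng = random.Random()
    rank = {pair: i for i, pair in enumerate(merges)}
    tokens = list(word)  # start from single characters
    while True:
        # Adjacent pairs that are in the merge table and survive dropout.
        candidates = [
            (rank[(a, b)], i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
            if (a, b) in rank and rng.random() >= p
        ]
        if not candidates:
            break  # nothing left to merge in this step
        _, i = min(candidates)  # highest-priority surviving merge
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens


# Hypothetical toy merge table.
toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]
deterministic = bpe_dropout_segment("lower", toy_merges, p=0.0)
all_dropped = bpe_dropout_segment("lower", toy_merges, p=1.0)
noisy = bpe_dropout_segment("lower", toy_merges, p=0.5, rng=random.Random(0))
```

With p = 0 the sketch reduces to plain greedy BPE, and with p = 1 it falls back to characters. Intermediate values yield varied segmentations that always concatenate back to the original word, which is what exposes the model to multiple tokenizations of the same text.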

Code Implementations: 3 repos
