CL | Nov 15, 2024

Xmodel-1.5: An 1B-scale Multilingual LLM

arXiv:2411.10083v3 | h-index: 1 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for balanced performance and scalability in multilingual AI and is particularly relevant to low-resource language research. Its contribution is incremental: a custom tokenizer and a new evaluation dataset.

The researchers tackled the problem of developing a scalable and efficient multilingual large language model, resulting in Xmodel-1.5, a 1-billion-parameter model that outperforms Alibaba's PolyLM-1.7B on multiple languages and achieves state-of-the-art results in Thai.

We introduce Xmodel-1.5, a 1-billion-parameter multilingual large language model pretrained on 2 trillion tokens, designed for balanced performance and scalability. Unlike most large models that use the BPE tokenizer, Xmodel-1.5 employs a custom unigram tokenizer with 65,280 tokens, optimizing both efficiency and accuracy. The model delivers competitive results across multiple languages, including Thai, Arabic, French, Chinese, and English, outperforming Alibaba's PolyLM-1.7B on respective evaluation datasets. Xmodel-1.5 excels in benchmarks like mMMLU and PIQA, and achieves state-of-the-art results in Thai. To support low-resource language research, we release Xdata_Thai, a Thai-specific evaluation dataset featuring unique linguistic challenges such as gendered particles and idioms. While the model demonstrates strong performance, there is still room for improvement in handling culturally specific nuances. We hope this work contributes to advancements in multilingual AI research. Models and code are publicly available on GitHub at https://github.com/XiaoduoAILab/XmodelLM-1.5
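The abstract contrasts a unigram tokenizer with the more common BPE approach. As a rough illustration of the general idea only, the sketch below segments a string by picking the split with the highest total log-probability under a unigram model, using a Viterbi search. The vocabulary, its probabilities, and the names `TOY_VOCAB` and `unigram_segment` are hypothetical and are not taken from the Xmodel-1.5 tokenizer or its repository.

```python
import math

# Toy vocabulary with made-up log-probabilities (hypothetical values,
# NOT taken from the Xmodel-1.5 tokenizer).
TOY_VOCAB = {
    "uni": math.log(0.12),
    "un": math.log(0.10),
    "gram": math.log(0.08),
    "i": math.log(0.05),
    "a": math.log(0.03),
    "u": math.log(0.02),
    "n": math.log(0.02),
    "m": math.log(0.02),
    "g": math.log(0.01),
    "r": math.log(0.01),
}


def unigram_segment(text, vocab):
    """Return (pieces, score) for the best segmentation of `text`.

    Viterbi search: best[i] is the best total log-probability of any
    segmentation of text[:i]; back[i] is where the last piece starts.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0  # the empty prefix has log-probability 0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in vocab and best[start] + vocab[piece] > best[end]:
                best[end] = best[start] + vocab[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end to recover the pieces.
    pieces = []
    pos = n
    while pos > 0:
        pieces.append(text[back[pos]:pos])
        pos = back[pos]
    return pieces[::-1], best[n]


pieces, score = unigram_segment("unigram", TOY_VOCAB)
print(pieces)
```

With this toy vocabulary, "uni" + "gram" scores higher than "un" + "i" + "gram", so the search prefers the two-piece split. A production unigram tokenizer also learns the vocabulary and its probabilities from data, which this sketch does not attempt.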

Code Implementations: 1 repo