CL AI

Dec 21, 2024

MERaLiON-TextLLM: Cross-Lingual Understanding of Large Language Models in Chinese, Indonesian, Malay, and Singlish

Xin Huang, Tarun Kumar Vangani, Minh Duc Pham, Xunlong Zou, Bin Wang, Zhengyuan Liu, Ai Ti Aw

arXiv:2501.08335v39.64 citations

h-index: 16

Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses cross-lingual understanding for researchers and developers working with under-resourced languages. The contribution is incremental, since it builds on existing models.

The authors address the uneven performance of multilingual large language models on languages with limited resources. They develop MERaLiON-TextLLM, a model series tailored to Chinese, Indonesian, Malay, and Singlish, which outperforms the official Llama-3 models on benchmarks in these languages.

Multilingual large language models (MLLMs) have shown impressive capabilities across a variety of languages. However, efficacy can differ greatly between different language families, especially for those with limited linguistic resources. This report presents MERaLiON-TextLLM, a series of open-source language models specifically tailored to improve understanding and generation in Chinese, Indonesian, Malay, and Singlish. The initial released model is built on Llama-3-8B-Base and refined through a meticulously crafted process of continued pre-training and weight merging. Our approach achieves performance improvements across benchmarks in these languages, exceeding the capabilities of the official Llama-3 models. We provide the model checkpoints as a resource to support further research and development in cross-lingual language understanding.
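
The abstract mentions weight merging but does not say which merging scheme was used. As a minimal illustration only, the sketch below shows one common variant, linear interpolation between two checkpoints with the same architecture. The function name `merge_weights`, the mixing coefficient `alpha`, and the toy parameter names are all hypothetical and are not taken from the paper; real checkpoints would hold tensors rather than Python lists.

```python
from typing import Dict, List

# Toy stand-in for a checkpoint: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def merge_weights(base: Weights, tuned: Weights, alpha: float = 0.5) -> Weights:
    """Linearly interpolate two checkpoints with identical parameter names and sizes.

    alpha = 0.0 returns the base weights, alpha = 1.0 returns the tuned weights.
    This is an assumed, generic merging scheme, not the authors' documented method.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if base.keys() != tuned.keys():
        raise ValueError("checkpoints must share the same parameter names")

    merged: Weights = {}
    for name, base_values in base.items():
        tuned_values = tuned[name]
        if len(base_values) != len(tuned_values):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Element-wise convex combination of the two parameter vectors.
        merged[name] = [
            (1.0 - alpha) * b + alpha * t
            for b, t in zip(base_values, tuned_values)
        ]
    return merged


# Hypothetical toy checkpoints, for illustration only.
base = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
tuned = {"layer.weight": [1.0, 4.0], "layer.bias": [3.0]}
merged = merge_weights(base, tuned, alpha=0.25)
print(merged)
```

A small `alpha` keeps the merged model close to the base checkpoint, which is one plausible way to retain general capabilities while adding language-specific gains; whether the authors used this or a different scheme is not stated in the abstract.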
