CL | AI | LG | May 21, 2024

Tagengo: A Multilingual Chat Dataset

arXiv:2405.12612v1 | 24 citations | h-index: 1 | Has Code | MRL
Originality: Incremental advance
AI Analysis

This work addresses the problem of limited language accessibility in LLMs for users of less common languages, though it is incremental as it builds on existing models and datasets.

The authors tackled the lack of multilingual chat capabilities in open-source LLMs by creating a dataset of over 70k prompt-response pairs in 74 languages and training a model that outperforms previous state-of-the-art open-source LLMs on MT-Bench benchmarks in 6 languages.

Open source large language models (LLMs) have shown great improvements in recent times. However, many of these models are focused solely on popular spoken languages. We present a high quality dataset of more than 70k prompt-response pairs in 74 languages which consist of human generated prompts and synthetic responses. We use this dataset to train a state-of-the-art open source English LLM to chat multilingually. We evaluate our model on MT-Bench chat benchmarks in 6 languages, finding that our multilingual model outperforms previous state-of-the-art open source LLMs across each language. We further find that training on more multilingual data is beneficial to the performance in a chosen target language (Japanese) compared to simply training on only data in that language. These results indicate the necessity of training on large amounts of high quality multilingual data to make a more accessible LLM.

Code Implementations: 1 repo
