CL | AI | Dec 21, 2023

Typhoon: Thai Large Language Models

arXiv:2312.13951v1 | 35 citations | h-index: 13 | Has Code
Originality: Synthesis-oriented
AI Analysis

Typhoon provides a practical solution for Thai language processing, though the contribution is incremental: it adapts existing methods to a specific domain.

The researchers developed Typhoon, a series of Thai large language models, to address the challenge of low-resource languages by applying continual training from a strong LLM and creating a Thai-specific benchmark. The resulting 7-billion-parameter model outperforms all open-source Thai LLMs, matches GPT-3.5 in Thai performance, and is 2.62 times more efficient in Thai tokenization.

Typhoon is a series of Thai large language models (LLMs) developed specifically for the Thai language. This technical report presents challenges and insights in developing Thai LLMs, including data preparation, pretraining, instruction-tuning, and evaluation. As one of the challenges of low-resource languages is the amount of pretraining data, we apply continual training to transfer existing world knowledge from a strong LLM. To evaluate the Thai knowledge encapsulated in each model from the pretraining stage, we develop ThaiExam, a benchmark based on examinations for high-school students and investment professionals in Thailand. In addition, we fine-tune Typhoon to follow Thai instructions, and we evaluate instruction-tuned models on Thai instruction datasets as well as translation, summarization, and question-answering tasks. Experimental results on a suite of Thai benchmarks show that Typhoon outperforms all open-source Thai language models, and its performance is on par with GPT-3.5 in Thai while having only 7 billion parameters and being 2.62 times more efficient in tokenizing Thai text.
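
The tokenization-efficiency figure can be read as a ratio of token counts: how many tokens a baseline tokenizer needs for some Thai text, divided by how many the candidate tokenizer needs for the same text. The sketch below illustrates that arithmetic only. It assumes this ratio-based definition, and both tokenizers (`byte_tokenizer`, `char_tokenizer`) are hypothetical toy stand-ins, not the tokenizers used in the report, so the resulting number is unrelated to the reported 2.62.

```python
from typing import Callable, List

Tokenizer = Callable[[str], List[str]]


def byte_tokenizer(text: str) -> List[str]:
    """Hypothetical baseline: one token per UTF-8 byte."""
    return [format(b, "02x") for b in text.encode("utf-8")]


def char_tokenizer(text: str) -> List[str]:
    """Hypothetical Thai-aware stand-in: one token per Unicode code point."""
    return list(text)


def efficiency_ratio(
    baseline: Tokenizer, candidate: Tokenizer, texts: List[str]
) -> float:
    """Return baseline token count divided by candidate token count.

    A value above 1.0 means the candidate needs fewer tokens
    than the baseline for the same text.
    """
    baseline_tokens = sum(len(baseline(t)) for t in texts)
    candidate_tokens = sum(len(candidate(t)) for t in texts)
    if candidate_tokens == 0:
        raise ValueError("candidate tokenizer produced no tokens")
    return baseline_tokens / candidate_tokens


# Two short Thai strings used purely as sample input.
samples = ["สวัสดี", "ภาษาไทย"]
ratio = efficiency_ratio(byte_tokenizer, char_tokenizer, samples)
print(f"efficiency ratio: {ratio:.2f}")
```

Thai characters occupy three bytes each in UTF-8, so the byte-level stand-in emits three tokens for every one emitted by the character-level stand-in. To compare real tokenizers, one would pass their encode functions in place of the toy ones and use a representative Thai corpus rather than two short strings.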

