CL, Jul 9, 2024

Adapting LLMs to Hebrew: Unveiling DictaLM 2.0 with Enhanced Vocabulary and Instruction Capabilities

Shaltiel Shmidman, Avi Shmidman, Amir DN Cohen, Moshe Koppel

arXiv:2407.07080v17.717 citations, h-index: 8

Originality: Incremental advance

AI Analysis

This work addresses the problem of adapting LLMs to low-resource languages, which is relevant to researchers and practitioners in multilingual NLP. It is incremental in that it builds on an existing model, Mistral.

The paper tackles the challenge of training large language models (LLMs) for low-resource languages such as Hebrew by introducing DictaLM 2.0 and DictaLM 2.0-Instruct, two models derived from Mistral and trained on approximately 200 billion tokens of Hebrew and English. It evaluates them with a new Hebrew benchmark suite covering tasks such as Question Answering, Sentiment Analysis, Translation, and Summarization.

Training large language models (LLMs) in low-resource languages such as Hebrew poses unique challenges. In this paper, we introduce DictaLM2.0 and DictaLM2.0-Instruct, two LLMs derived from the Mistral model, trained on a substantial corpus of approximately 200 billion tokens in both Hebrew and English. Adapting a pre-trained model to a new language involves specialized techniques that differ significantly from training a model from scratch or further training existing models on well-resourced languages such as English. We outline these novel training methodologies, which facilitate effective learning and adaptation to the linguistic properties of Hebrew. Additionally, we fine-tuned DictaLM2.0-Instruct on a comprehensive instruct dataset to enhance its performance on task-specific instructions. To rigorously evaluate our models, we introduce a new benchmark suite for Hebrew LLM evaluation, covering a diverse set of tasks including Question Answering, Sentiment Analysis, Winograd Schema Challenge, Translation, and Summarization. Our work not only addresses the intricacies of training LLMs in low-resource languages but also proposes a framework that can be leveraged for adapting other LLMs to various non-English languages, contributing to the broader field of multilingual NLP.
