CL · Feb 2

Dicta-LM 3.0: Advancing The Frontier of Hebrew Sovereign LLMs

Shaltiel Shmidman, Avi Shmidman, Amir DN Cohen, Moshe Koppel

arXiv:2602.02104v1 · 1.62 citations · h-index: 6

Originality: Synthesis-oriented

AI Analysis

This work addresses the demand for Hebrew language models, which matters to Hebrew-speaking users and researchers, though the contribution is incremental because the models are adapted from existing base models.

The authors address the short supply of sovereign LLMs for low-resource languages such as Hebrew by introducing Dicta-LM 3.0, an open-weight collection of Hebrew LLMs in three sizes (24B, 12B, and 1.7B), each with a 65k-token context length. They also create a new benchmark suite for Hebrew chat-LLMs covering tasks such as Translation, Summarization, and Diacritization.

Open-weight LLMs have been released by frontier labs; however, sovereign Large Language Models (for languages other than English) remain low in supply yet high in demand. Training large language models (LLMs) for low-resource languages such as Hebrew poses unique challenges. In this paper, we introduce Dicta-LM 3.0: an open-weight collection of LLMs trained on substantially-sized corpora of Hebrew and English texts. The model is released in three sizes: 24B (adapted from the Mistral-Small-3.1 base model), 12B (adapted from the NVIDIA Nemotron Nano V2 model), and 1.7B (adapted from the Qwen3-1.7B base model). We are releasing multiple variants of each model, each with a native context length of 65k tokens: a base model and a chat model with tool-calling support. To rigorously evaluate our models, we introduce a new benchmark suite for evaluation of Hebrew chat-LLMs, covering a diverse set of tasks including Translation, Summarization, Winograd, Israeli Trivia, and Diacritization (nikud). Our work not only addresses the intricacies of training LLMs in low-resource languages but also proposes a framework that can be leveraged for adapting other LLMs to various non-English languages, contributing to the broader field of multilingual NLP.
