CLOct 24, 2025

HalleluBERT: Let every token that has meaning bear its weight

arXiv:2510.21372v11 citationsh-index: 2

Originality Synthesis-oriented

AI Analysis

This addresses the problem of limited Hebrew NLP resources for researchers and practitioners, but it is incremental as it applies an existing method to a new language-specific dataset.

The authors tackled the lack of a large-scale RoBERTa encoder for Hebrew by training HalleluBERT from scratch on 49.1 GB of Hebrew text, and it outperformed existing models on NER and sentiment classification benchmarks, setting a new state of the art for Hebrew.

Transformer-based models have advanced NLP, yet Hebrew still lacks a large-scale RoBERTa encoder which is extensively trained. Existing models such as HeBERT, AlephBERT, and HeRo are limited by corpus size, vocabulary, or training depth. We present HalleluBERT, a RoBERTa-based encoder family (base and large) trained from scratch on 49.1~GB of deduplicated Hebrew web text and Wikipedia with a Hebrew-specific byte-level BPE vocabulary. Evaluated on NER and sentiment classification benchmarks, HalleluBERT outperforms both monolingual and multilingual baselines. HalleluBERT sets a new state of the art for Hebrew and highlights the benefits of fully converged monolingual pretraining.

View on arXiv PDF

Similar