CLOct 24, 2025

HalleluBERT: Let every token that has meaning bear its weight

arXiv:2510.21372v11 citationsh-index: 2
Originality Synthesis-oriented
AI Analysis

This addresses the problem of limited Hebrew NLP resources for researchers and practitioners, but it is incremental as it applies an existing method to a new language-specific dataset.

The authors tackled the lack of a large-scale RoBERTa encoder for Hebrew by training HalleluBERT from scratch on 49.1 GB of Hebrew text, and it outperformed existing models on NER and sentiment classification benchmarks, setting a new state of the art for Hebrew.

Transformer-based models have advanced NLP, yet Hebrew still lacks a large-scale RoBERTa encoder which is extensively trained. Existing models such as HeBERT, AlephBERT, and HeRo are limited by corpus size, vocabulary, or training depth. We present HalleluBERT, a RoBERTa-based encoder family (base and large) trained from scratch on 49.1~GB of deduplicated Hebrew web text and Wikipedia with a Hebrew-specific byte-level BPE vocabulary. Evaluated on NER and sentiment classification benchmarks, HalleluBERT outperforms both monolingual and multilingual baselines. HalleluBERT sets a new state of the art for Hebrew and highlights the benefits of fully converged monolingual pretraining.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes