HalleluBERT: Let every token that has meaning bear its weight
This addresses the problem of limited Hebrew NLP resources for researchers and practitioners, but it is incremental as it applies an existing method to a new language-specific dataset.
The authors tackled the lack of a large-scale RoBERTa encoder for Hebrew by training HalleluBERT from scratch on 49.1 GB of Hebrew text, and it outperformed existing models on NER and sentiment classification benchmarks, setting a new state of the art for Hebrew.
Transformer-based models have advanced NLP, yet Hebrew still lacks a large-scale RoBERTa encoder which is extensively trained. Existing models such as HeBERT, AlephBERT, and HeRo are limited by corpus size, vocabulary, or training depth. We present HalleluBERT, a RoBERTa-based encoder family (base and large) trained from scratch on 49.1~GB of deduplicated Hebrew web text and Wikipedia with a Hebrew-specific byte-level BPE vocabulary. Evaluated on NER and sentiment classification benchmarks, HalleluBERT outperforms both monolingual and multilingual baselines. HalleluBERT sets a new state of the art for Hebrew and highlights the benefits of fully converged monolingual pretraining.