CL, Nov 1, 2023

Explicit Morphological Knowledge Improves Pre-training of Language Models for Hebrew

Eylon Gueta, Omer Goldman, Reut Tsarfaty

arXiv:2311.00658v1, 3.68 citations, h-index: 30

Originality: Incremental advance

AI Analysis

This work targets language modeling for morphologically rich languages such as Hebrew, offering an incremental improvement over standard language-agnostic tokenization.

The authors address the sub-optimal performance of pre-trained language models on morphologically rich languages by building explicit morphological knowledge into pre-training through morphologically driven tokenization. Compared with standard language-agnostic tokenization, this improves results on a Hebrew benchmark of semantic and morphological tasks.

Pre-trained language models (PLMs) have shown remarkable successes in acquiring a wide range of linguistic knowledge, relying solely on self-supervised training on text streams. Nevertheless, the effectiveness of this language-agnostic approach has been frequently questioned for its sub-optimal performance when applied to morphologically-rich languages (MRLs). We investigate the hypothesis that incorporating explicit morphological knowledge in the pre-training phase can improve the performance of PLMs for MRLs. We propose various morphologically driven tokenization methods enabling the model to leverage morphological cues beyond raw text. We pre-train multiple language models utilizing the different methods and evaluate them on Hebrew, a language with complex and highly ambiguous morphology. Our experiments show that morphologically driven tokenization demonstrates improved results compared to a standard language-agnostic tokenization, on a benchmark of both semantic and morphologic tasks. These findings suggest that incorporating morphological knowledge holds the potential for further improving PLMs for morphologically rich languages.
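To make the idea of morphologically driven tokenization concrete, here is a minimal Python sketch. It is not the authors' method: the abstract does not describe their tokenizers in detail, so everything below is an assumption made for illustration. The hand-written lexicon `TOY_SEGMENTATIONS` stands in for a real morphological analyzer, `TOY_VOCAB` is a made-up subword vocabulary, and the function names are hypothetical. The sketch contrasts a greedy longest-match subword split applied to the raw word with the same split applied only inside morpheme boundaries.

```python
# Illustrative sketch only. The lexicon and vocabulary are invented for this
# example; a real system would use a morphological analyzer and a learned vocab.

# Hypothetical morpheme segmentations (surface word -> list of morphemes).
TOY_SEGMENTATIONS = {
    "ובבית": ["ו", "ב", "בית"],  # "and" + "in" + "house"
    "לבית": ["ל", "בית"],        # "to" + "house"
}

# Hypothetical subword vocabulary, including pieces that cross morpheme boundaries.
TOY_VOCAB = {"ו", "ב", "ל", "בית", "ובב", "ית"}


def greedy_subwords(word, vocab):
    """Split a word by greedy longest-match-first lookup in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is a known vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No vocabulary entry matches at this position.
            return ["[UNK]"]
        pieces.append(word[start:end])
        start = end
    return pieces


def morph_then_subwords(word, segmentations, vocab):
    """Segment into morphemes first, then split each morpheme into subwords."""
    morphemes = segmentations.get(word, [word])  # fall back to the whole word
    pieces = []
    for morpheme in morphemes:
        pieces.extend(greedy_subwords(morpheme, vocab))
    return pieces


if __name__ == "__main__":
    word = "ובבית"
    print("language-agnostic:", greedy_subwords(word, TOY_VOCAB))
    print("morphology-aware: ", morph_then_subwords(word, TOY_SEGMENTATIONS, TOY_VOCAB))
```

With this toy vocabulary, the language-agnostic split fuses the two prefixes with the first letter of the stem, while the morphology-aware split keeps each prefix and the stem as separate tokens. Hebrew segmentation is ambiguous in practice, which a fixed lookup table like this one cannot capture.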
