CL · Nov 1, 2023

Explicit Morphological Knowledge Improves Pre-training of Language Models for Hebrew

arXiv:2311.00658v1 · 8 citations · h-index: 30
Originality: Incremental advance
AI Analysis

This work addresses the problem of better language modeling for morphologically-rich languages like Hebrew, offering an incremental improvement over standard approaches.

The authors address the sub-optimal performance of pre-trained language models on morphologically-rich languages by building explicit morphological knowledge into tokenization at pre-training time. On a Hebrew benchmark of semantic and morphologic tasks, the resulting models improve over a standard language-agnostic tokenization.

Pre-trained language models (PLMs) have shown remarkable successes in acquiring a wide range of linguistic knowledge, relying solely on self-supervised training on text streams. Nevertheless, the effectiveness of this language-agnostic approach has been frequently questioned for its sub-optimal performance when applied to morphologically-rich languages (MRLs). We investigate the hypothesis that incorporating explicit morphological knowledge in the pre-training phase can improve the performance of PLMs for MRLs. We propose various morphologically driven tokenization methods enabling the model to leverage morphological cues beyond raw text. We pre-train multiple language models utilizing the different methods and evaluate them on Hebrew, a language with complex and highly ambiguous morphology. Our experiments show that morphologically driven tokenization demonstrates improved results compared to a standard language-agnostic tokenization, on a benchmark of both semantic and morphologic tasks. These findings suggest that incorporating morphological knowledge holds the potential for further improving PLMs for morphologically rich languages.
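
As a rough illustration of what "morphologically driven tokenization" can mean, the sketch below splits Hebrew words into morpheme-like segments before any subword tokenizer would see them. It is a toy under stated assumptions, not the authors' method: the lexicon `TOY_SEGMENTATIONS` and the function `morph_pretokenize` are hypothetical names introduced here, and a real system would need a context-aware morphological analyzer, because Hebrew segmentation is ambiguous and cannot be resolved by a lookup table.

```python
# Hypothetical toy lexicon (illustration only): surface form -> segments.
# A real pipeline would use a context-aware morphological analyzer.
TOY_SEGMENTATIONS = {
    "הבית": ["ה", "בית"],  # "the house": definite article + noun
    "לבית": ["ל", "בית"],  # "to a house": preposition + noun
}


def morph_pretokenize(text, lexicon=TOY_SEGMENTATIONS):
    """Split whitespace-delimited words into morpheme-like segments.

    Words missing from the lexicon are passed through unchanged.
    """
    tokens = []
    for word in text.split():
        tokens.extend(lexicon.get(word, [word]))
    return tokens


if __name__ == "__main__":
    # Prefixes become separate tokens; the unknown word stays whole.
    print(morph_pretokenize("הבית גדול"))
```

A language-agnostic subword tokenizer trained on raw text has no guarantee of cutting at these boundaries, whereas one trained on pre-segmented text sees the prefix and the stem as separate units.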

