CL
Apr 23, 2024

Comparison of Current Approaches to Lemmatization: A Case Study in Estonian

arXiv:2404.15003v1
248 citations
h-index: 13
NODALIDA
Originality: Synthesis-oriented
AI Analysis

This work addresses lemmatization for Estonian and represents an incremental improvement in a domain-specific NLP task.

The study compared three lemmatization approaches for Estonian and found that a smaller generative model outperformed a pattern-based classification model, with small error overlaps suggesting ensemble methods could improve results.

This study evaluates three different lemmatization approaches to Estonian -- Generative character-level models, Pattern-based word-level classification models, and rule-based morphological analysis. According to our experiments, a significantly smaller Generative model consistently outperforms the Pattern-based classification model based on EstBERT. Additionally, we observe a relatively small overlap in errors made by all three models, indicating that an ensemble of different approaches could lead to improvements.
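The error-overlap observation can be illustrated with a minimal sketch. It assumes each system's output is available as a list of predicted lemmas aligned token by token with the gold lemmas. The function names, the toy predictions, and the majority-vote combination are hypothetical illustrations, not the paper's evaluation code or its proposed ensemble.

```python
from collections import Counter


def error_set(predictions, gold):
    """Return the token indices where the predicted lemma differs from gold."""
    return {i for i, (p, g) in enumerate(zip(predictions, gold)) if p != g}


def majority_vote(*system_outputs):
    """Per-token majority vote; with no majority, keep the first system's lemma."""
    voted = []
    for candidates in zip(*system_outputs):
        top, count = Counter(candidates).most_common(1)[0]
        voted.append(top if count > 1 else candidates[0])
    return voted


# Toy data invented for illustration only (not results from the paper).
gold = ["maja", "koer", "raamat", "minema", "ilus"]
generative = ["maja", "koer", "raamat", "minna", "ilus"]
pattern_based = ["maja", "koera", "raamat", "minema", "ilus"]
rule_based = ["maja", "koer", "raamatu", "minema", "ilusa"]

errors = {
    "generative": error_set(generative, gold),
    "pattern_based": error_set(pattern_based, gold),
    "rule_based": error_set(rule_based, gold),
}

# Tokens that every system gets wrong; a small set leaves room for an ensemble.
shared_errors = set.intersection(*errors.values())

ensemble = majority_vote(generative, pattern_based, rule_based)
ensemble_errors = error_set(ensemble, gold)
```

In this toy setup no token is missed by all three systems, so a simple vote recovers every gold lemma. On real data the gain depends on how often two systems agree on the same wrong lemma.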

