CL · Jun 13, 2023

Hybrid lemmatization in HuSpaCy

arXiv:2306.07636v1 · 1 citation · h-index: 23
Originality: Synthesis-oriented
AI Analysis

This work addresses lemmatization challenges for Hungarian and offers an incremental improvement over existing hybrid methods.

The paper tackles lemmatization for morphologically rich languages by presenting a hybrid lemmatizer that combines a neural model, dictionaries, and hand-crafted rules. It reports empirical results on a Hungarian dataset and releases three HuSpaCy models.

Lemmatization is still not a trivial task for morphologically rich languages. Previous studies have shown that hybrid architectures usually work better for these languages and can yield strong results. This paper presents a hybrid lemmatizer that combines a neural model, dictionaries, and hand-crafted rules. We introduce the hybrid architecture along with empirical results on a widely used Hungarian dataset. The presented methods are published as three HuSpaCy models.
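To make the idea of a hybrid lemmatizer concrete, the following is a minimal Python sketch of one way a dictionary, a hand-crafted rule, and a model could be combined. It is not the paper's architecture or HuSpaCy's code: the lookup order, the toy dictionary `LEMMA_DICT`, the single rule in `rule_lemma`, and the placeholder `neural_stub` (which stands in for a trained model) are all assumptions made for illustration.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical toy dictionary keyed by (lowercased form, POS tag).
# The entries are illustrative and not taken from the paper's resources.
LEMMA_DICT: Dict[Tuple[str, str], str] = {
    ("házak", "NOUN"): "ház",   # "houses" -> "house"
    ("volt", "VERB"): "van",    # "was" -> "is"
}


def rule_lemma(token: str, pos: str) -> Optional[str]:
    """Hypothetical hand-crafted rule: a digit-only numeral is its own lemma."""
    if pos == "NUM" and token.isdigit():
        return token
    return None


def neural_stub(token: str, pos: str) -> str:
    """Placeholder for a trained neural lemmatizer; it only lowercases."""
    return token.lower()


def hybrid_lemmatize(
    token: str,
    pos: str,
    model: Callable[[str, str], str] = neural_stub,
) -> str:
    """Try the dictionary, then the rules, then fall back to the model.

    The ordering is an assumption for this sketch; a real system may
    instead use rules or dictionaries to correct the model's output.
    """
    lemma = LEMMA_DICT.get((token.lower(), pos))
    if lemma is not None:
        return lemma
    lemma = rule_lemma(token, pos)
    if lemma is not None:
        return lemma
    return model(token, pos)


if __name__ == "__main__":
    for form, tag in [("Házak", "NOUN"), ("2023", "NUM"), ("Kutya", "NOUN")]:
        print(form, tag, hybrid_lemmatize(form, tag))
```

Because the model is passed in as a callable, the placeholder can be swapped for any real lemmatizer with the same `(token, pos) -> lemma` shape without changing the cascade itself.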

Code Implementations: 1 repo
