CL · AI · Jul 23, 2022

Context based lemmatizer for Polish language

arXiv:2207.11565v1 · 2 citations · h-index: 9
Originality: Synthesis-oriented
AI Analysis

The paper addresses lemmatization for Polish language processing. It is an incremental improvement that applies an existing method to new data.

The paper tackles lemmatization for Polish by developing a model based on Google T5. The model was trained with varying context lengths, and the authors report the best results for this language.

Lemmatization is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. In computational linguistics, lemmatization is the algorithmic process of determining the lemma of a word based on its intended meaning. Unlike stemming, lemmatization depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence. As a result, developing an efficient lemmatization algorithm is a complex task. In recent years, deep learning models have been observed to outperform other methods for this task, including machine learning algorithms. This paper presents a Polish lemmatizer based on the Google T5 model. Training was run with different context lengths. The model achieves the best results for Polish lemmatization.
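
To make the idea of "context length" concrete, here is a minimal sketch of how an input string for a sequence-to-sequence lemmatizer might be assembled from a target word and a window of neighbouring tokens. The abstract does not describe the actual input format, so the function name `build_input`, the `*` marker around the target word, and the symmetric window are all assumptions made for illustration. No model is loaded here.

```python
from typing import List


def build_input(tokens: List[str], index: int, context: int, marker: str = "*") -> str:
    """Hypothetical helper: wrap the target token in `marker` and keep
    up to `context` tokens on each side of it.

    The marker and window shape are illustrative assumptions, not the
    format used in the paper.
    """
    if not 0 <= index < len(tokens):
        raise IndexError("index is outside the token list")
    if context < 0:
        raise ValueError("context must be non-negative")

    left = tokens[max(0, index - context):index]       # tokens before the target
    right = tokens[index + 1:index + 1 + context]      # tokens after the target
    target = f"{marker}{tokens[index]}{marker}"
    return " ".join(left + [target] + right)


# "mam" is ambiguous in isolation: it can be a form of the verb "mieć"
# or of the noun "mama". Neighbouring words help to disambiguate it.
sentence = ["Nie", "mam", "dzisiaj", "czasu"]

for context in (0, 1, 2):
    print(context, "->", build_input(sentence, 1, context))
# 0 -> *mam*
# 1 -> Nie *mam* dzisiaj
# 2 -> Nie *mam* dzisiaj czasu
```

With a context of 0 the model would see only the inflected form, while larger windows expose more of the sentence. Training with several window sizes, as the abstract mentions, amounts to varying the `context` argument in a sketch like this one.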

