CL, Jul 7, 2021

Lemmatization of Historical Old Literary Finnish Texts in Modern Orthography

arXiv:2107.03266v1714 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of processing historical Finnish texts for researchers and NLP applications, though it is incremental as it builds on existing digitization efforts.

The paper tackles the problem of applying modern NLP methods to Old Literary Finnish texts by proposing an approach for simultaneous normalization and lemmatization into modern spelling, achieving 96.3% accuracy on Agricola's texts and 87.7% on out-of-domain contemporary texts.

Texts written in Old Literary Finnish represent the first literary works ever written in Finnish, starting from the 16th century. Several projects in Finland have digitized old publications and made them available for research use. However, applying modern NLP methods to such data poses great challenges. In this paper we propose an approach for simultaneously normalizing and lemmatizing Old Literary Finnish into modern spelling. Our best model reaches 96.3% accuracy on texts written by Agricola and 87.7% accuracy on other contemporary out-of-domain text. Our method has been made freely available on Zenodo and GitHub.
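To make the accuracy figures concrete, here is a minimal sketch of how a token-level lemmatization accuracy could be computed. It assumes accuracy means the share of tokens whose predicted modern lemma exactly matches the gold lemma; the abstract does not spell out the metric, so this is an assumption. The function name `lemma_accuracy` and the toy word lists are hypothetical illustrations, not the paper's code or data.

```python
def lemma_accuracy(predicted, gold):
    """Return the fraction of positions where the predicted lemma equals the gold lemma.

    Assumption: exact string match per token; the paper may define accuracy differently.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical toy data: gold modern lemmas and a system's output for four
# historical tokens (for example, old spellings such as "ia" for "ja").
gold = ["ja", "kuin", "että", "jumala"]
predicted = ["ja", "kuin", "ette", "jumala"]  # third token left unnormalized

accuracy = lemma_accuracy(predicted, gold)
print(f"{accuracy:.1%}")  # 75.0%
```

Under this reading, a prediction counts as correct only if both the spelling normalization and the lemmatization succeed, since the output is compared against a lemma in modern orthography.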

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
