CL, Dec 28, 2021

Processing M.A. Castrén's Materials: Multilingual Typed and Handwritten Manuscripts

Niko Partanen, Jack Rueter, Mika Hämäläinen, Khalid Alnajjar

arXiv:2112.14153

Originality: Synthesis-oriented

AI Analysis

This work helps researchers in linguistics and digital humanities by improving access to archived historical materials. Its contribution is incremental, since it builds on existing cultural and linguistic studies of the same collections.

The study reports on technical workflows and infrastructure for processing multilingual typed and handwritten manuscripts from Matthias Alexander Castrén's collections, creating openly available datasets to improve usability and provide benchmarks for text recognition tasks.

This study is a technical report on various tasks that have been performed on the materials collected and published by the Finnish ethnographer and linguist Matthias Alexander Castrén (1813-1852). The Finno-Ugrian Society is publishing Castrén's manuscripts as new critical and digital editions, and several research groups have also turned their attention to these materials. We discuss the workflows and technical infrastructure used, and consider how datasets that benefit different computational tasks could be created to further improve the usability of these materials and to aid the processing of similar archived collections. We focus specifically on the parts of the collections that have been processed in a way that improves their usability in more technical applications, complementing earlier work on the cultural and linguistic aspects of these materials. Most of these datasets are openly available on Zenodo. The study points to specific areas where further research is needed and provides benchmarks for text recognition tasks.
