LG · Jun 7, 2021

Lessons learned developing and using a machine learning model to automatically transcribe 2.3 million handwritten occupation codes

arXiv:2106.03996v2 · 7 citations · Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses the challenge of efficiently and accurately transcribing large-scale historical documents for researchers and archivists, though it is incremental as it applies existing methods to a specific dataset.

The authors tackled the problem of transcribing 2.3 million handwritten occupation codes from the Norwegian 1950 census using an end-to-end machine learning pipeline, achieving 97% accuracy and requiring manual verification for only 3% of the codes.

Machine learning approaches achieve high accuracy for text recognition and are therefore increasingly used for the transcription of handwritten historical sources. However, using machine learning in production requires a streamlined end-to-end pipeline that scales to the dataset size and a model that achieves high accuracy with few manual transcriptions. The correctness of the model results must also be verified. This paper describes our lessons learned developing, tuning and using the Occode end-to-end machine learning pipeline for transcribing 2.3 million handwritten occupation codes from the Norwegian 1950 population census. We achieve an accuracy of 97% for the automatically transcribed codes, and we send 3% of the codes for manual verification. We verify that the occupation code distribution found in our results matches the distribution found in our training data, which should be representative of the census as a whole. We believe our approach and lessons learned may be useful for other transcription projects that plan to use machine learning in production. The source code is available at: https://github.com/uit-hdl/rhd-codes
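
The distribution check mentioned in the abstract can be illustrated with a minimal sketch. This is not the Occode implementation: the abstract does not say which comparison the authors use, so the total variation distance below is an assumed metric, and the function names and three-digit codes are hypothetical toy values.

```python
from collections import Counter


def code_distribution(codes):
    """Return the relative frequency of each occupation code."""
    counts = Counter(codes)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two categorical distributions.

    0.0 means the distributions are identical, 1.0 means they share no codes.
    This metric is an assumption made for illustration only.
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical toy data standing in for manually labelled training codes
# and automatically transcribed codes.
training_codes = ["011", "011", "211", "734"]
predicted_codes = ["011", "211", "211", "734"]

distance = total_variation_distance(
    code_distribution(training_codes),
    code_distribution(predicted_codes),
)
print(f"Total variation distance: {distance:.2f}")
```

A small distance would suggest that the automatic transcriptions follow the same code frequencies as the training sample. What counts as small enough is a project-specific choice that this sketch does not settle.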
