CL DL Jun 26, 2023

Transfer Learning across Several Centuries: Machine and Historian Integrated Method to Decipher Royal Secretary's Diary

arXiv:2306.14592v1, 1 citation, h-index: 14
Originality: Synthesis-oriented
AI Analysis

This work addresses the challenge of scarce annotated data for historians analyzing centuries-old documents, though it is incremental in applying existing methods to a new historical dataset.

The paper tackled the problem of named entity recognition in historical Korean texts by fine-tuning language models on a newly annotated corpus, finding that phrase markers combined with time-specific training improve performance on unseen entities from different centuries.

Named entity recognition and classification (NER) plays a foremost role in capturing the semantics of data and in anchoring translation, as well as in downstream historical study. However, NER on historical text faces challenges such as the scarcity of annotated corpora, a variety of languages, various kinds of noise, and conventions that differ greatly from those of contemporary language models. This paper introduces a Korean historical corpus (the Diary of the Royal Secretary, named SeungJeongWon) that was recorded over several centuries and recently augmented with named entity information and phrase markers carefully annotated by historians. We fine-tuned a language model on the historical corpus and conducted extensive comparative experiments using our language model and pretrained multilingual models. We formulated a hypothesis about the combination of time and annotation information and tested it with a statistical t test. Our findings show that phrase markers clearly improve the performance of the NER model in predicting unseen entities in documents written in a far different time period. They also show that neither phrase markers nor a corpus-specific trained model, each taken on its own, improves performance. We discuss future research directions and practical strategies for deciphering historical documents.
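To make the hypothesis-testing step concrete, here is a minimal sketch of a t test on NER scores. The abstract does not say which variant of the t test was used, so the paired form below is an assumption, as is the idea of comparing F1 scores from matched runs with and without phrase markers. The function name `paired_t_statistic` and all score values are hypothetical placeholders, not results from the paper.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a paired t test on matched scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")
    # t = mean difference divided by its standard error
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


if __name__ == "__main__":
    # Placeholder F1 scores invented for illustration only (NOT from the paper):
    # one value per matched run, with and without phrase markers.
    with_markers = [0.80, 0.82, 0.79, 0.83, 0.81]
    without_markers = [0.78, 0.79, 0.78, 0.80, 0.79]
    t, dof = paired_t_statistic(with_markers, without_markers)
    print(f"t = {t:.3f}, degrees of freedom = {dof}")
```

The resulting statistic would then be compared against the t distribution with the returned degrees of freedom to obtain a p-value. A statistics library can do that step, but it is left out here to keep the sketch dependency-free.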
