CL · Oct 28, 2025

Open Korean Historical Corpus: A Millennia-Scale Diachronic Collection of Public Domain Texts

Seyoung Song, Nawon Kim, Songeun Chae, Kiwoong Park, Jiho Jin, Haneul Yoo, Kyunghyun Cho, Alice Oh

arXiv:2510.24541v1 · h-index: 13

Originality: Synthesis-oriented

AI Analysis

This corpus is a foundational resource for quantitative diachronic analysis and NLP applications, aimed at Korean language researchers and practitioners. Its contribution is incremental: it fills a data gap rather than proposing a new method.

The authors address the lack of accessible historical corpora for Korean by introducing the Open Korean Historical Corpus, a large-scale dataset spanning 1,300 years with 18 million documents and 5 billion tokens. They use it to quantify linguistic shifts, including the rapid transition from Hanja to Hangul starting around 1890 and North Korea's lexical divergence, which causes modern tokenizers to produce up to 51 times higher out-of-vocabulary rates.

The history of the Korean language is characterized by a discrepancy between its spoken and written forms and a pivotal shift from Chinese characters to the Hangul alphabet. However, this linguistic evolution has remained largely unexplored in NLP due to a lack of accessible historical corpora. To address this gap, we introduce the Open Korean Historical Corpus, a large-scale, openly licensed dataset spanning 1,300 years and 6 languages, as well as under-represented writing systems like Korean-style Sinitic (Idu) and Hanja-Hangul mixed script. This corpus contains 18 million documents and 5 billion tokens from 19 sources, ranging from the 7th century to 2025. We leverage this resource to quantitatively analyze major linguistic shifts: (1) Idu usage peaked in the 1860s before declining sharply; (2) the transition from Hanja to Hangul was a rapid transformation starting around 1890; and (3) North Korea's lexical divergence causes modern tokenizers to produce up to 51 times higher out-of-vocabulary rates. This work provides a foundational resource for quantitative diachronic analysis by capturing the history of the Korean language. Moreover, it can serve as a pre-training corpus for large language models, potentially improving their understanding of Sino-Korean vocabulary in modern Hangul as well as archaic writing systems.
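The two measurements named in the abstract, the Hanja-to-Hangul balance and the out-of-vocabulary rate, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the choice of Unicode blocks, and the use of a plain token set in place of a real tokenizer vocabulary are all assumptions made for illustration.

```python
from collections import Counter

# Unicode blocks used to tell the two scripts apart (an assumed, simplified choice).
HANGUL_RANGES = [(0xAC00, 0xD7A3), (0x1100, 0x11FF), (0x3130, 0x318F)]
HANJA_RANGES = [(0x4E00, 0x9FFF), (0x3400, 0x4DBF), (0xF900, 0xFAFF)]


def classify(ch):
    """Label one character as 'hangul', 'hanja', or 'other'."""
    cp = ord(ch)
    if any(lo <= cp <= hi for lo, hi in HANGUL_RANGES):
        return "hangul"
    if any(lo <= cp <= hi for lo, hi in HANJA_RANGES):
        return "hanja"
    return "other"


def script_share(text):
    """Return the Hangul and Hanja shares among Hangul+Hanja characters."""
    counts = Counter(classify(ch) for ch in text)
    total = counts["hangul"] + counts["hanja"]
    if total == 0:
        return {"hangul": 0.0, "hanja": 0.0}
    return {
        "hangul": counts["hangul"] / total,
        "hanja": counts["hanja"] / total,
    }


def oov_rate(tokens, vocab):
    """Fraction of tokens missing from vocab (a stand-in for a tokenizer vocabulary)."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t not in vocab) / len(tokens)


if __name__ == "__main__":
    # Two Hanja characters and three Hangul syllables.
    print(script_share("國語 한국어"))
    # Hypothetical tokens and vocabulary, for illustration only.
    print(oov_rate(["a", "b", "c", "d"], {"a", "b"}))
```

Computing `script_share` per decade over dated documents would give a curve of the kind the abstract describes for the shift around 1890, and dividing the `oov_rate` of one text collection by that of another gives a ratio comparable in form to the reported "51 times" figure.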
