CL · AI · Jun 2

Pretraining Language Models on Historical Text

Xiaoxi Luo, Zachary Shinnick, Niclas Griesshaber, Yixuan Wang, Junchi Yu, Freda Shi, Philip Torr, Yao Lu

arXiv:2606.02991

Predicted impact: top 89% in CL · last 90 days
Originality: Incremental advance

AI Analysis

This work supports researchers studying pre-1913 English texts by providing a language model trained without later data, and it addresses two gaps in the area: temporal data leakage and reliable evaluation.

TypewriterLM is a 7.24B-parameter language model trained exclusively on pre-1913 English text. It targets temporal consistency and historical accuracy through a 54B-token corpus with leakage mitigation, a lexically grounded post-training framework, and a benchmark suite that evaluates competence, temporal grounding, and data leakage.

We introduce TypewriterLM, a 7.24B-parameter History language model (LM) trained exclusively on English text predating 1913. Developing History LMs requires addressing challenges in data quality and availability, preventing temporal leakage, designing temporally consistent post-training pipelines, and constructing reliable evaluations. To address these issues, we construct TypewriterCorpus, a 54B-token historical corpus collected from diverse archival and linguistically annotated sources with extensive data cleaning and leakage mitigation procedures. Furthermore, we introduce lexically grounded instruction tuning, a post-training framework that constrains responses to remain directly grounded in historical source documents. Using this framework, we construct two historical instruction tuning datasets: History-LIMA and History-SelfInstruct. To evaluate capability and temporal consistency, we introduce History-Event, a benchmark suite for evaluating competence, temporal grounding, and data leakage. We release TypewriterLM and all associated resources to support future research on historical language models.
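The abstract does not describe how leakage mitigation is carried out, so the following is only a minimal sketch of one plausible approach, a cutoff filter over document metadata and text. The `Document` type, the `leaks` and `filter_corpus` helpers, and the `ANACHRONISMS` term list are all hypothetical names introduced for illustration and are not taken from the paper; only the 1913 cutoff comes from the abstract.

```python
import re
from dataclasses import dataclass

CUTOFF_YEAR = 1913  # from the abstract: training text must predate 1913

# Hypothetical examples of post-cutoff vocabulary; not a list from the paper.
ANACHRONISMS = ("television", "internet", "world war ii")

# Matches four-digit years from 1000 to 2099 as standalone tokens.
YEAR_PATTERN = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


@dataclass
class Document:
    text: str
    year: int  # assumed to be the publication year from archive metadata


def leaks(doc: Document) -> bool:
    """Return True if the document shows signs of post-cutoff content."""
    # 1. Metadata check: the document itself is dated at or after the cutoff.
    if doc.year >= CUTOFF_YEAR:
        return True
    # 2. Vocabulary check: the text contains a known post-cutoff term.
    lowered = doc.text.lower()
    if any(term in lowered for term in ANACHRONISMS):
        return True
    # 3. Date check: the text mentions a year at or after the cutoff,
    #    which can indicate a later preface, reprint note, or OCR mix-up.
    return any(int(y) >= CUTOFF_YEAR for y in YEAR_PATTERN.findall(doc.text))


def filter_corpus(docs):
    """Keep only documents that pass all three leakage checks."""
    return [d for d in docs if not leaks(d)]
```

A filter this simple would produce false positives, for example on period fiction that imagines a future year, so a real pipeline would likely route flagged documents to a finer-grained check rather than discard them outright.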
