CL · AI · Jun 2

Pretraining Language Models on Historical Text

arXiv:2606.02991
AI Analysis

This work supports historically grounded language modeling for researchers studying pre-1913 English text, addressing two gaps: temporal data leakage and the lack of reliable evaluations.

TypewriterLM is a 7.24B-parameter language model trained exclusively on pre-1913 English text. It pairs a 54B-token historical corpus with a post-training framework aimed at temporal consistency, and it is evaluated for competence, temporal grounding, and data leakage.

We introduce TypewriterLM, a 7.24B-parameter History language model (LM) trained exclusively on English text predating 1913. Developing History LMs requires addressing challenges in data quality and availability, preventing temporal leakage, designing temporally consistent post-training pipelines, and constructing reliable evaluations. To address these issues, we construct TypewriterCorpus, a 54B-token historical corpus collected from diverse archival and linguistically annotated sources with extensive data cleaning and leakage mitigation procedures. Furthermore, we introduce lexically grounded instruction tuning, a post-training framework that constrains responses to remain directly grounded in historical source documents. Using this framework, we construct two historical instruction-tuning datasets: History-LIMA and History-SelfInstruct. To evaluate capability and temporal consistency, we introduce History-Event, a benchmark suite for evaluating competence, temporal grounding, and data leakage. We release TypewriterLM and all associated resources to support future research on historical language models.
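
The abstract mentions leakage mitigation but does not describe the procedure. As a rough illustration of what a temporal leakage check might look like, the sketch below flags documents whose metadata year, explicit year mentions, or vocabulary point past the 1913 cutoff. This is not the authors' pipeline: the `Document` type, the `ANACHRONISMS` term list, and both function names are hypothetical and exist only for this example.

```python
import re
from dataclasses import dataclass

CUTOFF_YEAR = 1913  # the abstract states the training text predates 1913

# Hypothetical, deliberately tiny list of terms that postdate the cutoff.
ANACHRONISMS = ("internet", "penicillin", "world war ii")

# Matches four-digit years from 1500 to 2099 written out in the text.
YEAR_PATTERN = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")


@dataclass
class Document:
    text: str
    year: int  # claimed publication year taken from archive metadata


def leakage_reasons(doc, cutoff=CUTOFF_YEAR):
    """Return the reasons a document looks post-cutoff (empty list if none)."""
    reasons = []
    if doc.year >= cutoff:
        reasons.append("metadata year")
    lowered = doc.text.lower()
    for term in ANACHRONISMS:
        if term in lowered:
            reasons.append(f"term: {term}")
    for match in YEAR_PATTERN.finditer(doc.text):
        if int(match.group()) >= cutoff:
            reasons.append(f"year mention: {match.group()}")
    return reasons


def filter_corpus(docs, cutoff=CUTOFF_YEAR):
    """Keep only documents with no detected leakage signal."""
    return [d for d in docs if not leakage_reasons(d, cutoff)]
```

A check like this only catches surface signals. Later reprints, editorial notes, and OCR-corrupted dates would need additional handling that the abstract does not specify.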
