CL LG | May 24, 2021

Neural Language Models for Nineteenth-Century English

Kasra Hosseini, Kaspar Beelen, Giovanni Colavizza, Mariona Coll Ardanuy

arXiv:2105.11321

Originality: Synthesis-oriented

AI Analysis

This work provides domain-specific tools for historical linguistics and digital humanities, but its contribution is incremental: it applies existing methods to new historical data.

The authors address the problem of modeling historical English by training four types of neural language models, covering both static and contextualized architectures, on a dataset of ~5.1 billion tokens from books published between 1760 and 1900. They report that the models consistently improved performance in downstream tasks.

We present four types of neural language models trained on a large historical dataset of books in English, published between 1760 and 1900 and comprising ~5.1 billion tokens. The language model architectures include static models (word2vec and fastText) and contextualized models (BERT and Flair). For each architecture, we trained a model instance using the whole dataset. Additionally, we trained separate instances on text published before 1850 for the two static models, and four instances considering different time slices for BERT. Our models have already been used in various downstream tasks where they consistently improved performance. In this paper, we describe how the models have been created and outline their reuse potential.
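The idea of training separate instances on time slices of the corpus can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Book` record, the helper names `published_before` and `published_between`, and the toy corpus are all hypothetical. Only the 1760 to 1900 range and the pre-1850 cutoff come from the abstract; the abstract does not state the boundaries of the four BERT slices, so none are assumed here.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Book:
    """Hypothetical record for one book in the corpus."""
    title: str
    year: int  # publication year
    text: str


def published_before(books: Iterable[Book], cutoff: int) -> List[Book]:
    """Keep books whose publication year is strictly earlier than `cutoff`."""
    return [b for b in books if b.year < cutoff]


def published_between(books: Iterable[Book], first: int, last: int) -> List[Book]:
    """Keep books published in the inclusive range [first, last]."""
    return [b for b in books if first <= b.year <= last]


# Toy corpus with placeholder titles and texts (not real data).
corpus = [
    Book("book_a", 1772, "..."),
    Book("book_b", 1849, "..."),
    Book("book_c", 1850, "..."),
    Book("book_d", 1899, "..."),
]

# One training subset for the whole period, one for text published before 1850.
full_period = published_between(corpus, 1760, 1900)
pre_1850 = published_before(corpus, 1850)
```

Each subset would then be passed to a separate training run, so that a model instance only sees text from its own period.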

