The InviTE Corpus: Annotating Invectives in Tudor English Texts for Computational Modeling
This work supports researchers in NLP and Tudor history who study invective language in historical texts, but its contribution is incremental: it applies existing methods to a new dataset.
The paper tackles the detection of religious invectives in Tudor English texts by introducing the InviTE corpus of almost 2000 annotated sentences, and finds that BERT-based models pre-trained on historical data and fine-tuned for invective detection outperform zero-shot prompted LLMs on this task.
In this paper, we apply Natural Language Processing (NLP) techniques to historical research, specifically the study of religious invectives in the context of the Protestant Reformation in Tudor England. We outline a workflow that leads from raw data, through pre-processing and data selection, to an iterative annotation process. The result is the InviTE corpus, a collection of almost 2000 Early Modern English (EModE) sentences enriched with expert annotations of invective language throughout 16th-century England. We then assess and compare the performance of fine-tuned BERT-based models and zero-shot prompted instruction-tuned large language models (LLMs). The comparison highlights the superiority of models pre-trained on historical data and fine-tuned for invective detection.