CL, Apr 11, 2022

Entities, Dates, and Languages: Zero-Shot on Historical Texts with T0

Hugging Face
arXiv:2204.05211v1640 citations, h-index: 13
Originality: Synthesis-oriented
AI Analysis

This work addresses the challenge of analyzing historical texts in multiple languages without labeled data. Its contribution is incremental: it applies existing zero-shot methods to a new domain.

The study investigated whether the T0 model's zero-shot capabilities extend to Named Entity Recognition on historical texts in out-of-distribution languages and time periods. It found that a naive prompt-based approach is error-prone but shows potential for languages lacking labeled datasets. It also showed that the model can be probed to predict a document's publication date and language.

In this work, we explore whether the recently demonstrated zero-shot abilities of the T0 model extend to Named Entity Recognition for out-of-distribution languages and time periods. Using a historical newspaper corpus in 3 languages as a test bed, we use prompts to extract possible named entities. Our results show that a naive approach to prompt-based zero-shot multilingual Named Entity Recognition is error-prone, but they highlight the potential of such an approach for historical languages lacking labeled datasets. Moreover, we find that T0-like models can be probed to predict the publication date and language of a document, which could be very relevant for the study of historical texts.
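The prompt-based extraction described in the abstract might look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the authors' implementation. The prompt wording, the helper names (`build_ner_prompt`, `parse_entities`, `zero_shot_ner`, `load_t0_generate`) and the comma-separated answer format are all hypothetical. The filter that drops candidates not found in the input is an added guard against the error-proneness the abstract mentions, not a step taken from the paper. The model call is passed in as a plain function, so the prompt and parsing logic can be tried without downloading a checkpoint.

```python
from typing import Callable, List

# Hypothetical prompt template; the paper's actual prompts are not reproduced here.
NER_TEMPLATE = (
    "Input: {text}\n"
    "In the input, what are the names of {entity_type}? "
    "Separate the answers with commas."
)


def build_ner_prompt(text: str, entity_type: str) -> str:
    """Fill the (assumed) template with a passage and an entity type."""
    return NER_TEMPLATE.format(text=text, entity_type=entity_type)


def parse_entities(answer: str, source_text: str) -> List[str]:
    """Split a comma-separated answer and keep only spans present in the source.

    Dropping spans that never occur in the passage is an illustrative guard
    against invented entities; it is an assumption, not the paper's method.
    """
    kept: List[str] = []
    lowered_source = source_text.lower()
    for candidate in answer.split(","):
        span = candidate.strip().strip(".")
        if span and span.lower() in lowered_source and span not in kept:
            kept.append(span)
    return kept


def zero_shot_ner(
    text: str, entity_type: str, generate: Callable[[str], str]
) -> List[str]:
    """Prompt a text-to-text model and parse its free-text answer."""
    return parse_entities(generate(build_ner_prompt(text, entity_type)), text)


def load_t0_generate(model_name: str = "bigscience/T0_3B") -> Callable[[str], str]:
    """Return a generate function backed by a T0 checkpoint.

    Requires the `transformers` package and a large download; the choice of
    the 3B checkpoint is an assumption made to keep the sketch lightweight.
    """
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    def generate(prompt: str) -> str:
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
        output_ids = model.generate(**inputs, max_new_tokens=64)
        return tokenizer.decode(output_ids[0], skip_special_tokens=True)

    return generate
```

With a real model, `zero_shot_ner(passage, "people", load_t0_generate())` would return the candidate spans that survive the filter. The same pattern, with a different template, could be used to probe for a publication date or language.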

Code Implementations: 1 repo
