EDDA-Coordinata: An Annotated Dataset of Historical Geographic Coordinates

Ludovic Moncla, Pierre Nugues, Thierry Joliveau, Katherine McDonough

arXiv:2602.23941v1 | 0.6 | h-index: 28

Originality: Incremental advance

AI Analysis

This work addresses the challenge of extracting structured geographic data from digitized early modern texts, a task of interest to historians and digital humanities researchers. It is an incremental improvement that contributes a new annotated dataset and a method.

The paper tackles the automatic recovery of geographic coordinates from historical texts. The authors build a gold standard dataset from Diderot and d'Alembert's eighteenth-century Encyclopédie and train transformer-based models to retrieve and normalize coordinates. The models reach an 86% exact match (EM) score in cross-validation. They also generalize across languages and domains, scoring 61% EM on an eighteenth-century French dictionary and 77% EM on a nineteenth-century English encyclopedia.
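To make the EM figures concrete, here is a minimal sketch of how such a score could be computed, assuming EM means exact match between a predicted and a reference coordinate string. The function name exact_match and the whitespace and case normalization are assumptions for illustration; the summary does not say how the authors compare strings.

```python
from typing import Sequence


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Return the fraction of predictions identical to their reference.

    Assumption: strings are compared after collapsing whitespace and
    lowercasing. The paper may apply a different normalization.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0

    def normalize(text: str) -> str:
        # Collapse runs of whitespace and ignore case.
        return " ".join(text.split()).lower()

    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)
```

Under this definition, a prediction earns credit only when the whole normalized string agrees, so a coordinate that is off by a single minute counts as a miss.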

This paper introduces a dataset of enriched geographic coordinates retrieved from Diderot and d'Alembert's eighteenth-century Encyclopédie. Automatically recovering geographic coordinates from historical texts is a complex task, as they are expressed in a variety of ways and with varying levels of precision. To improve retrieval of coordinates from similar digitized early modern texts, we have created a gold standard dataset, trained models, published the resulting inferred and normalized coordinate data, and experimented with applying these models to new texts. From 74,000 total articles in each of the digitized versions of the Encyclopédie from ARTFL and ENCCRE, we examined 15,278 geographical entries, manually identifying 4,798 containing coordinates and 10,480 with descriptive but non-numerical references. Leveraging our gold standard annotations, we trained transformer-based models to retrieve and normalize coordinates. The pipeline presented here combines a classifier to identify coordinate-bearing entries and a second model for retrieval, tested across encoder-decoder and decoder architectures. Cross-validation yielded an 86% exact match (EM) score. On the out-of-domain eighteenth-century Trévoux dictionary (also in French), our fine-tuned model had a 61% EM score, while for the nineteenth-century 7th edition of the Encyclopaedia Britannica in English, the EM was 77%. These findings highlight the gold standard dataset's usefulness as training data, and our two-step method's cross-lingual, cross-domain generalizability.
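The two-step structure described in the abstract (first decide whether an entry carries coordinates, then retrieve and normalize them) can be illustrated with a toy sketch. This is not the authors' method: a regular expression stands in for both transformer models, and the "long. D. M. lat. D. M." entry format, the pattern COORD_RE, and the function names are all hypothetical. The sketch also ignores hemispheres and whichever prime meridian a source uses.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern for entries such as "Long. 22. 30. lat. 48. 15."
# (degrees, then optional minutes). Real entries vary far more than this.
COORD_RE = re.compile(
    r"long\.\s*(\d{1,3})(?:\.\s*(\d{1,2}))?\.?\s*,?\s*"
    r"lat\.\s*(\d{1,2})(?:\.\s*(\d{1,2}))?",
    re.IGNORECASE,
)


def has_coordinates(entry: str) -> bool:
    """Step 1 stand-in: flag entries that appear to contain coordinates."""
    return COORD_RE.search(entry) is not None


def extract_coordinates(entry: str) -> Optional[Tuple[float, float]]:
    """Step 2 stand-in: return (latitude, longitude) in decimal degrees."""
    match = COORD_RE.search(entry)
    if match is None:
        return None
    lon_deg, lon_min, lat_deg, lat_min = match.groups()
    # Minutes are optional in the pattern; treat a missing value as zero.
    longitude = int(lon_deg) + int(lon_min or 0) / 60
    latitude = int(lat_deg) + int(lat_min or 0) / 60
    return latitude, longitude


def pipeline(entry: str) -> Optional[Tuple[float, float]]:
    """Run retrieval only on entries that pass the first step."""
    if not has_coordinates(entry):
        return None
    return extract_coordinates(entry)
```

Splitting the work this way keeps the retrieval step away from entries that only describe a location in words, which are the majority in the figures quoted above (10,480 of 15,278).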
