CL, AI | Mar 20

Curation of a Palaeohispanic Dataset for Machine Learning

Gonzalo Martínez-Fernández, Jose F Quesada, Agustín Riscos-Núñez, Francisco José Salguero-Lamillar

arXiv:2604.13070 | 2.8 | h-index: 9

AI Analysis

This work addresses the shortage of usable resources facing linguists and researchers who study ancient Iberian languages by providing a curated dataset. The contribution is incremental, since it focuses on data preparation rather than new methods.

The paper tackles the problem of limited and unsuitably formatted resources for studying Palaeohispanic languages by constructing a structured dataset, so that computational approaches such as machine learning can be applied to research in this field.

Palaeohispanic languages are those spoken in the Iberian Peninsula before the arrival of the Romans in the 3rd century BC. Their study was set in motion after Gómez Moreno deciphered the Iberian Levantine script, one of several semi-syllabaries used to write these languages. Still, the Palaeohispanic languages have been deciphered to varying degrees, and none is fully understood to this day. Most studies have been carried out from a purely linguistic point of view, and a computational approach could benefit this research area greatly. However, the available resources are limited and presented in a format unsuitable for techniques such as Machine Learning. Therefore, a structured dataset is constructed, which will hopefully allow more progress in the field.
