Language and Dialect Identification of Cuneiform Texts
This work addresses language identification for researchers in archaeology and linguistics; its contribution is incremental, as it focuses on establishing a dataset and initial baselines.
The authors address automatic language and dialect identification in cuneiform texts by creating a corpus and deriving a dataset from it, and they report preliminary experiments and baseline results that, to their knowledge, are the first application of such methods to cuneiform data.
This article introduces a corpus of cuneiform texts, from which the dataset for the Cuneiform Language Identification (CLI) 2019 shared task was derived, and reports some preliminary language identification experiments conducted using that corpus. We also describe the CLI dataset and how it was derived from the corpus. In addition, we provide baseline language identification results using the CLI dataset. To the best of our knowledge, the experiments detailed here are the first use of automatic language identification methods on cuneiform data.