CL | Nov 8, 2024

Evaluating and Adapting Large Language Models to Represent Folktales in Low-Resource Languages

arXiv:2411.05593v1 | 2.73 citations | h-index: 1

Originality: Synthesis-oriented

AI Analysis

This work addresses the challenge of digital folklore research by adapting LLMs for low-resource languages, but it is incremental as it builds on existing methods with modest gains.

The study evaluated large language models (LLMs) for representing folktales in low-resource languages like Irish and Gaelic, finding that adaptations such as handling longer sequences and domain-specific pre-training improved classification performance, though a baseline SVM with non-contextual features performed comparably well.

Folktales are a rich resource of knowledge about the society and culture of a civilisation. Digital folklore research aims to use automated techniques to better understand these folktales, and it relies on abstract representations of the textual data. Although a number of large language models (LLMs) claim to be able to represent low-resource languages such as Irish and Gaelic, we present two classification tasks to explore how useful these representations are, and three adaptations to improve the performance of these models. We find that adapting the models to work with longer sequences and continuing pre-training on the domain of folktales improve classification performance, although these findings are tempered by the impressive performance of a baseline SVM with non-contextual features.
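The abstract does not say how the models were adapted to longer sequences. One common way to apply a fixed-length encoder to a long folktale is to split it into overlapping windows and average the per-window vectors; the sketch below illustrates that idea only and is not the paper's method. The names `sliding_windows` and `mean_pool`, and the `max_len` and `stride` parameters, are hypothetical.

```python
from typing import List, Sequence


def sliding_windows(tokens: Sequence[str], max_len: int, stride: int) -> List[List[str]]:
    """Split a token sequence into overlapping windows of at most max_len tokens.

    Assumption: the encoder accepts at most max_len tokens, and consecutive
    windows start stride tokens apart (so they overlap by max_len - stride).
    """
    if max_len <= 0 or stride <= 0 or stride > max_len:
        raise ValueError("need 0 < stride <= max_len")
    windows = []
    start = 0
    while True:
        windows.append(list(tokens[start:start + max_len]))
        # Stop once the current window reaches the end of the text.
        if start + max_len >= len(tokens):
            break
        start += stride
    return windows


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-window vectors into a single document-level vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]
```

In use, each window would be passed to the encoder, and the pooled vector would be passed to a classifier. Mean pooling is the simplest choice here; other aggregation schemes are equally plausible.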
