Rdgai: Classifying transcriptional changes using Large Language Models with a test case from an Arabic Gospel tradition
This work addresses a specific bottleneck for researchers in textual criticism and phylogenetics by automating a tedious classification step. The contribution is incremental, since it builds on existing probabilistic methods.
The paper tackles the time-consuming manual classification of textual variants in phylogenetic analysis. It introduces Rdgai, a software package that automates this task using multi-lingual large language models and thereby lowers the barrier to entry for such analyses.
Application of phylogenetic methods to textual traditions has traditionally treated all changes as equivalent, even though it is widely recognized that certain types of variants were more likely to be introduced than others. While it is possible to give weights to certain changes using a maximum parsimony evaluation criterion, it is difficult to state a priori what these weights should be. Probabilistic methods, such as Bayesian phylogenetics, allow users to create categories of changes, and the transition rates for each category can be estimated as part of the analysis. Classifying the types of changes in readings also makes it possible to inspect the probability of these categories across each branch in the resulting trees. However, classification of readings is time-consuming, as it requires categorizing each reading against every other reading at each variation unit, and this presents a significant barrier to entry for this kind of analysis. This paper presents Rdgai, a software package that automates this classification task using multi-lingual large language models (LLMs). The tool provides a simple way for users to classify changes in readings manually, and it then uses these annotations in the prompt for an LLM to classify the remaining reading transitions automatically. These classifications are stored in TEI XML and are ready for downstream phylogenetic analysis. This paper demonstrates the application with data from an Arabic translation of the Gospels.
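To make the scale of the classification task concrete, the following minimal Python sketch enumerates the reading transitions in a single variation unit and assembles a few-shot prompt from manual annotations. It is an illustration only and is not Rdgai's actual interface: the function names `reading_pairs` and `build_prompt`, the category labels, the prompt wording, and the choice to treat transitions as directed (ordered) pairs are all assumptions made for this example.

```python
from itertools import permutations


def reading_pairs(readings):
    """Return every ordered (source, target) pair of distinct readings.

    Assumption: transitions are directed, so A -> B and B -> A are
    classified separately. A unit with n readings yields n * (n - 1) pairs.
    """
    return list(permutations(readings, 2))


def build_prompt(categories, examples, source, target):
    """Assemble a few-shot prompt from manually classified transitions.

    `examples` is a list of (source, target, category) tuples supplied
    by the user. The wording of the prompt is hypothetical.
    """
    lines = [
        "Classify the change from one reading to another.",
        "Allowed categories: " + ", ".join(categories),
        "",
    ]
    for ex_source, ex_target, label in examples:
        lines.append(f"{ex_source} -> {ex_target}: {label}")
    # The final line is left open for the model to complete.
    lines.append(f"{source} -> {target}:")
    return "\n".join(lines)


# Placeholder readings and hypothetical category names.
pairs = reading_pairs(["reading A", "reading B", "reading C"])
prompt = build_prompt(
    categories=["Orthographic", "Omission"],
    examples=[("reading A", "reading B", "Orthographic")],
    source="reading A",
    target="reading C",
)
```

Under these assumptions, a variation unit with three readings already yields six directed transitions, and the count grows quadratically with the number of readings, which is why manual classification becomes a barrier. Sending the prompt to an LLM and writing the result back to TEI XML are omitted here.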