Prompt and circumstance: A word-by-word LLM prompting approach to interlinear glossing for low-resource languages
This work addresses the challenge of linguistic documentation for low-resource languages by making glossing more accessible to linguists. The contribution is incremental, since it builds on existing methods with LLMs.
The authors address the automation of interlinear glossed text (IGT) creation for low-resource languages with a retrieval-based LLM prompting approach. Their system beats the BERT-based shared task baseline in morpheme-level scores for all seven languages, and a simple 3-best oracle built on its output has higher word-level scores than the challenge winner (a tuned sequence model) in five languages.
Partly automated creation of interlinear glossed text (IGT) has the potential to assist in linguistic documentation. We argue that LLMs can make this process more accessible to linguists because of their capacity to follow natural-language instructions. We investigate the effectiveness of a retrieval-based LLM prompting approach to glossing, applied to the seven languages from the SIGMORPHON 2023 shared task. Our system beats the BERT-based shared task baseline for every language in the morpheme-level score category, and we show that a simple 3-best oracle has higher word-level scores than the challenge winner (a tuned sequence model) in five languages. In a case study on Tsez, we ask the LLM to automatically create and follow linguistic instructions, reducing errors on a confusing grammatical feature. Our results thus demonstrate the potential contributions that LLMs can make in interactive systems for glossing, both in making suggestions to human annotators and in following directions.
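To make the 3-best oracle comparison concrete, the sketch below shows one common way such a score can be computed: a word counts as correct if the gold gloss appears anywhere among the system's top k suggestions. This is an illustration under assumptions, not the authors' evaluation code. The function name `k_best_oracle_accuracy`, the data layout, and the example glosses are all hypothetical, and the exact scoring rules of the shared task may differ.

```python
from typing import Sequence


def k_best_oracle_accuracy(
    candidates: Sequence[Sequence[str]],
    gold: Sequence[str],
    k: int = 3,
) -> float:
    """Word-level accuracy where a word is a hit if its gold gloss is
    among the first k ranked candidates (hypothetical helper).

    candidates[i] is the ranked list of suggested glosses for word i.
    gold[i] is the reference gloss for word i.
    """
    if len(candidates) != len(gold):
        raise ValueError("candidates and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(1 for cands, ref in zip(candidates, gold) if ref in cands[:k])
    return hits / len(gold)


# Invented example data: three words, each with three ranked suggestions.
candidates = [
    ["dog-PL", "dog", "dog-ERG"],        # gold is ranked first
    ["see-PST", "see-PRS", "see"],       # gold is ranked second
    ["house-IN", "house-LOC", "house"],  # gold is not suggested at all
]
gold = ["dog-PL", "see-PRS", "house-ABL"]

top1 = k_best_oracle_accuracy(candidates, gold, k=1)
top3 = k_best_oracle_accuracy(candidates, gold, k=3)
print(f"top-1 accuracy: {top1:.3f}")
print(f"3-best oracle accuracy: {top3:.3f}")
```

With k=1 this reduces to ordinary word-level accuracy. The gap between the two numbers indicates how often a human annotator choosing among the suggestions could recover the correct gloss, which is the interactive setting the abstract has in mind.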