A Rule-based Computational Model for Gaidhlig Morphology
This work addresses the scarcity of data for lesser-used languages with an interpretable, rule-based approach that could aid teaching and tool development. Its contribution is incremental, since it adapts existing Wiktionary data rather than introducing new data.
The paper addresses support for low-resource languages such as Gaidhlig by constructing a rule-based computational model of its morphology from Wiktionary data. The resulting system uses SQL queries to examine the occurrence of lexical patterns and Python utilities to derive inflected forms, with educational and parsing tools as the intended applications.
Language models and software tools are essential to support the continuing vitality of lesser-used languages; however, currently popular neural models require considerable training data, which is normally not available for such low-resource languages. This paper describes work in progress to construct a rule-based model of Gaidhlig morphology using data from Wiktionary, arguing that rule-based systems make effective use of limited sample data, support greater interpretability, and provide insights useful in the design of teaching materials. The use of SQL for querying the occurrence of different lexical patterns is investigated, and a declarative rule base is presented that allows Python utilities to derive inflected forms of Gaidhlig words. This functionality could be used, for example, to support educational tools that teach or explain language patterns, or to support higher-level tools such as rule-based dependency parsers. This approach adds value to the data already present in Wiktionary by adapting it to new use cases.
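To make the combination of SQL pattern queries and a declarative rule base concrete, the following is a minimal sketch and not the paper's implementation. The `lemma` table, the `RULES` format, and the functions `apply_rule` and `initial_counts` are all hypothetical names chosen for illustration. The single rule shown is a simplified, lowercase-only version of orthographic lenition (inserting `h` after an initial b, c, d, f, g, m, p, s, or t, except for s before g, m, p, or t); the sample words are entered by hand rather than taken from Wiktionary.

```python
import re
import sqlite3

# Hypothetical declarative rule base: each rule is plain data, not code.
# The lenition rule is a simplified orthographic approximation (lowercase only).
RULES = [
    {
        "name": "lenition",
        # Initial lenitable consonant, not already followed by 'h';
        # 's' is excluded when it precedes g, m, p or t.
        "pattern": r"^([bcdfgmpt]|s(?![gmpt]))(?!h)",
        "replacement": r"\1h",
    },
]


def apply_rule(word, rule_name, rules=RULES):
    """Apply the named rule to a word; return the word unchanged if it does not match."""
    for rule in rules:
        if rule["name"] == rule_name:
            return re.sub(rule["pattern"], rule["replacement"], word, count=1)
    raise KeyError(f"unknown rule: {rule_name}")


def build_sample_db():
    """Create an in-memory table of hand-entered lemmas (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE lemma (form TEXT PRIMARY KEY, pos TEXT)")
    conn.executemany(
        "INSERT INTO lemma (form, pos) VALUES (?, ?)",
        [
            ("cat", "noun"),
            ("bàta", "noun"),
            ("sgoil", "noun"),
            ("eun", "noun"),
            ("mòr", "adjective"),
        ],
    )
    return conn


def initial_counts(conn):
    """Count lemmas by initial letter, a simple example of a lexical-pattern query."""
    query = (
        "SELECT substr(form, 1, 1) AS initial, COUNT(*) "
        "FROM lemma GROUP BY initial ORDER BY initial"
    )
    return conn.execute(query).fetchall()


conn = build_sample_db()
lenited = {
    form: apply_rule(form, "lenition")
    for (form,) in conn.execute("SELECT form FROM lemma ORDER BY form")
}
```

Keeping the rule as data (a name, a pattern, and a replacement) means the same entry could in principle be rendered as an explanation for learners as well as executed, which is the kind of interpretability the abstract argues for. A fuller model would need many more rules and exception handling that this sketch does not attempt.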