Unsupervised Protoform Reconstruction through Parsimonious Rule-guided Heuristics and Evolutionary Search
This work addresses protoform reconstruction, a task of interest to linguists, and offers a more accurate and more plausible method. The contribution appears incremental, however, since it builds on existing data-driven approaches by adding rule-based constraints.
The paper tackles the problem of reconstructing ancestral word forms (protoforms) from modern languages by proposing an unsupervised hybrid method that combines rule-based heuristics with evolutionary search. It achieves substantial improvements over baselines in character-level accuracy and phonological plausibility metrics on a dataset of Romance language cognates.
We propose an unsupervised method for the reconstruction of protoforms, i.e., ancestral word forms from which modern language forms are derived. While prior work has primarily relied on probabilistic models of phonological edits to infer protoforms from cognate sets, such approaches are limited by their predominantly data-driven nature. In contrast, our model integrates data-driven inference with rule-based heuristics within an evolutionary optimization framework. This hybrid approach leverages both statistical patterns and linguistically motivated constraints to guide the reconstruction process. We evaluate our method on the task of reconstructing Latin protoforms using a dataset of cognates from five Romance languages. Experimental results demonstrate substantial improvements over established baselines across both character-level accuracy and phonological plausibility metrics.
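To make the general idea concrete, the following is a minimal toy sketch of how an evolutionary search can combine a data-driven cost with a rule-based penalty. It is not the method of the paper: the fitness function, the single consonant-cluster rule, the mutation operators, and all names (`rule_penalty`, `fitness`, `mutate`, `reconstruct`, `rule_weight`) are assumptions introduced purely for illustration, and the example cognates are orthographic Romance words for "night" rather than items from the evaluation dataset.

```python
import random

VOWELS = set("aeiou")  # assumption: every other character counts as a consonant


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def rule_penalty(form: str) -> int:
    """Hypothetical rule: count positions that extend a consonant run to 3 or more."""
    penalty, run = 0, 0
    for ch in form:
        run = 0 if ch in VOWELS else run + 1
        if run >= 3:
            penalty += 1
    return penalty


def fitness(candidate: str, cognates: list[str], rule_weight: float = 2.0) -> float:
    """Higher is better: few edits to the cognates and few rule violations."""
    data_cost = sum(edit_distance(candidate, c) for c in cognates)
    return -(data_cost + rule_weight * rule_penalty(candidate))


def mutate(form: str, alphabet: list[str], rng: random.Random) -> str:
    """Apply one random deletion, insertion, or substitution."""
    op = rng.choice(["sub", "ins", "del"])
    if op == "del" and len(form) > 1:
        i = rng.randrange(len(form))
        return form[:i] + form[i + 1:]
    if op == "ins":
        i = rng.randrange(len(form) + 1)
        return form[:i] + rng.choice(alphabet) + form[i:]
    # Substitution; also the fallback when a deletion would empty the form.
    i = rng.randrange(len(form))
    return form[:i] + rng.choice(alphabet) + form[i + 1:]


def reconstruct(cognates: list[str], generations: int = 200,
                pop_size: int = 30, seed: int = 0) -> str:
    """Elitist evolutionary search seeded with the cognates themselves."""
    rng = random.Random(seed)
    alphabet = sorted(set("".join(cognates)))
    population = list(cognates)
    for _ in range(generations):
        children = [mutate(rng.choice(population), alphabet, rng)
                    for _ in range(pop_size)]
        pool = set(population) | set(children)
        # Keep the fittest candidates; ties are broken alphabetically.
        population = sorted(pool, key=lambda f: (-fitness(f, cognates), f))[:pop_size]
    return population[0]


if __name__ == "__main__":
    cognates = ["notte", "noche", "noite", "nuit", "noapte"]  # illustrative only
    best = reconstruct(cognates)
    print(best, fitness(best, cognates))
```

Because parents survive alongside their children, the best candidate never gets worse from one generation to the next. Raising `rule_weight` shifts the search toward forms that satisfy the constraint even at the cost of more edits to the observed cognates, which is the trade-off a hybrid of statistical and rule-based signals has to balance.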