A Modular Architecture for Typologically Controlled Lexicon Generation
The framework gives computational linguists a reproducible, controllable tool for building artificial lexicons for experiments, although the contribution is incremental, since it combines existing resources (PHOIBLE, OT/MaxEnt grammars) in a modular pipeline.
The authors propose a modular framework for generating typologically plausible, pronounceable artificial lexicons with explicit phonotactic and semantic control, demonstrating that probabilistic grammars outperform deterministic and random baselines on phonotactic coherence and typological realism across lexicon sizes of 100–5,000 forms.
Constructing artificial lexicons that are pronounceable, typologically plausible, and semantically structured remains an open challenge in computational linguistics. Existing conlang generators either lack formal phonotactic guarantees or delegate generation to opaque, non-reproducible LLM-based pipelines. We propose a modular framework that samples phoneme inventories from PHOIBLE, generates word forms under interchangeable phonological grammars (deterministic, OT, and MaxEnt), and assigns meanings via a Swadesh--Leipzig--Jakarta ontology with explicit form--meaning alignment. Evaluation on character $n$-gram perplexity, log-likelihood, and KL divergence against PHOIBLE across lexicon sizes of 100--5,000 forms shows that probabilistic grammars consistently outperform deterministic and random baselines on both phonotactic coherence and typological realism.
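To make the evaluation metrics named in the abstract concrete, the following is a minimal sketch of character $n$-gram perplexity and KL divergence. It is not the authors' implementation: the bigram order, the add-alpha smoothing, the boundary symbols, the epsilon smoothing in the KL term, and the function names `bigram_perplexity` and `kl_divergence` are all assumptions made for illustration, and the toy word lists and counts are invented rather than drawn from PHOIBLE.

```python
import math
from collections import Counter


def bigram_perplexity(train_words, test_words, alpha=1.0):
    """Character bigram perplexity of test_words under a model fit on train_words.

    Assumption: add-alpha smoothing and "<" / ">" word-boundary symbols;
    the abstract does not specify the n-gram order or the smoothing scheme.
    """
    def bigrams(word):
        chars = ["<"] + list(word) + [">"]
        return list(zip(chars, chars[1:]))

    # Symbol vocabulary over both sets, so unseen test symbols get nonzero mass.
    vocab = {c for w in list(train_words) + list(test_words) for c in w} | {"<", ">"}
    pair_counts = Counter(bg for w in train_words for bg in bigrams(w))
    context_counts = Counter(a for w in train_words for a, _ in bigrams(w))

    log_prob, n_events = 0.0, 0
    for w in test_words:
        for a, b in bigrams(w):
            p = (pair_counts[(a, b)] + alpha) / (context_counts[a] + alpha * len(vocab))
            log_prob += math.log(p)
            n_events += 1
    return math.exp(-log_prob / n_events)


def kl_divergence(p_counts, q_counts, eps=1e-9):
    """KL(P || Q) between two count dictionaries, e.g. segment frequencies.

    Assumption: eps smoothing over the union of keys to avoid log(0);
    which distributions the paper compares against PHOIBLE is not stated here.
    """
    keys = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + eps * len(keys)
    q_total = sum(q_counts.values()) + eps * len(keys)
    kl = 0.0
    for k in keys:
        p = (p_counts.get(k, 0) + eps) / p_total
        q = (q_counts.get(k, 0) + eps) / q_total
        kl += p * math.log(p / q)
    return kl


# Toy data, invented for illustration only.
train = ["pata", "taka", "kapa"]
coherent = ["pata"]      # follows the CV pattern seen in training
incoherent = ["aptk"]    # violates it
ppl_coherent = bigram_perplexity(train, coherent)
ppl_incoherent = bigram_perplexity(train, incoherent)

generated_freq = {"p": 10, "t": 8, "k": 6, "a": 20}
reference_freq = {"p": 9, "t": 9, "k": 7, "a": 18, "m": 5}
divergence = kl_divergence(generated_freq, reference_freq)
```

Under this sketch, lower perplexity indicates that held-out forms follow the same phonotactic patterns as the rest of the lexicon, and lower KL divergence indicates that the generated segment distribution is closer to the reference distribution.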