CL · Oct 3, 2025

Morpheme Induction for Emergent Language

CMU
arXiv:2510.03439v1 · 1 citation · h-index: 4 · EMNLP
Originality: Incremental advance
AI Analysis

This work addresses the challenge of analyzing emergent languages, a problem of interest to researchers in computational linguistics and AI. It is an incremental advance because it builds on existing methods for morpheme induction.

The paper tackles the problem of inducing morphemes from emergent language corpora. It introduces CSAR, a greedy algorithm that selects morphemes based on the mutual information between forms and meanings. The authors validate it on procedurally generated datasets and on human language data, where it makes reasonable predictions, and then use it to analyze linguistic characteristics of emergent languages such as synonymy and polysemy.

We introduce CSAR, an algorithm for inducing morphemes from emergent language corpora of parallel utterances and meanings. It is a greedy algorithm that (1) weights morphemes based on mutual information between forms and meanings, (2) selects the highest-weighted pair, (3) removes it from the corpus, and (4) repeats the process to induce further morphemes (i.e., Count, Select, Ablate, Repeat). The effectiveness of CSAR is first validated on procedurally generated datasets and compared against baselines for related tasks. Second, we validate CSAR's performance on human language data to show that the algorithm makes reasonable predictions in adjacent domains. Finally, we analyze a handful of emergent languages, quantifying linguistic characteristics like degree of synonymy and polysemy.
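
The Count, Select, Ablate, Repeat loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It makes several simplifying assumptions that do not come from the paper: candidate forms are single tokens, candidate meanings are single atoms, and the weight is the positive co-occurrence term of mutual information, p(f, m) · log(p(f, m) / (p(f) p(m))). The function name `induce_morphemes`, its parameters, and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def induce_morphemes(corpus, max_morphemes=10):
    """Greedy Count-Select-Ablate-Repeat sketch (simplified; see assumptions above).

    corpus: iterable of (form_tokens, meaning_atoms) pairs.
    Returns a list of (form, meaning) pairs in the order they were selected.
    """
    # Work on copies so the caller's corpus is left untouched.
    records = [(list(forms), set(meanings)) for forms, meanings in corpus]
    n = len(records)
    inventory = []

    for _ in range(max_morphemes):
        # Count: occurrences and co-occurrences, at most once per record.
        form_counts, meaning_counts, pair_counts = Counter(), Counter(), Counter()
        for forms, meanings in records:
            present = set(forms)
            form_counts.update(present)
            meaning_counts.update(meanings)
            pair_counts.update((f, m) for f in present for m in meanings)
        if not pair_counts:
            break

        def weight(pair):
            # Assumed weighting: p(f,m) * log(p(f,m) / (p(f) * p(m))).
            f, m = pair
            p_fm = pair_counts[pair] / n
            return p_fm * math.log(p_fm * n * n / (form_counts[f] * meaning_counts[m]))

        # Select: highest-weighted pair; sorting makes tie-breaking deterministic.
        best = max(sorted(pair_counts), key=weight)
        if weight(best) <= 0:
            break  # no remaining pair is positively associated
        inventory.append(best)

        # Ablate: remove the selected form and meaning wherever they co-occur.
        f, m = best
        for forms, meanings in records:
            if f in forms and m in meanings:
                forms[:] = [t for t in forms if t != f]
                meanings.discard(m)
        # Repeat: the loop recounts on the ablated corpus.

    return inventory


# Hypothetical toy language: "ka"/"mo" mark color, "ti"/"su" mark shape.
toy_corpus = [
    (["ka", "ti"], {"red", "circle"}),
    (["ka", "su"], {"red", "square"}),
    (["mo", "ti"], {"blue", "circle"}),
    (["mo", "su"], {"blue", "square"}),
]
inventory = induce_morphemes(toy_corpus)
print(inventory)
```

On this toy corpus the sketch recovers the four intended form-meaning pairs. Ablation is what makes the greedy loop work: once a pair is removed, the forms and meanings left in each record can only be paired with each other, so weaker associations become visible in later iterations.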

