CL, May 5

Rational Communication Shapes Morphological Composition

Fengyuan Yang, Yongqian Peng, Yuxi Ma, Chenheng Xu, Yixin Zhu

arXiv:2605.03510

AI Analysis

For linguists and cognitive scientists, this paper extends rational communication models from utterance-level choice to word-internal structure, offering an account of why a language settles on certain morpheme combinations over others.

This paper shows that morphological composition in English (compounds and derivations) is shaped by a trade-off between listener recoverability and speaker production cost, with attested compositions ranked above unattested alternatives across 4323 examples from 1820–2019. The Pragmatic Speaker model outperforms semantic-only and cost-only baselines, with MRR and top-k accuracy improvements growing as candidate sets expand.

Human languages expand vocabularies by combining existing morphemes rather than inventing arbitrary forms. Communicative efficiency shapes lexical systems at multiple levels (Gibson et al., 2019), yet morphological composition (combining morphemes through compounding or affixation) has rarely been modeled as a historically situated speaker choice among competing morpheme sequences, leaving unanswered why a language settles on one morpheme combination over other plausible alternatives. We ask whether a trade-off between listener recoverability and speaker production cost can predict attested compositions over contemporaneously available alternatives. Here we show, within the Rational Speech Act (RSA) framework (Frank & Goodman, 2012; Goodman & Frank, 2016) using a time-indexed lexicon constructed from the Corpus of Historical American English (COHA) and the Corpus of Contemporary American English (COCA), that across 4323 naturally occurring English compounds and derivations spanning 1820–2019, attested compositions are systematically ranked above unattested alternatives generated from contemporaneously available morphemes. Models integrating semantic informativeness with production cost outperform semantic-only and cost-only baselines on Mean Reciprocal Rank (MRR) and top-k accuracy (Acc@k), with the advantage of the Pragmatic Speaker model ($S_1$) over the semantic-only baseline growing as the candidate set expands, where meaning alone leaves morphological choice underdetermined. These findings suggest that lexicalization reflects a communicative trade-off between expressiveness and efficiency, extending rational accounts of communication from utterance-level choice to the internal structure of words.
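
To make the ranking setup in the abstract concrete, here is a minimal sketch of how a pragmatic speaker score and the two evaluation metrics could be computed. It assumes the textbook RSA form, S1(u | m) ∝ exp(α · (log L0(m | u) − cost(u))); the paper's actual parameterization, cost function, and candidate generation are not given here and may differ. The candidate forms, listener probabilities, and costs are invented toy values, and the function names are hypothetical.

```python
import math


def pragmatic_speaker(listener_probs, costs, alpha=1.0):
    """Textbook RSA speaker: P(u | m) proportional to
    exp(alpha * (log L0(m | u) - cost(u))).

    listener_probs maps each candidate form u to L0(m | u) > 0 for the
    intended meaning m; costs maps each candidate to a production cost.
    """
    utilities = {
        u: alpha * (math.log(listener_probs[u]) - costs[u])
        for u in listener_probs
    }
    # Subtract the maximum utility before exponentiating for stability.
    top = max(utilities.values())
    weights = {u: math.exp(v - top) for u, v in utilities.items()}
    total = sum(weights.values())
    return {u: w / total for u, w in weights.items()}


def rank_of(target, scores):
    """1-based rank of target when candidates are sorted by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(target) + 1


def mrr(ranks):
    """Mean Reciprocal Rank over the ranks of the attested forms."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def acc_at_k(ranks, k):
    """Fraction of items whose attested form is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


# Toy example (all numbers are assumptions, not data from the paper).
listener_probs = {"blackboard": 0.6, "darkwritingsurface": 0.7, "boardblack": 0.2}
costs = {"blackboard": 1.0, "darkwritingsurface": 2.5, "boardblack": 1.0}

s1 = pragmatic_speaker(listener_probs, costs)
semantic_rank = rank_of("blackboard", listener_probs)  # meaning only
pragmatic_rank = rank_of("blackboard", s1)             # meaning and cost
```

In this toy case the longer candidate is slightly more recoverable, so a semantic-only ranking places it first, while adding production cost moves the attested form to the top. That is the kind of case the abstract describes, where meaning alone leaves the choice underdetermined.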
