The Distribution of Phoneme Frequencies across the World's Languages: Macroscopic and Microscopic Information-Theoretic Models
This work addresses a fundamental problem in linguistics, the distribution of phoneme frequencies, by offering a comprehensive model. The contribution is incremental in that it builds on existing information-theoretic approaches.
The study sets out to explain phoneme frequency distributions across languages. It finds that macroscopic patterns follow a symmetric Dirichlet distribution whose single concentration parameter scales with phonemic inventory size, and that microscopic patterns are predicted by a Maximum Entropy model incorporating articulatory, phonotactic, and lexical constraints. Together, these results provide a unified information-theoretic account.
We demonstrate that the frequency distribution of phonemes across languages can be explained at both macroscopic and microscopic levels. Macroscopically, phoneme rank-frequency distributions closely follow the order statistics of a symmetric Dirichlet distribution whose single concentration parameter scales systematically with phonemic inventory size, revealing a robust compensation effect whereby larger inventories exhibit lower relative entropy. Microscopically, a Maximum Entropy model incorporating constraints from articulatory, phonotactic, and lexical structure accurately predicts language-specific phoneme probabilities. Together, these findings provide a unified information-theoretic account of phoneme frequency structure.
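The macroscopic claim can be illustrated with a small simulation. The following is a minimal sketch, not the authors' procedure: it estimates the expected rank-frequency curve of a symmetric Dirichlet distribution by Monte Carlo sampling and computes a normalized entropy. The function names, the sample count, and the concentration value used in the usage lines are hypothetical, and "relative entropy" is assumed here to mean Shannon entropy divided by its maximum, log N; the abstract does not define the term or give the scaling law for the concentration parameter.

```python
import numpy as np


def expected_rank_frequencies(n_phonemes, alpha, n_samples=2000, seed=0):
    """Monte Carlo estimate of the mean order statistics of a symmetric
    Dirichlet(alpha, ..., alpha) over n_phonemes categories.

    Returns frequencies sorted from most to least frequent rank.
    """
    rng = np.random.default_rng(seed)
    # Each row is one simulated phoneme frequency distribution.
    samples = rng.dirichlet(np.full(n_phonemes, alpha), size=n_samples)
    # Sort each row in descending order, then average rank by rank.
    ranked = np.sort(samples, axis=1)[:, ::-1]
    return ranked.mean(axis=0)


def relative_entropy(p):
    """Shannon entropy of p divided by log(len(p)).

    Assumption: this normalization stands in for the abstract's
    'relative entropy'; the source does not give the definition.
    """
    p = np.asarray(p, dtype=float)
    nonzero = p[p > 0]
    entropy = -np.sum(nonzero * np.log(nonzero))
    return entropy / np.log(len(p))


# Hypothetical usage: 20 phonemes, concentration parameter 1.0.
freqs = expected_rank_frequencies(20, 1.0)
rel_h = relative_entropy(freqs)
```

Comparing such a simulated curve against an observed rank-frequency list, for a range of concentration values, is one simple way to probe the fit the abstract describes; testing the compensation effect would additionally require the relation between the concentration parameter and inventory size, which is not stated here.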