Phonotactic Complexity and its Trade-offs
This work addresses cross-linguistic phonotactic analysis, a long-standing challenge for linguists, but its contribution is incremental because it builds on existing statistical models.
The paper tackles the problem of comparing phonotactic complexity across languages by introducing a bits-per-phoneme measure, and finds a strong negative correlation of -0.74 between bits per phoneme and average word length in a dataset of 1016 basic concept words across 106 languages.
We present methods for calculating a measure of phonotactic complexity, bits per phoneme, that permits a straightforward cross-linguistic comparison. Given a word, represented as a sequence of phonemic segments such as symbols in the International Phonetic Alphabet, and a statistical model trained on a sample of word types from the language, we can approximately measure bits per phoneme using the negative log-probability of that word under the model. This simple measure allows us to compare entropy across languages, giving insight into how complex a language's phonotactics are. Using a collection of 1016 basic concept words across 106 languages, we demonstrate a very strong negative correlation of -0.74 between bits per phoneme and the average length of words.
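
As a rough illustration of the measure described above, the following Python sketch scores a word by its negative base-2 log-probability under a model trained on word types and divides by the number of predicted symbols. Several choices here are assumptions made for illustration and are not taken from the paper: the add-one-smoothed bigram model stands in for whatever statistical model is actually used, the boundary symbols `<s>` and `</s>` and the function names `train_bigram` and `bits_per_phoneme` are hypothetical, and counting the end-of-word symbol in the normalization is one possible convention among several.

```python
import math
from collections import Counter

# Hypothetical boundary symbols (an assumption of this sketch).
BOW, EOW = "<s>", "</s>"


def train_bigram(words):
    """Collect bigram counts from word types given as lists of phonemes."""
    bigrams, contexts, vocab = Counter(), Counter(), {EOW}
    for word in words:
        seq = [BOW, *word, EOW]
        vocab.update(word)
        for prev, cur in zip(seq, seq[1:]):
            bigrams[prev, cur] += 1
            contexts[prev] += 1
    return bigrams, contexts, vocab


def bits_per_phoneme(word, model):
    """Negative log2-probability of `word`, averaged over predicted symbols.

    Uses add-one smoothing over the symbols seen in training. The divisor
    counts the end-of-word symbol, which is an assumption of this sketch.
    """
    bigrams, contexts, vocab = model
    seq = [BOW, *word, EOW]
    total_bits = 0.0
    for prev, cur in zip(seq, seq[1:]):
        prob = (bigrams[prev, cur] + 1) / (contexts[prev] + len(vocab))
        total_bits -= math.log2(prob)
    return total_bits / (len(word) + 1)


# Toy lexicon, invented for illustration only.
toy_lexicon = [["k", "a", "t"], ["k", "a", "p"], ["t", "a", "k"]]
model = train_bigram(toy_lexicon)
print(bits_per_phoneme(["k", "a", "t"], model))
print(bits_per_phoneme(["t", "p", "k"], model))
```

On this toy lexicon, a word that follows the attested consonant-vowel-consonant pattern receives fewer bits per phoneme than a consonant cluster the model has never seen. Averaging such per-word scores over a held-out sample gives a per-language figure that could then be compared with average word length.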