Phonological distances for linguistic typology and the origin of Indo-European languages

Marius Mavridis, Juan De Gregorio, Raul Toral, David Sanchez

arXiv:2604.11565

AI Analysis

For linguists and evolutionary biologists, the paper provides a new quantitative method for linguistic typology and for inferring language origins.

The paper shows that short-range phoneme dependencies capture large-scale patterns of linguistic relatedness, enabling a phonological distance matrix that recovers major language families and constrains a plausible homeland region for the Indo-European family, consistent with the Steppe hypothesis.

We show that short-range phoneme dependencies encode large-scale patterns of linguistic relatedness, with direct implications for quantitative typology and evolutionary linguistics. Specifically, using an information-theoretic framework, we argue that phoneme sequences modeled as second-order Markov chains essentially capture the statistical correlations of a phonological system. This finding enables us to quantify distances among 67 modern languages from a multilingual parallel corpus, employing a distance metric that incorporates articulatory features of phonemes. The resulting phonological distance matrix recovers major language families and reveals signatures of contact-induced convergence. Remarkably, we obtain a clear correlation with geographic distance, allowing us to constrain a plausible homeland region for the Indo-European family, consistent with the Steppe hypothesis.
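To make the second-order Markov idea concrete, here is a minimal Python sketch of how such a model could be estimated from a phoneme sequence. It is an illustration under stated assumptions, not the authors' implementation: the function names (`second_order_counts`, `transition_probs`, `conditional_entropy`) and the toy sequence are hypothetical, probabilities are plain empirical frequencies with no smoothing, and the paper's articulatory-feature distance metric is not reproduced, because the abstract does not specify it.

```python
from collections import Counter
from math import log2


def second_order_counts(phonemes):
    """Count trigrams (a, b, c) and their two-phoneme contexts (a, b)."""
    trigram_counts = Counter()
    context_counts = Counter()
    for a, b, c in zip(phonemes, phonemes[1:], phonemes[2:]):
        trigram_counts[(a, b, c)] += 1
        context_counts[(a, b)] += 1
    return trigram_counts, context_counts


def transition_probs(phonemes):
    """Empirical P(c | a, b): probability of phoneme c given the two before it."""
    trigram_counts, context_counts = second_order_counts(phonemes)
    return {
        tri: n / context_counts[tri[:2]]
        for tri, n in trigram_counts.items()
    }


def conditional_entropy(phonemes):
    """Empirical H(X_t | X_{t-2}, X_{t-1}) in bits (assumed illustrative quantity)."""
    trigram_counts, context_counts = second_order_counts(phonemes)
    total = sum(trigram_counts.values())
    entropy = 0.0
    for tri, n in trigram_counts.items():
        p_joint = n / total                  # P(a, b, c)
        p_cond = n / context_counts[tri[:2]]  # P(c | a, b)
        entropy -= p_joint * log2(p_cond)
    return entropy


if __name__ == "__main__":
    # Hypothetical toy "phoneme" sequence, one symbol per phoneme.
    toy = list("aabaac")
    print(transition_probs(toy))
    print(conditional_entropy(toy))
```

A low conditional entropy means the two preceding phonemes largely determine the next one. Comparing such second-order statistics across languages is the kind of input a phonological distance could be built on, although the specific metric used in the paper would have to be taken from the full text.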
