Juan De Gregorio

CL
h-index3
4papers
12citations
Novelty43%
AI Score38

4 Papers

CLAug 23, 2022
Ordinal analysis of lexical patterns

David Sanchez, Luciano Zunino, Juan De Gregorio et al.

Words are fundamental linguistic units that connect thoughts and things through meaning. However, words do not appear independently in a text sequence. The existence of syntactic rules induces correlations among neighboring words. Using an ordinal pattern approach, we present an analysis of lexical statistical connections for 11 major languages. We find that the diverse manners that languages utilize to express word relations give rise to unique pattern structural distributions. Furthermore, fluctuations of these pattern distributions for a given language can allow us to determine both the historical period when the text was written and its author. Taken together, our results emphasize the relevance of ordinal time series analysis in linguistic typology, historical linguistics and stylometry.

28.3CLApr 13
Phonological distances for linguistic typology and the origin of Indo-European languages

Marius Mavridis, Juan De Gregorio, Raul Toral et al.

We show that short-range phoneme dependencies encode large-scale patterns of linguistic relatedness, with direct implications for quantitative typology and evolutionary linguistics. Specifically, using an information-theoretic framework, we argue that phoneme sequences modeled as second-order Markov chains essentially capture the statistical correlations of a phonological system. This finding enables us to quantify distances among 67 modern languages from a multilingual parallel corpus employing a distance metric that incorporates articulatory features of phonemes. The resulting phonological distance matrix recovers major language families and reveals signatures of contact-induced convergence. Remarkably, we obtain a clear correlation with geographic distance, allowing us to constrain a plausible homeland region for the Indo-European family, consistent with the Steppe hypothesis.

DATA-ANOct 13, 2025
Information-theoretic analysis of temporal dependence in discrete stochastic processes: Application to precipitation predictability

Juan De Gregorio, David Sánchez, Raúl Toral

Understanding the temporal dependence of precipitation is key to improving weather predictability and developing efficient stochastic rainfall models. We introduce an information-theoretic approach to quantify memory effects in discrete stochastic processes and apply it to daily precipitation records across the contiguous United States. The method is based on the predictability gain, a quantity derived from block entropy that measures the additional information provided by higher-order temporal dependencies. This statistic, combined with a bootstrap-based hypothesis testing and Fisher's method, enables a robust memory estimator from finite data. Tests with generated sequences show that this estimator outperforms other model-selection criteria such as AIC and BIC. Applied to precipitation data, the analysis reveals that daily rainfall occurrence is well described by low-order Markov chains, exhibiting regional and seasonal variations, with stronger correlations in winter along the West Coast and in summer in the Southeast, consistent with known climatological patterns. Overall, our findings establish a framework for building parsimonious stochastic descriptions, useful when addressing spatial heterogeneity in the memory structure of precipitation dynamics, and support further advances in real-time, data-driven forecasting schemes.

CLMar 27, 2024
Exploring language relations through syntactic distances and geographic proximity

Juan De Gregorio, Raúl Toral, David Sánchez

Languages are grouped into families that share common linguistic traits. While this approach has been successful in understanding genetic relations between diverse languages, more analyses are needed to accurately quantify their relatedness, especially in less studied linguistic levels such as syntax. Here, we explore linguistic distances using series of parts of speech (POS) extracted from the Universal Dependencies dataset. Within an information-theoretic framework, we show that employing POS trigrams maximizes the possibility of capturing syntactic variations while being at the same time compatible with the amount of available data. Linguistic connections are then established by assessing pairwise distances based on the POS distributions. Intriguingly, our analysis reveals definite clusters that correspond to well known language families and groups, with exceptions explained by distinct morphological typologies. Furthermore, we obtain a significant correlation between language similarity and geographic distance, which underscores the influence of spatial proximity on language kinships.