CL Nov 15, 2024
Entropy and type-token ratio in gigaword corpora
Pablo Rosillo-Rodes, Maxi San Miguel, David Sanchez
There are different ways of measuring diversity in complex systems. In language in particular, lexical diversity is characterized in terms of the type-token ratio and the word entropy. Here we investigate both diversity metrics in six massive linguistic datasets in English, Spanish, and Turkish, consisting of books, news articles, and tweets. These gigaword corpora correspond to languages with distinct morphological features and differ in register and genre, thus constituting a varied testbed for a quantitative approach to lexical diversity. We unveil an empirical functional relation between the entropy and the type-token ratio of texts of a given corpus and language, which is a consequence of the statistical laws observed in natural language. Further, in the limit of large text lengths, we find an analytical expression for this relation, relying on both Zipf's and Heaps' laws, that agrees with our empirical findings.
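For readers unfamiliar with the two metrics, the following is a minimal sketch of how they are commonly computed for a tokenized text. It assumes the usual definitions: the type-token ratio is the number of distinct words divided by the total number of words, and the word entropy is the Shannon entropy of the word frequency distribution. The function names, the whitespace tokenization in the usage lines, and the choice of base-2 logarithms are assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def type_token_ratio(tokens):
    """Number of distinct words (types) divided by total words (tokens)."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    return len(set(tokens)) / len(tokens)


def word_entropy(tokens):
    """Shannon entropy, in bits, of the empirical word distribution."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    total = len(tokens)
    counts = Counter(tokens)
    # H = -sum_w p(w) * log2 p(w), with p(w) the relative frequency of w.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy usage with a naive whitespace tokenizer (illustrative only).
tokens = "the cat and the dog and the bird".split()
ttr = type_token_ratio(tokens)
entropy = word_entropy(tokens)
```

Both quantities depend on text length, which is why comparisons between texts are only meaningful at comparable lengths or through a length-aware relation such as the one the abstract describes.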
CL May 9, 2024
Computational lexical analysis of Flamenco genres
Pablo Rosillo-Rodes, Maxi San Miguel, David Sanchez
Flamenco, recognized by UNESCO as part of the Intangible Cultural Heritage of Humanity, is a profound expression of cultural identity rooted in Andalusia, Spain. However, there is a lack of quantitative studies that help identify characteristic patterns in this long-lived music tradition. In this work, we present a computational analysis of Flamenco lyrics, employing natural language processing and machine learning to categorize over 2000 lyrics into their respective Flamenco genres, termed $\textit{palos}$. Using a Multinomial Naive Bayes classifier, we find that lexical variation across styles enables accurate identification of distinct $\textit{palos}$. More importantly, using an automatic method based on word usage, we obtain the semantic fields that characterize each style. Further, by applying a metric that quantifies the inter-genre distance, we perform a network analysis that sheds light on the relationship between Flamenco styles. Remarkably, our results suggest historical connections and $\textit{palo}$ evolutions. Overall, our work illuminates the intricate relationships and cultural significance embedded within Flamenco lyrics, complementing previous qualitative discussions with quantitative analyses and sparking new discussions on the origin and development of traditional music genres.
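To illustrate the kind of classifier the abstract mentions, here is a minimal, dependency-free sketch of a Multinomial Naive Bayes text classifier with Laplace smoothing. It is not the authors' pipeline: the function names `train_mnb` and `predict_mnb`, the smoothing value `alpha=1.0`, the choice to ignore words unseen in training, and the toy documents and labels are all hypothetical and chosen only to show the mechanics.

```python
import math
from collections import Counter, defaultdict


def train_mnb(docs, labels, alpha=1.0):
    """Fit per-class log priors and smoothed per-word log likelihoods.

    docs is a list of token lists; labels is a parallel list of class names.
    """
    vocab = {word for doc in docs for word in doc}
    docs_per_class = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    model = {}
    for label, n_class_docs in docs_per_class.items():
        total_words = sum(word_counts[label].values())
        denom = total_words + alpha * len(vocab)  # Laplace-smoothed normalizer
        log_prior = math.log(n_class_docs / len(docs))
        log_likelihood = {
            word: math.log((word_counts[label][word] + alpha) / denom)
            for word in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model


def predict_mnb(model, doc):
    """Return the class with the highest posterior log score for doc."""
    def score(label):
        log_prior, log_likelihood = model[label]
        # Words never seen in training are skipped (an assumption of this sketch).
        return log_prior + sum(log_likelihood[w] for w in doc if w in log_likelihood)
    return max(model, key=score)


# Hypothetical toy data: two made-up classes, not real Flamenco lyrics.
toy_docs = [["sea", "wave", "sea"], ["mine", "coal", "mine"]]
toy_labels = ["class_a", "class_b"]
toy_model = train_mnb(toy_docs, toy_labels)
predicted = predict_mnb(toy_model, ["sea", "sea"])
```

Because the model scores each class through per-word likelihoods, the words with the largest likelihood differences between classes give a rough view of the vocabulary that distinguishes one class from another.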