Hiram Ring

CL
h-index: 1
3 papers
3 citations
Novelty: 32%
AI Score: 27

3 Papers

CL | May 18, 2025
The taggedPBC: Annotating a massive parallel corpus for crosslinguistic investigations

Hiram Ring

Existing datasets available for crosslinguistic investigations have tended to focus on large amounts of data for a small group of languages or a small amount of data for a large number of languages. This means that claims based on these datasets are limited in what they reveal about universal properties of the human language faculty. While this has begun to change through the efforts of projects seeking to develop tagged corpora for a large number of languages, such efforts are still constrained by limits on resources. The current paper reports on a large tagged parallel dataset which has been developed to partially address this issue. The taggedPBC contains POS-tagged parallel text data from more than 1,940 languages, representing 155 language families and 78 isolates, dwarfing previously available resources. The accuracy of particular tags in this dataset is shown to correlate well with both existing SOTA taggers for high-resource languages (SpaCy, Trankit) and hand-tagged corpora (Universal Dependencies Treebanks). Additionally, a novel measure derived from this dataset, the N1 ratio, correlates with expert determinations of intransitive word order in three typological databases (WALS, Grambank, Autotyp), such that a Gaussian Naive Bayes classifier trained on this feature can accurately identify basic intransitive word order for languages not in those databases. While much work is still needed to expand and develop this dataset, the taggedPBC is an important step toward enabling corpus-based crosslinguistic investigations, and is made available for research and collaboration via GitHub.
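To make the classification step concrete, here is a minimal sketch of a Gaussian Naive Bayes classifier over a single numeric feature, written with the Python standard library only. It is not the paper's implementation. The abstract does not define the N1 ratio, so the feature values below are invented placeholders, and the mapping of low or high values to the labels "VS" and "SV" is arbitrary. The function names `fit_gaussian_nb` and `predict` are hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(values, labels):
    """Estimate a prior, mean, and variance per class for one numeric feature."""
    grouped = defaultdict(list)
    for value, label in zip(values, labels):
        grouped[label].append(value)

    total = len(values)
    model = {}
    for label, xs in grouped.items():
        mean = sum(xs) / len(xs)
        # Population variance, plus a small floor so it is never zero.
        var = sum((x - mean) ** 2 for x in xs) / len(xs) + 1e-9
        model[label] = (len(xs) / total, mean, var)
    return model


def predict(model, value):
    """Return the class whose log prior plus Gaussian log likelihood is highest."""
    def log_posterior(params):
        prior, mean, var = params
        log_likelihood = (
            -0.5 * math.log(2 * math.pi * var)
            - (value - mean) ** 2 / (2 * var)
        )
        return math.log(prior) + log_likelihood

    return max(model, key=lambda label: log_posterior(model[label]))


# Invented toy values: NOT taken from the taggedPBC or any typological database.
toy_feature = [0.2, 0.3, 0.25, 0.35, 1.8, 2.1, 1.9, 2.4]
toy_labels = ["VS", "VS", "VS", "VS", "SV", "SV", "SV", "SV"]

model = fit_gaussian_nb(toy_feature, toy_labels)
print(predict(model, 0.28), predict(model, 2.0))
```

With one feature, the classifier reduces to comparing two class-conditional normal densities weighted by class frequency. That is why a single well-separated measure could in principle be enough to label languages missing from the databases.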

CLMay 20, 2025
Word length predicts word order: "Min-max"-ing drives language evolution

Hiram Ring

Current theories of language propose an innate (Baker 2001; Chomsky 1981) or a functional (Greenberg 1963; Dryer 2007; Hawkins 2014) origin for the surface structures (i.e. word order) that we observe in languages of the world, while evolutionary modeling (Dunn et al. 2011) suggests that descent is the primary factor influencing such patterns. Although there are hypotheses for word order change from both innate and usage-based perspectives for specific languages and families, there are key disagreements between the two major proposals for mechanisms that drive the evolution of language more broadly (Wasow 2002; Levy 2008). This paper proposes a universal underlying mechanism for word order change based on a large tagged parallel dataset of over 1,500 languages representing 133 language families and 111 isolates. Results indicate that word class length is significantly correlated with word order crosslinguistically, but not in a straightforward manner, partially supporting opposing theories of processing, while at the same time predicting historical word order change in two different phylogenetic lines and explaining more variance than descent or language area in regression models. Such findings suggest an integrated "Min-Max" theory of language evolution driven by competing pressures of processing and information structure, aligning with recent efficiency-oriented (Levshina 2023) and information-theoretic proposals (Zaslavsky 2020; Tucker et al. 2025).

CL | Sep 23, 2025
Are most sentences unique? An empirical examination of Chomskyan claims

Hiram Ring

A repeated claim in linguistics is that the majority of linguistic utterances are unique. For example, Pinker (1994: 10), summarizing an argument by Noam Chomsky, states that "virtually every sentence that a person utters or understands is a brand-new combination of words, appearing for the first time in the history of the universe." With the increased availability of large corpora, this is a claim that can be empirically investigated. The current paper addresses the question by using the NLTK Python library to parse corpora of different genres, providing counts of exact string matches in each. Results show that while completely unique sentences often make up the majority of a corpus, this proportion is highly constrained by genre, and duplicate sentences are not an insignificant part of any individual corpus.
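The counting procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's code. A naive regular-expression splitter stands in for NLTK's sentence tokenizer so that the example runs without external downloads. The helper names `split_sentences` and `uniqueness_stats` and the sample text are hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive stand-in for a real sentence tokenizer:
    split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def uniqueness_stats(sentences):
    """Count exact string matches and report how many sentences occur only once."""
    counts = Counter(sentences)
    total = len(sentences)
    unique = sum(1 for count in counts.values() if count == 1)
    return {
        "total": total,
        "unique": unique,                # sentences seen exactly once
        "duplicated": total - unique,    # sentence tokens that repeat some other token
        "unique_share": unique / total if total else 0.0,
    }


# Made-up sample text, for illustration only.
sample = "Thank you. I saw a heron by the canal. Thank you. Where did it go?"
stats = uniqueness_stats(split_sentences(sample))
print(stats)
```

Exact string matching is sensitive to casing, punctuation, and whitespace, so any normalization applied before counting (none in this sketch) would change the resulting share of unique sentences.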