CL · Mar 3, 2025

Annotating and Inferring Compositional Structures in Numeral Systems Across Languages

arXiv:2503.01625v2 · 2 citations · h-index: 8 · Proceedings of the 7th Workshop on Research in Computational Linguistic Typology and Multilingual NLP
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for standardized annotation in linguistic typology, but it is incremental, building on existing methods for numeral analysis.

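The authors address the problem of comparing numeral systems across languages by developing a standardized coding scheme and a computer-assisted annotation workflow. Analyzing numerals from 1 to 40 in 25 typologically diverse languages, they find that allomorphy is the major cause of errors in automated morpheme segmentation and that subword tokenization algorithms are not viable for discovering morphemes in low-resource scenarios.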

Numeral systems across the world's languages vary in fascinating ways, both in their synchronic structure and in the diachronic processes that shaped how they evolved into their current form. For a proper comparison of numeral systems across different languages, however, it is important to code them in a standardized form that allows for the comparison of basic properties. Here, we present a simple but effective coding scheme for numeral annotation, along with a workflow that helps to code numeral systems in a computer-assisted manner, providing sample data for numerals from 1 to 40 in 25 typologically diverse languages. We perform a thorough analysis of the sample, focusing on the systematic comparison between the underlying and the surface morphological structure. We further experiment with automated models for morpheme segmentation, where we find that allomorphy is the major source of segmentation errors. Finally, we show that subword tokenization algorithms are not viable for discovering morphemes in low-resource scenarios.
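
To make the contrast between underlying and surface structure concrete, here is a minimal Python sketch. It is not the paper's coding scheme: the hyphen-based segmentation, the gloss labels (`THREE`, `TEN`, ...), and the helper `allomorphs` are hypothetical choices made for illustration, applied to a handful of English numerals. It groups surface morphemes by their underlying gloss, which exposes the kind of allomorphy the abstract names as the main source of segmentation errors.

```python
from collections import defaultdict

# Hypothetical annotation format (an assumption, not the paper's scheme):
# value -> (surface form segmented with "-", one underlying gloss per morpheme)
NUMERALS = {
    3: ("three", ["THREE"]),
    5: ("five", ["FIVE"]),
    10: ("ten", ["TEN"]),
    13: ("thir-teen", ["THREE", "TEN"]),
    15: ("fif-teen", ["FIVE", "TEN"]),
    30: ("thir-ty", ["THREE", "TEN"]),
}


def allomorphs(annotated):
    """Collect every surface form attested for each underlying gloss."""
    forms = defaultdict(set)
    for value, (segmented, glosses) in annotated.items():
        morphemes = segmented.split("-")
        if len(morphemes) != len(glosses):
            raise ValueError(f"morpheme/gloss mismatch for numeral {value}")
        for morpheme, gloss in zip(morphemes, glosses):
            forms[gloss].add(morpheme)
    # Sort the surface variants so the output is deterministic.
    return {gloss: sorted(variants) for gloss, variants in forms.items()}


result = allomorphs(NUMERALS)
for gloss, variants in result.items():
    print(gloss, variants)
```

In this toy sample each underlying element surfaces in more than one shape (for example `three` and `thir` for `THREE`), so a segmenter that relies on recurring identical strings has no direct evidence that the variants belong together.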

Code Implementations: 1 repo