CL · AI · Dec 13, 2025

Adversarially Probing Cross-Family Sound Symbolism in 27 Languages

arXiv:2512.12245v1
Originality: Incremental advance
AI Analysis

This paper addresses the need for large-scale empirical evidence on sound symbolism in linguistics and provides a dataset and tools for future studies. It is rated incremental because it applies computational methods to an established phenomenon.

The paper tests sound symbolism at scale through what it describes as the first computational cross-linguistic analysis of size semantics, covering 27 languages. Phonological form predicts size above chance even across unrelated languages. After adversarial scrubbing of language identity, language prediction falls below chance while size prediction remains significantly above chance.

The phenomenon of sound symbolism, the non-arbitrary mapping between word sounds and meanings, has long been demonstrated through anecdotal experiments like Bouba-Kiki, but rarely tested at scale. We present the first computational cross-linguistic analysis of sound symbolism in the semantic domain of size. We compile a typologically broad dataset of 810 adjectives (27 languages, 30 words each), each phonemically transcribed and validated with native-speaker audio. Using interpretable classifiers over bag-of-segment features, we find that phonological form predicts size semantics above chance even across unrelated languages, with both vowels and consonants contributing. To probe universality beyond genealogy, we train an adversarial scrubber that suppresses language identity while preserving the size signal; we also run it at family granularity. Language prediction averaged across languages and settings falls below chance while size prediction remains significantly above chance, indicating cross-family sound-symbolic bias. We release data, code, and diagnostic tools for future large-scale studies of iconicity.
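
The abstract mentions "bag-of-segment features" without spelling out the representation. A minimal sketch of one plausible reading follows: each word becomes a vector of phoneme counts over a shared vocabulary, ignoring segment order. The function names, the space-separated transcription format, and the two toy items are assumptions made for illustration; they are not taken from the paper's released code or dataset.

```python
from collections import Counter


def bag_of_segments(transcription):
    """Count each segment in a space-separated phonemic transcription.

    Assumption: segments are already tokenized and separated by spaces.
    """
    return Counter(transcription.split())


def vectorize(transcriptions):
    """Turn transcriptions into count vectors over a shared, sorted vocabulary."""
    bags = [bag_of_segments(t) for t in transcriptions]
    vocab = sorted(set().union(*bags))
    # Counter returns 0 for segments a word does not contain.
    rows = [[bag[seg] for seg in vocab] for bag in bags]
    return vocab, rows


# Hypothetical toy items, not drawn from the paper's 810-adjective dataset.
toy = ["t i n i", "h u m o ŋ g o"]
vocab, X = vectorize(toy)
```

Vectors of this kind could then be fed to any interpretable linear classifier, so that each learned weight corresponds to a single segment.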

