Vocabulary shapes cross-lingual variation of word-order learnability in language models

Jonas Mayer Martins, Jaap Jumelet, Viola Priesemann, Lisa Beinborn

arXiv:2603.19427 | 68.2 | h-index: 35

AI Analysis

This work addresses a fundamental question in computational linguistics: why word-order learnability varies across languages. Its contribution is incremental, since it builds on existing transformer models and synthetic-data approaches.

The researchers investigated why some languages allow free word order while others do not by pretraining transformer language models on synthetic word-order variants of natural languages. They found that greater word-order irregularity increases model surprisal, indicating reduced learnability, and that vocabulary structure, rather than a simple free/fixed distinction, strongly predicts this variation.

Why do some languages like Czech permit free word order, while others like English do not? We address this question by pretraining transformer language models on a spectrum of synthetic word-order variants of natural languages. We observe that greater word-order irregularity consistently raises model surprisal, indicating reduced learnability. Sentence reversal, however, affects learnability only weakly. A coarse distinction of free- (e.g., Czech and Finnish) and fixed-word-order languages (e.g., English and French) does not explain cross-lingual variation. Instead, the structure of the word and subword vocabulary strongly predicts the model surprisal. Overall, vocabulary structure emerges as a key driver of computational word-order learnability across languages.
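
To make the two ingredients of the abstract concrete, the following is a minimal Python sketch of (a) producing word-order variants of a tokenized sentence and (b) computing mean surprisal from per-token probabilities. The abstract does not specify how the synthetic variants are generated, so the function names (`reverse_sentence`, `shuffle_within_windows`, `mean_surprisal`) and the window-based shuffling scheme are illustrative assumptions, not the authors' procedure. The probabilities would come from a trained language model; here they are passed in as plain numbers.

```python
import math
import random


def reverse_sentence(tokens):
    """Return the tokens in reversed order (a full sentence reversal)."""
    return tokens[::-1]


def shuffle_within_windows(tokens, window, rng):
    """Shuffle tokens inside consecutive fixed-size windows.

    Hypothetical perturbation scheme (an assumption, not taken from the paper):
    window=1 leaves the order unchanged, larger windows allow more irregularity.
    """
    out = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]  # slicing copies, input is untouched
        rng.shuffle(chunk)
        out.extend(chunk)
    return out


def mean_surprisal(token_probs):
    """Mean surprisal in bits: the average of -log2 p over the token probabilities."""
    return sum(-math.log2(p) for p in token_probs) / len(token_probs)


if __name__ == "__main__":
    sentence = ["the", "dog", "chased", "the", "cat"]
    rng = random.Random(0)  # fixed seed so the variant is reproducible
    print(reverse_sentence(sentence))
    print(shuffle_within_windows(sentence, window=3, rng=rng))
    print(mean_surprisal([0.5, 0.25]))
```

Under this definition, a model that assigns lower probabilities to the tokens of a perturbed corpus has higher mean surprisal, which is the sense in which the abstract reads raised surprisal as reduced learnability.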
