CL AI May 26

An In-Vitro Study on Cross-Lingual Generalization in Language Models

arXiv:2605.2668372.2

AI Analysis

This work provides a controlled methodology to disentangle factors affecting cross-lingual transfer, offering insights for improving multilingual model design.

The paper introduces an in-vitro framework for studying cross-lingual transfer in language models. It finds that transfer depends less on lexical similarity or tokenizer balance than on whether tokenization preserves reusable cross-lingual substructure, and that smaller vocabularies often improve masked transfer.

Cross-lingual transfer in language models is difficult to study in natural corpora because lexical overlap, morphology, data imbalance, and tokenization are entangled. We introduce an in-vitro framework with two procedurally generated languages that share the same ontology, typed grammar, and compositional structure, but differ in surface realization. This lets us independently vary lexical distance, minority-language proportion, tokenizer training regime, and vocabulary size, while evaluating transfer on a masked minority-language condition whose lexical forms are never observed during training. Across 700 controlled runs, we find that transfer is governed less by tokenizer balance or raw lexical similarity than by whether tokenization preserves reusable cross-lingual substructure. Smaller vocabularies often improve masked transfer by keeping words decomposable into shared fragments, whereas larger vocabularies can turn forms into language-specific atoms. We further show that transfer emerges as a staged process: grammatical and type-level competence precede masked lexical generalization. Finally, we attempt to explain this mechanism through tokenizer bridges and show that bridge strength correlates strongly with masked reachability.
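The vocabulary-size effect described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the four made-up words, the two hand-written vocabularies, the greedy longest-match segmenter, and the Jaccard overlap used as a stand-in for shared substructure. The abstract does not define how tokenizer bridges or bridge strength are computed, so this is not the paper's metric.

```python
def greedy_tokenize(word, vocab):
    """Greedy longest-match segmentation; unknown spans fall back to single characters."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking down to one character.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


def shared_token_fraction(words_a, words_b, vocab):
    """Jaccard overlap of the token sets two word lists produce (hypothetical proxy)."""
    tokens_a = {t for w in words_a for t in greedy_tokenize(w, vocab)}
    tokens_b = {t for w in words_b for t in greedy_tokenize(w, vocab)}
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


# Two made-up "languages" built from the same fragments in different orders.
lang_a = ["tako", "tami"]
lang_b = ["kota", "mita"]

# A small vocabulary holds only fragments; a large one also holds whole words.
small_vocab = {"ta", "ko", "mi"}
large_vocab = small_vocab | set(lang_a) | set(lang_b)

small_overlap = shared_token_fraction(lang_a, lang_b, small_vocab)
large_overlap = shared_token_fraction(lang_a, lang_b, large_vocab)
print(small_overlap, large_overlap)
```

In this toy setup the small vocabulary segments both languages into the same fragments, while the large vocabulary assigns each word its own language-specific token, so the two token sets no longer intersect.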
