CL | Oct 22, 2025

Modality Matching Matters: Calibrating Language Distances for Cross-Lingual Transfer in URIEL+

York Hay Ng, Aditya Khan, Xiang Lu, Matteo Salloum, Michael Zhou, Phuong H. Hoang, A. Seza Doğruöz, En-Shiun Annie Lee

U of Toronto

arXiv:2510.19217v1 | 4.91 citations | h-index: 2

Originality: Incremental advance

AI Analysis

This work addresses limitations in cross-lingual transfer tools for multilingual NLP research, offering incremental improvements over existing methods.

The paper addresses two problems in existing linguistic knowledge bases: vector representations that are ill-suited to the underlying linguistic data, and the lack of a principled method for aggregating distance signals for cross-lingual transfer. It introduces a framework with structure-aware representations and a composite distance, and reports improved performance across NLP tasks.

Existing linguistic knowledge bases such as URIEL+ provide valuable geographic, genetic and typological distances for cross-lingual transfer but suffer from two key limitations. One, their one-size-fits-all vector representations are ill-suited to the diverse structures of linguistic data, and two, they lack a principled method for aggregating these signals into a single, comprehensive score. In this paper, we address these gaps by introducing a framework for type-matched language distances. We propose novel, structure-aware representations for each distance type: speaker-weighted distributions for geography, hyperbolic embeddings for genealogy, and a latent variables model for typology. We unify these signals into a robust, task-agnostic composite distance. In selecting transfer languages, our representations and composite distances consistently improve performance across a wide range of NLP tasks, providing a more principled and effective toolkit for multilingual research.
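To make the idea of hyperbolic embeddings for genealogy more concrete, here is a minimal sketch of the standard Poincaré-ball distance, a common way to measure distance between points embedded in hyperbolic space. The abstract does not say which hyperbolic model or distance the authors use, so the choice of the Poincaré ball and the function name `poincare_distance` are assumptions made purely for illustration.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball.

    Illustrative only: the Poincare-ball model is an assumption here,
    not necessarily the model used in the paper.
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    # Squared Euclidean distance between the two points.
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))
```

Distances in this model grow rapidly as points approach the boundary of the ball, which is why hyperbolic space is often considered a natural fit for tree-like data such as language family trees: points near the centre can play the role of ancestors, with descendants spread toward the boundary.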
