CL · Oct 31, 2025

Simple Additions, Substantial Gains: Expanding Scripts, Languages, and Lineage Coverage in URIEL+

Mason Shipton, York Hay Ng, Aditya Khan, Phuong Hanh Hoang, Xiang Lu, A. Seza Doğruöz, En-Shiun Annie Lee

arXiv:2510.27183v1 · 2.7 · h-index: 2

Originality: Synthesis-oriented

AI Analysis

This incremental work addresses data limitations for researchers using URIEL+ in cross-lingual transfer, particularly benefiting low-resource language studies.

The paper tackles data sparsity in the URIEL+ linguistic knowledge base by adding script vectors, integrating Glottolog to cover more languages, and expanding lineage imputation. The reported results are a 14% reduction in feature sparsity for script vectors, up to 19,015 additional languages, and up to 33% improvement in imputation quality metrics.

The URIEL+ linguistic knowledge base supports multilingual research by encoding languages through geographic, genetic, and typological vectors. However, data sparsity remains prevalent, in the form of missing feature types, incomplete language entries, and limited genealogical coverage. This limits the usefulness of URIEL+ in cross-lingual transfer, particularly for supporting low-resource languages. To address this sparsity, this paper extends URIEL+ with three contributions: introducing script vectors to represent writing system properties for 7,488 languages, integrating Glottolog to add 18,710 additional languages, and expanding lineage imputation for 26,449 languages by propagating typological and script features across genealogies. These additions reduce feature sparsity by 14% for script vectors, increase language coverage by up to 19,015 languages (1,007%), and improve imputation quality metrics by up to 33%. Our benchmark on cross-lingual transfer tasks (oriented around low-resource languages) shows occasionally divergent performance compared to URIEL+, with performance gains up to 6% in certain setups. Our advances make URIEL+ more complete and inclusive for multilingual research.
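The idea of propagating features across genealogies can be illustrated with a small sketch. This is not the URIEL+ implementation or its API: the toy tree, the feature names (`uses_latin_script`, `sov_order`), and the nearest-ancestor rule are all assumptions chosen for illustration, and the paper may use a different propagation strategy.

```python
from typing import Dict, Optional

# Hypothetical toy genealogy: child -> parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "family": None,
    "branch_a": "family",
    "lang_1": "branch_a",
    "lang_2": "branch_a",
}

# Hypothetical feature table; None marks a missing value.
FEATURES: Dict[str, Dict[str, Optional[int]]] = {
    "family": {"uses_latin_script": 1, "sov_order": None},
    "branch_a": {"uses_latin_script": None, "sov_order": 0},
    "lang_1": {"uses_latin_script": None, "sov_order": None},
    "lang_2": {"uses_latin_script": 0, "sov_order": None},
}


def impute_from_lineage(lang: str, feature: str) -> Optional[int]:
    """Return the value from the nearest node on the path to the root
    (starting at the language itself) that has the feature filled in."""
    node: Optional[str] = lang
    while node is not None:
        value = FEATURES.get(node, {}).get(feature)
        if value is not None:
            return value
        node = PARENT.get(node)
    return None  # nothing known anywhere along the lineage


def sparsity(table: Dict[str, Dict[str, Optional[int]]]) -> float:
    """Fraction of cells in the table that are missing."""
    cells = [v for row in table.values() for v in row.values()]
    return sum(v is None for v in cells) / len(cells)


# Fill every cell using the assumed nearest-ancestor rule.
imputed = {
    lang: {f: impute_from_lineage(lang, f) for f in feats}
    for lang, feats in FEATURES.items()
}

print(f"sparsity before: {sparsity(FEATURES):.3f}")
print(f"sparsity after:  {sparsity(imputed):.3f}")
```

In this toy table the only cell left empty after imputation is `sov_order` at the root, which has no ancestor to inherit from. An observed value is never overwritten, so `lang_2` keeps its own `uses_latin_script` value even though its ancestors disagree.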
