Less is More: The Effectiveness of Compact Typological Language Representations
This work addresses the challenge of modeling cross-lingual relationships for NLP applications, with particular benefit for low-resource languages. The contribution is incremental, however: it applies optimization techniques to the existing URIEL+ data rather than introducing a new resource.
The authors tackle the high dimensionality and sparsity of linguistic feature datasets such as URIEL+, which limit the effectiveness of distance metrics, by proposing a pipeline that combines feature selection and imputation to create compact typological representations. The results show that these reduced-size representations yield more informative distance metrics and improve performance in multilingual NLP applications.
Linguistic feature datasets such as URIEL+ are valuable for modelling cross-lingual relationships, but their high dimensionality and sparsity, especially for low-resource languages, limit the effectiveness of distance metrics. We propose a pipeline to optimize the URIEL+ typological feature space by combining feature selection and imputation, producing compact yet interpretable typological representations. We evaluate these feature subsets on linguistic distance alignment and downstream tasks, demonstrating that reduced-size representations of language typology can yield more informative distance metrics and improve performance in multilingual NLP applications.
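To make the idea of the pipeline concrete, the following is a minimal sketch of the general pattern (impute missing values, keep a subset of features, compute distances on the compact vectors). It is not the authors' method: the toy matrix, the mean imputation, the variance-based selection, the ordering of the two steps, and the function names `impute_feature_means`, `select_top_variance`, and `cosine_distance` are all assumptions chosen for illustration, and no real URIEL+ data or API is used.

```python
import numpy as np


def impute_feature_means(X):
    """Fill NaN entries with the per-feature mean of the observed values.

    Assumption: mean imputation stands in for whatever imputation
    method the paper actually uses.
    """
    X = np.asarray(X, dtype=float)
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)


def select_top_variance(X, k):
    """Return the sorted indices of the k highest-variance features.

    Assumption: variance ranking is a placeholder selection criterion.
    """
    variances = X.var(axis=0)
    top = np.argsort(variances)[::-1][:k]
    return np.sort(top)


def cosine_distance(u, v):
    """Cosine distance between two vectors (1.0 if either is all zeros)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0:
        return 1.0
    return 1.0 - float(np.dot(u, v) / denom)


# Hypothetical toy data: 4 languages x 5 binary typological features,
# with NaN marking features that are unknown for a language.
X = np.array([
    [1.0, 0.0, np.nan, 1.0, 1.0],
    [1.0, np.nan, 0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 1.0, np.nan],
    [np.nan, 1.0, 1.0, 1.0, 0.0],
])

X_imputed = impute_feature_means(X)

# Feature 3 has the same value for every language, so it carries no
# information for distances and is dropped when keeping 4 features.
selected = select_top_variance(X_imputed, k=4)
X_compact = X_imputed[:, selected]

# Pairwise distances computed on the compact representation.
n = X_compact.shape[0]
distances = np.array([
    [cosine_distance(X_compact[i], X_compact[j]) for j in range(n)]
    for i in range(n)
])
```

In this sketch the constant feature contributes nothing to distinguishing languages, which is a small-scale analogue of the argument that a reduced feature set can give more informative distances than the full, sparse one.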