CL AI
May 29, 2025

SenWiCh: Sense-Annotation of Low-Resource Languages for WiC using Hybrid Methods

arXiv:2505.23714v2
6 citations
h-index: 15
Proceedings of the 7th Workshop on Research in Computational Linguistic Typology and Multilingual NLP
Originality: Incremental advance
AI Analysis

The paper addresses the shortage of benchmarks for low-resource languages in NLP, enabling more effective polysemy disambiguation and transfer studies. The advance is incremental because it builds on the existing WiC format.

This paper tackles the lack of high-quality evaluation datasets for low-resource languages in cross-lingual transfer by releasing sense-annotated datasets for ten low-resource languages and demonstrating their utility in Word-in-Context experiments.

This paper addresses the critical need for high-quality evaluation datasets in low-resource languages to advance cross-lingual transfer. Cross-lingual transfer is a key strategy for leveraging multilingual pretraining to extend language technologies to understudied and typologically diverse languages, but its effectiveness depends on the availability of high-quality, suitable benchmarks. We release new sense-annotated datasets of sentences containing polysemous words, spanning ten low-resource languages across diverse language families and scripts. To facilitate dataset creation, we present a semi-automatic annotation method that is shown to be beneficial. We demonstrate the utility of the datasets through experiments in the Word-in-Context (WiC) format that evaluate transfer on these low-resource languages. The results highlight the importance of targeted dataset creation and evaluation for effective polysemy disambiguation in low-resource settings and in transfer studies. The released datasets and code aim to support further research into fair, robust, and truly multilingual NLP.
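To make the WiC format concrete, here is a minimal Python sketch of how sense-annotated sentences could be turned into WiC-style pairs: two sentences that share a target word, labeled by whether the word carries the same sense in both. The function name `build_wic_pairs`, the field names, the sense identifiers, and the English toy sentences are all assumptions made for illustration. They are not the released data format or the authors' code.

```python
from collections import defaultdict
from itertools import combinations


def build_wic_pairs(annotated):
    """Turn (lemma, sentence, sense_id) triples into WiC-style pairs.

    Hypothetical helper: every two sentences that share a lemma form one
    pair, labeled True when both carry the same sense annotation.
    """
    by_lemma = defaultdict(list)
    for lemma, sentence, sense_id in annotated:
        by_lemma[lemma].append((sentence, sense_id))

    pairs = []
    for lemma, items in by_lemma.items():
        # combinations() yields each unordered pair of sentences exactly once.
        for (sent1, sense1), (sent2, sense2) in combinations(items, 2):
            pairs.append(
                {
                    "lemma": lemma,
                    "sentence1": sent1,
                    "sentence2": sent2,
                    "label": sense1 == sense2,
                }
            )
    return pairs


# Toy English data, invented for this sketch (not taken from the datasets).
toy_annotations = [
    ("bank", "She deposited the cheque at the bank.", "bank.finance"),
    ("bank", "The bank approved the loan.", "bank.finance"),
    ("bank", "They picnicked on the river bank.", "bank.river"),
]

wic_pairs = build_wic_pairs(toy_annotations)
for pair in wic_pairs:
    print(pair["label"], "|", pair["sentence1"], "|", pair["sentence2"])
```

A model evaluated in this format only has to make a binary same-sense or different-sense decision per pair, which is why the sketch needs sense annotations but no shared sense inventory across languages.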
