CL | AI | Apr 2, 2024

Cross-lingual Text Classification Transfer: The Case of Ukrainian

arXiv:2404.02043v2 | 21 citations | h-index: 16 | COLING
Originality: Synthesis-oriented
AI Analysis

This work addresses the lack of labeled datasets for Ukrainian, enabling fairer NLP development for this under-resourced language, though it is incremental in applying existing methods.

The paper tackles data scarcity for Ukrainian text classification by exploring cross-lingual knowledge transfer methods that need no manual data curation, and it identifies the best-performing setup for each of three tasks: toxicity classification, formality classification, and NLI.

Despite the large number of labeled datasets in NLP text classification, a persistent imbalance in data availability across languages remains evident. To support the fair development of NLP models, it is crucial to explore effective knowledge transfer to new languages. Ukrainian, in particular, is a language that can still benefit from the continued refinement of cross-lingual methodologies. To our knowledge, there is a severe lack of Ukrainian corpora for typical text classification tasks, such as different types of style, harmful speech, or relationships between texts. However, collecting such corpora from scratch understandably requires substantial resources. In this work, we leverage state-of-the-art advances in NLP, exploring cross-lingual knowledge transfer methods that avoid manual data curation: large multilingual encoders and translation systems, LLMs, and language adapters. We test the approaches on three text classification tasks: toxicity classification, formality classification, and natural language inference (NLI). For each task, we provide a "recipe" for the optimal setup.

