CL, AI · Apr 2, 2024

Cross-lingual Text Classification Transfer: The Case of Ukrainian

Daryna Dementieva, Valeriia Khylenko, Georg Groh

arXiv:2404.02043v213.221 citations · h-index: 16 · COLING

Originality: Synthesis-oriented

AI Analysis

This work addresses the lack of labeled datasets for Ukrainian, enabling fairer NLP development for this under-resourced language, though it is incremental in applying existing methods.

The paper tackles data scarcity for Ukrainian text classification by exploring cross-lingual knowledge transfer methods, and it identifies the best-performing setup for each of three tasks (toxicity classification, formality classification, and NLI) without manual data curation.

Despite the large number of labeled datasets in the NLP text classification field, a persistent imbalance in data availability across languages remains evident. To support further fair development of NLP models, it is crucial to explore the possibilities of effective knowledge transfer to new languages. Ukrainian, in particular, is a language that can still benefit from the continued refinement of cross-lingual methodologies. To our knowledge, there is a severe lack of Ukrainian corpora for typical text classification tasks, e.g., different types of style, harmful speech, or text relationships. However, collecting such corpora from scratch understandably requires substantial resources. In this work, we leverage state-of-the-art advances in NLP, exploring cross-lingual knowledge transfer methods that avoid manual data curation: large multilingual encoders and translation systems, LLMs, and language adapters. We test the approaches on three text classification tasks (toxicity classification, formality classification, and natural language inference (NLI)), providing the "recipe" for the optimal setup for each task.
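
One common way a translation system can stand in for manual data curation is to translate a labeled source-language training set into the target language and carry the labels over unchanged. The following is a minimal sketch of that idea only, not the authors' pipeline: `translate_dataset`, `toy_translate`, and `TOY_LEXICON` are hypothetical names introduced for illustration, and the word-lookup "translator" is a placeholder for a real machine translation system.

```python
from typing import Callable, List, Tuple

# A labeled example: (text, label). Labels here are assumed to be 0 = non-toxic, 1 = toxic.
Example = Tuple[str, int]


def translate_dataset(
    examples: List[Example],
    translate: Callable[[str], str],
) -> List[Example]:
    """Translate every text with the given function and keep each label unchanged."""
    return [(translate(text), label) for text, label in examples]


# Hypothetical stand-in for a real translation system: a tiny word-level lookup.
TOY_LEXICON = {"hello": "привіт", "friend": "друже"}


def toy_translate(text: str) -> str:
    """Lowercase the text and replace known words; unknown words pass through as-is."""
    return " ".join(TOY_LEXICON.get(token, token) for token in text.lower().split())


# English labeled data becomes Ukrainian "training data" without manual annotation.
english_train: List[Example] = [("Hello friend", 0), ("hello", 0)]
ukrainian_train = translate_dataset(english_train, toy_translate)

for text, label in ukrainian_train:
    print(label, text)
```

In practice the `translate` argument would wrap an actual translation model, and the resulting pairs would be used to fine-tune a classifier; this sketch assumes the label still holds after translation, which may not be true for style-sensitive tasks such as formality.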
