CL, May 9, 2024

Using Machine Translation to Augment Multilingual Classification

arXiv:2405.05478v113.223 citations, EAMT
Originality: Incremental advance
AI Analysis

This paper addresses data scarcity in multilingual classification, but the advance is incremental: it builds on existing translation and loss techniques.

The study tackles a bottleneck in multilingual text classification, the need for annotated training data, by using machine translation to generate labeled data in multiple languages. It shows that translated data is sufficient for tuning classifiers and that a novel loss technique offers some improvement.

An all-too-present bottleneck for text classification model development is the need to annotate training data and this need is multiplied for multilingual classifiers. Fortunately, contemporary machine translation models are both easily accessible and have dependable translation quality, making it possible to translate labeled training data from one language into another. Here, we explore the effects of using machine translation to fine-tune a multilingual model for a classification task across multiple languages. We also investigate the benefits of using a novel technique, originally proposed in the field of image captioning, to account for potential negative effects of tuning models on translated data. We show that translated data are of sufficient quality to tune multilingual classifiers and that this novel loss technique is able to offer some improvement over models tuned without it.
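
The data-augmentation idea in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function `augment_with_translations`, the `translate` callable, and the stand-in `fake_translate` are all hypothetical names introduced here, and a real setup would plug in an actual machine translation model. The sketch only shows how each translated text inherits the label of its source example; it does not cover the loss technique mentioned in the abstract.

```python
from typing import Callable, Iterable, List, Tuple

# (text, label, language code); this tuple layout is an assumption for illustration.
Example = Tuple[str, str, str]


def augment_with_translations(
    examples: Iterable[Tuple[str, str]],
    source_lang: str,
    target_langs: List[str],
    translate: Callable[[str, str, str], str],
) -> List[Example]:
    """Return the original examples plus one translated copy per target language.

    Each translated text keeps the label of the example it was translated from.
    """
    augmented: List[Example] = []
    for text, label in examples:
        augmented.append((text, label, source_lang))
        for lang in target_langs:
            augmented.append((translate(text, source_lang, lang), label, lang))
    return augmented


def fake_translate(text: str, src: str, tgt: str) -> str:
    """Hypothetical stand-in for a real MT system; it only tags the text."""
    return f"[{src}->{tgt}] {text}"


train = [("great product", "positive"), ("arrived broken", "negative")]
augmented = augment_with_translations(train, "en", ["de", "fr"], fake_translate)
```

The resulting list could then be fed to whatever fine-tuning routine is used for the multilingual classifier; keeping the language code alongside each example makes it possible to treat translated and original data differently later on.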

