CL · Jul 3, 2020

El Departamento de Nosotros: How Machine Translated Corpora Affects Language Models in MRC Tasks

arXiv:2007.01955v1
Originality: Incremental advance
AI Analysis

This paper addresses the shortage of training data for less-resourced languages in NLP and offers an incremental improvement toward building more effective language models for them.

The study investigates how machine-translated corpora affect language models in machine reading comprehension (MRC) tasks. It finds that careful curation and post-processing of a machine-translated Spanish SQuAD dataset improve performance and robustness, and that multilingual models appear more resilient to translation artifacts as measured by the exact match score.

Pre-training large-scale language models (LMs) requires huge amounts of text corpora. LMs for English enjoy ever-growing corpora of diverse language resources. However, less-resourced languages and their mono- and multilingual LMs often struggle to obtain bigger datasets. A typical approach in this case is to machine-translate English corpora into the target language. In this work, we study the caveats of applying directly translated corpora for fine-tuning LMs for downstream natural language processing tasks and demonstrate that careful curation along with post-processing leads to improved performance and overall LM robustness. In the empirical evaluation, we compare directly translated and curated Spanish SQuAD datasets on both user and system levels. Further experimental results on the XQuAD and MLQA transfer-learning question answering evaluation tasks show that presumably multilingual LMs exhibit more resilience to machine translation artifacts in terms of the exact match score.
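
For readers unfamiliar with the exact match score mentioned in the abstract, the following is a minimal sketch of a SQuAD-style exact match computation. It is not taken from the paper or its repository. The normalization steps (lowercasing, stripping ASCII punctuation, dropping English articles, collapsing whitespace) follow the convention of the SQuAD v1.1 evaluation script; the function names are illustrative, and a Spanish evaluation would likely need its own article list and punctuation set, which is an assumption not covered by the text above.

```python
import re
import string

# English articles, following the SQuAD v1.1 convention. Assumption: a
# Spanish evaluation would substitute its own list (e.g. el, la, los, las).
ARTICLES = re.compile(r"\b(a|an|the)\b")


def normalize_answer(text: str) -> str:
    """Lowercase, drop ASCII punctuation and articles, collapse whitespace."""
    text = text.lower()
    # string.punctuation is ASCII only, so characters such as "¿" are kept.
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = ARTICLES.sub(" ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def exact_match_score(predictions: list[str], references: list[list[str]]) -> float:
    """Percentage of predictions that exactly match one of their references."""
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return 100.0 * hits / len(predictions)
```

Because the metric is all-or-nothing per question, a translated answer span that is shifted by a single token counts as a miss, which is one reason translation artifacts can show up clearly in this score.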

Code Implementations: 1 repo