XTREME-R: Towards More Challenging and Nuanced Multilingual Evaluation
XTREME-R gives researchers in multilingual NLP a more nuanced evaluation framework for understanding cross-lingual transfer learning, though the contribution is incremental because it builds on an existing benchmark.
The paper addresses the need for more challenging multilingual evaluation by extending the XTREME benchmark to XTREME-R, which comprises ten improved tasks covering 50 diverse languages. It also introduces a diagnostic suite and an interactive leaderboard for analyzing model performance.
Machine learning has brought striking advances in multilingual natural language processing capabilities over the past year. For example, the latest techniques have improved the state-of-the-art performance on the XTREME multilingual benchmark by more than 13 points. While a sizeable gap to human-level performance remains, improvements have been easier to achieve in some tasks than in others. This paper analyzes the current state of cross-lingual transfer learning and summarizes some lessons learned. In order to catalyze meaningful progress, we extend XTREME to XTREME-R, which consists of an improved set of ten natural language understanding tasks, including challenging language-agnostic retrieval tasks, and covers 50 typologically diverse languages. In addition, we provide a massively multilingual diagnostic suite (MultiCheckList) and fine-grained multi-dataset evaluation capabilities through an interactive public leaderboard to gain a better understanding of such models. The leaderboard and code for XTREME-R will be made available at https://sites.research.google/xtreme and https://github.com/google-research/xtreme respectively.