CL, Apr 20, 2018

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

arXiv:1804.07461v3, 8611 citations
Originality: Synthesis-oriented
AI Analysis

GLUE gives natural language processing researchers a standardized platform for assessing how well a model generalizes across tasks, though the contribution is incremental in that it aggregates existing tasks rather than introducing new ones.

The authors introduce GLUE, a multi-task benchmark for evaluating natural language understanding models across diverse tasks. Their baselines show that current multi-task and transfer learning methods do not substantially improve on the aggregate performance of training a separate model per task, indicating room for improvement.

For natural language understanding (NLU) technology to be maximally useful, both practically and as a scientific object of study, it must be general: it must be able to process language in a way that is not exclusively tailored to any one specific task or dataset. In pursuit of this objective, we introduce the General Language Understanding Evaluation benchmark (GLUE), a tool for evaluating and analyzing the performance of models across a diverse range of existing NLU tasks. GLUE is model-agnostic, but it incentivizes sharing knowledge across tasks because certain tasks have very limited training data. We further provide a hand-crafted diagnostic test suite that enables detailed linguistic analysis of NLU models. We evaluate baselines based on current methods for multi-task and transfer learning and find that they do not immediately give substantial improvements over the aggregate performance of training a separate model per task, indicating room for improvement in developing general and robust NLU systems.

Code Implementations: 11 repos

Data from Papers with Code (CC-BY-SA-4.0)

