CL · May 2, 2020

Predicting Performance for Natural Language Processing Tasks

arXiv:2005.00870v11011 citations
AI Analysis

This paper addresses the challenge of efficient model evaluation for NLP researchers, though the contribution is incremental: it applies existing regression methods to a new domain.

Exhaustively testing NLP models across diverse tasks, languages, and domains is computationally prohibitive. The paper addresses this by building regression models that predict evaluation scores without actually training or testing the model, and finds that these predictors outperform baselines and human experts on 9 NLP tasks.

Given the complexity of combinations of tasks, languages, and domains in natural language processing (NLP) research, it is computationally prohibitive to exhaustively test newly proposed models on each possible experimental setting. In this work, we attempt to explore the possibility of gaining plausible judgments of how well an NLP model can perform under an experimental setting, without actually training or testing the model. To do so, we build regression models to predict the evaluation score of an NLP experiment given the experimental settings as input. Experimenting on 9 different NLP tasks, we find that our predictors can produce meaningful predictions over unseen languages and different modeling architectures, outperforming reasonable baselines as well as human experts. Going further, we outline how our predictor can be used to find a small subset of representative experiments that should be run in order to obtain plausible predictions for all other experimental settings.
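As a rough illustration of the idea in the abstract (a regressor that maps experimental settings to an evaluation score), here is a minimal sketch. It uses ordinary least squares purely as a stand-in, since the text above does not say which regression model the authors use. The feature names, the numbers, and the helper functions `solve`, `fit_linear`, and `predict` are all hypothetical, and the scores are synthetic values generated from a known rule rather than results from the paper.

```python
# Toy sketch: predict an evaluation score from experimental-setting features.
# All feature names and numbers are synthetic assumptions, not data from the paper.

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_linear(features, scores):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    rows = [[1.0] + list(f) for f in features]  # prepend a bias term
    d = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    xty = [sum(r[i] * y for r, y in zip(rows, scores)) for i in range(d)]
    return solve(xtx, xty)

def predict(weights, feature):
    """Apply the fitted linear model to one experimental setting."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], feature))

# Hypothetical settings: (log10 of training-set size, distance to transfer language).
train_x = [(3.0, 0.1), (4.0, 0.5), (5.0, 0.2), (3.5, 0.8), (4.5, 0.4)]
# Synthetic scores generated from a known linear rule so the fit can be checked.
train_y = [10.0 + 5.0 * s - 20.0 * d for s, d in train_x]

weights = fit_linear(train_x, train_y)
unseen = (4.2, 0.3)  # a setting for which no experiment was "run"
estimate = predict(weights, unseen)
```

Because the synthetic scores are exactly linear in the two features, the fit recovers the generating rule, and `estimate` matches what that rule gives for the unseen setting. Real experimental records would be noisy and would likely call for a more flexible regressor and held-out evaluation.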
