CL, Sep 14, 2022

Drawing Causal Inferences About Performance Effects in NLP

arXiv:2209.06790v1
1 citation
h-index: 1
Originality: Synthesis-oriented
AI Analysis

This paper addresses a foundational methodological issue in NLP research: how to make generalizable claims about the efficacy of a method. The contribution is incremental, since it builds on existing causal inference concepts.

The paper argues that NLP research often cannot draw causal inferences about a method's performance effect, because it compares only a few models produced by processing systems that are not comparable. It proposes a procedure that randomly samples processing systems and uses expected generalization errors to estimate the average treatment effect.

This article emphasizes that NLP as a science seeks to make inferences about the performance effects that result from applying one method (compared to another method) in the processing of natural language. Yet NLP research in practice usually does not achieve this goal: In NLP research articles, typically only a few models are compared. Each model results from a specific procedural pipeline (here named processing system) that is composed of a specific collection of methods that are used in preprocessing, pretraining, hyperparameter tuning, and training on the target task. To make generalizing inferences about the performance effect that is caused by applying some method A vs. another method B, it is not sufficient to compare a few specific models that are produced by a few specific (probably incomparable) processing systems. Rather, the following procedure would allow drawing inferences about methods' performance effects: (1) A population of processing systems that researchers seek to infer to has to be defined. (2) A random sample of processing systems from this population is drawn. (The drawn processing systems in the sample will vary with regard to the methods they apply along their procedural pipelines and also will vary regarding the compositions of their training and test data sets used for training and evaluation.) (3) Each processing system is applied once with method A and once with method B. (4) Based on the sample of applied processing systems, the expected generalization errors of method A and method B are approximated. (5) The difference between the expected generalization errors of method A and method B is the estimated average treatment effect due to applying method A compared to method B in the population of processing systems.
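Steps (1) to (5) of the abstract can be sketched as a small simulation. This is only an illustrative sketch, not the paper's implementation: the population of processing systems, the pipeline fields (`preprocessing`, `learning_rate`, `data_seed`), the `run_system` stand-in, and the simulated effect size of -0.05 are all hypothetical assumptions made so that the example runs without training any model.

```python
import random
import statistics


def make_population(n_systems, seed=0):
    """Step 1: define a (hypothetical) population of processing systems.

    Each system is a dict of pipeline choices plus a seed that stands in
    for the composition of its training and test data sets.
    """
    rng = random.Random(seed)
    return [
        {
            "id": i,
            "preprocessing": rng.choice(["lowercase", "none"]),
            "learning_rate": rng.choice([1e-5, 3e-5, 1e-4]),
            "data_seed": rng.randrange(10_000),
        }
        for i in range(n_systems)
    ]


def run_system(system, method):
    """Stand-in for training and evaluating one system with one method.

    Returns a SIMULATED test error. In a real study this function would
    run the whole pipeline and return the measured test error instead.
    """
    # System-specific baseline error, shared by both methods.
    base = 0.30 + random.Random(system["data_seed"]).gauss(0, 0.03)
    if system["preprocessing"] == "lowercase":
        base -= 0.02
    # Assumed (made-up) true effect: method A lowers the error by 0.05.
    effect = -0.05 if method == "A" else 0.0
    # Run-to-run noise that differs between the two method runs.
    noise_seed = system["id"] * 2 + (1 if method == "A" else 0)
    noise = random.Random(noise_seed).gauss(0, 0.01)
    return base + effect + noise


def estimate_ate(population, sample_size, seed=1):
    """Steps 2 to 5: sample systems, apply A and B, compare mean errors."""
    sample = random.Random(seed).sample(population, sample_size)  # step 2
    errors_a = [run_system(s, "A") for s in sample]  # step 3
    errors_b = [run_system(s, "B") for s in sample]
    mean_a = statistics.mean(errors_a)  # step 4
    mean_b = statistics.mean(errors_b)
    # Paired differences give a standard error for the estimate.
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    std_err = statistics.stdev(diffs) / len(diffs) ** 0.5
    return {
        "error_A": mean_a,
        "error_B": mean_b,
        "ate": mean_a - mean_b,  # step 5
        "std_err": std_err,
    }


if __name__ == "__main__":
    population = make_population(500)
    print(estimate_ate(population, sample_size=100))
```

Because every sampled system is run once with each method, the design is paired: variation between systems cancels out in the per-system differences, so the estimate is driven by the method effect and not by which systems happened to be drawn.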

