CL, Oct 1, 2021

Expected Validation Performance and Estimation of a Random Variable's Maximum

arXiv:2110.00613v1662 citations
Originality: Incremental advance
AI Analysis

This work aims to improve result reporting and reproducibility in NLP research by guiding the choice of estimator for expected validation performance. The advance is incremental: it extends prior analyses, which focused on bias, to also cover variance and MSE.

The paper analyzes three statistical estimators for expected validation performance, a tool used to report model performance as a function of computational budget. It finds that the estimator with the smallest mean squared error (MSE) balances bias and variance, and that the two biased estimators lead to the fewest incorrect conclusions in model comparisons.

Research in NLP is often supported by experimental results, and improved reporting of such results can lead to better understanding and more reproducible science. In this paper we analyze three statistical estimators for expected validation performance, a tool used for reporting performance (e.g., accuracy) as a function of computational budget (e.g., number of hyperparameter tuning experiments). Where previous work analyzing such estimators focused on the bias, we also examine the variance and mean squared error (MSE). In both synthetic and realistic scenarios, we evaluate three estimators and find the unbiased estimator has the highest variance, and the estimator with the smallest variance has the largest bias; the estimator with the smallest MSE strikes a balance between bias and variance, displaying a classic bias-variance tradeoff. We use expected validation performance to compare between different models, and analyze how frequently each estimator leads to drawing incorrect conclusions about which of two models performs best. We find that the two biased estimators lead to the fewest incorrect conclusions, which hints at the importance of minimizing variance and MSE.
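To make the quantity in the abstract concrete, the sketch below computes two standard estimates of the expected maximum of n validation scores from N observed scores. One resamples with replacement and is biased. The other averages over all size-n subsets without replacement and is unbiased for n ≤ N. Whether these correspond exactly to the estimators analyzed in the paper is an assumption, and the function names and example scores are hypothetical.

```python
from math import comb


def expected_max_with_replacement(scores, n):
    """Expected max of n draws with replacement from the empirical distribution.

    This plug-in estimate is biased. Assumed form: sum over the sorted scores
    v_(i) of v_(i) * [(i/N)^n - ((i-1)/N)^n].
    """
    v = sorted(scores)
    N = len(v)
    if N == 0 or n < 1:
        raise ValueError("need at least one score and n >= 1")
    return sum(
        x * ((i / N) ** n - ((i - 1) / N) ** n)
        for i, x in enumerate(v, start=1)
    )


def expected_max_without_replacement(scores, n):
    """Average of the max over all size-n subsets of the observed scores.

    Unbiased for the expected max of n draws, but only defined for n <= N.
    Assumed form: sum of v_(i) * C(i-1, n-1) / C(N, n).
    """
    v = sorted(scores)
    N = len(v)
    if not 1 <= n <= N:
        raise ValueError("n must satisfy 1 <= n <= number of scores")
    total = comb(N, n)
    return sum(
        x * comb(i - 1, n - 1) / total
        for i, x in enumerate(v, start=1)
    )


if __name__ == "__main__":
    # Hypothetical validation accuracies from five hyperparameter trials.
    accuracies = [0.71, 0.74, 0.69, 0.80, 0.77]
    for budget in range(1, len(accuracies) + 1):
        biased = expected_max_with_replacement(accuracies, budget)
        unbiased = expected_max_without_replacement(accuracies, budget)
        print(budget, round(biased, 4), round(unbiased, 4))
```

With a budget of one trial, both estimates reduce to the sample mean. For larger budgets the with-replacement estimate is never above the without-replacement one, because resampling can draw the same trial more than once.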

