Our Evaluation Metric Needs an Update to Encourage Generalization
This work tackles inflated model performance, an issue relevant to AI researchers and practitioners, though the contribution appears incremental because it targets evaluation rather than training methods.
The paper addresses the problem of models overfitting to dataset biases and performing poorly on out-of-distribution data, proposing a novel evaluation metric called WOOD Score to encourage generalization and mitigate overestimation of AI capabilities.
Models that surpass human performance on several popular benchmarks degrade significantly when exposed to out-of-distribution (OOD) data. Recent research has shown that models overfit to spurious biases and "hack" datasets instead of learning generalizable features as humans do. To stop the inflation in model performance, and thus the overestimation of AI systems' capabilities, we propose a simple and novel evaluation metric, WOOD Score, that encourages generalization during evaluation.