Identifying and Benchmarking Natural Out-of-Context Prediction Problems
This work addresses a problem that matters to deep learning practitioners: unreliable predictions on uncommon inputs. Its contribution is incremental, however, since it builds on existing benchmarks and focuses on measuring and identifying the problem rather than solving it.
The paper tackles deep learning failures on out-of-context (OOC) prediction. It introduces a framework that unifies OOC performance measurement and uses auxiliary information to identify candidate OOC examples in existing datasets. From this it builds NOOCh, a suite of naturally-occurring challenge sets for probing specific failure modes and exploring tradeoffs between learning approaches.
Deep learning systems frequently fail at out-of-context (OOC) prediction, the problem of making reliable predictions on uncommon or unusual inputs or subgroups of the training distribution. To study this problem, a number of benchmarks for measuring OOC performance have recently been introduced. In this work, we introduce a framework unifying the literature on OOC performance measurement, and demonstrate how rich auxiliary information can be leveraged to identify candidate sets of OOC examples in existing datasets. We present NOOCh: a suite of naturally-occurring "challenge sets", and show how varying notions of context can be used to probe specific OOC failure modes. Experimentally, we explore the tradeoffs between various learning approaches on these challenge sets and demonstrate how the choices made in designing OOC benchmarks can yield varying conclusions.
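As a rough illustration of what identifying candidate OOC examples from auxiliary information could look like, the sketch below flags examples whose binary auxiliary attribute disagrees with its usual association with the label, then compares accuracy on the flagged and unflagged subsets. The criterion, the function names (`positive_rate`, `candidate_ooc`, `accuracy`), and the toy data are all assumptions made for illustration; they are not the paper's definitions or results.

```python
from typing import List, Sequence


def positive_rate(labels: Sequence[int], mask: Sequence[bool]) -> float:
    """Fraction of positive labels among the examples selected by mask."""
    selected = [y for y, m in zip(labels, mask) if m]
    return sum(selected) / len(selected) if selected else 0.0


def candidate_ooc(labels: Sequence[int], attribute: Sequence[int]) -> List[bool]:
    """Hypothetical criterion: flag examples that break the usual
    label/attribute association (binary label, binary attribute)."""
    rate_with = positive_rate(labels, [a == 1 for a in attribute])
    rate_without = positive_rate(labels, [a == 0 for a in attribute])
    # True when the attribute tends to co-occur with the positive label.
    positively_associated = rate_with >= rate_without
    # Under a positive association, "in context" means label == attribute;
    # under a negative one, it means label != attribute.
    return [(y == a) != positively_associated for y, a in zip(labels, attribute)]


def accuracy(labels: Sequence[int], preds: Sequence[int], mask: Sequence[bool]) -> float:
    """Accuracy restricted to the examples selected by mask."""
    pairs = [(y, p) for y, p, m in zip(labels, preds, mask) if m]
    return sum(y == p for y, p in pairs) / len(pairs) if pairs else float("nan")


# Toy data (invented): the attribute usually agrees with the label.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
attribute = [1, 1, 1, 0, 0, 0, 0, 1]
# A stand-in "model" that predicts the label directly from the attribute.
preds = list(attribute)

ooc = candidate_ooc(labels, attribute)
in_context = [not flag for flag in ooc]

print("in-context accuracy:", accuracy(labels, preds, in_context))
print("out-of-context accuracy:", accuracy(labels, preds, ooc))
```

A model that leans entirely on the context attribute looks perfect on the in-context subset and fails on the flagged subset, which is the kind of gap an OOC challenge set is meant to expose. Changing the flagging criterion changes which examples land in each subset, and so can change the conclusion.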