CL | Aug 9, 2024

Investigating a Benchmark for Training-set free Evaluation of Linguistic Capabilities in Machine Reading Comprehension

Viktor Schlegel, Goran Nenadic, Riza Batista-Navarro

arXiv:2408.05023v1 | 1.0 | h-index: 23

Originality: Incremental advance

AI Analysis

The paper addresses spurious correlations and limited diversity in NLP evaluation, a concern for both researchers and practitioners. The advance is incremental because it builds on existing synthetic data methods.

The paper tackles the evaluation of machine reading comprehension (MRC) models by examining a training-set free framework built on synthetically generated challenge sets. It finds that these sets can compete with crowd-sourced datasets in naturalness and lexical diversity. It also shows that state-of-the-art models can learn to succeed on the challenge sets without capturing the general notion of the linguistic phenomena under evaluation.

Performance of NLP systems is typically evaluated by collecting a large-scale dataset by means of crowd-sourcing to train a data-driven model and evaluate it on a held-out portion of the data. This approach has been shown to suffer from spurious correlations and the lack of challenging examples that represent the diversity of natural language. Instead, we examine a framework for evaluating optimised models in a training-set free setting on synthetically generated challenge sets. We find that despite the simplicity of the generation method, the data can compete with crowd-sourced datasets with regard to naturalness and lexical diversity for the purpose of evaluating the linguistic capabilities of MRC models. We conduct further experiments and show that state-of-the-art language model-based MRC systems can learn to succeed on the challenge set correctly, although without capturing the general notion of the evaluated phenomenon.
