CL · Oct 10, 2019

R4C: A Benchmark for Evaluating RC Systems to Get the Right Answer for the Right Reason

Naoya Inoue, Pontus Stenetorp, Kentaro Inui

arXiv:1910.04601v21013 citations

Originality: Incremental advance

AI Analysis

The paper addresses a problem for the AI research community: measuring the progress of reading comprehension systems reliably. The advance is incremental because it builds on existing evaluation tasks.

Reading comprehension systems exploit biases in datasets. The paper tackles this by introducing R4C, a benchmark that requires both answers and derivations. It contributes a publicly released dataset of 4.6k questions with 13.8k derivations, along with automatic evaluation metrics that the experiments show to be reliable.

Recent studies have revealed that reading comprehension (RC) systems learn to exploit annotation artifacts and other biases in current datasets. This prevents the community from reliably measuring the progress of RC systems. To address this issue, we introduce R4C, a new task for evaluating RC systems' internal reasoning. R4C requires giving not only answers but also derivations: explanations that justify predicted answers. We present a reliable, crowdsourced framework for scalably annotating RC datasets with derivations. We create and publicly release the R4C dataset, the first, quality-assured dataset consisting of 4.6k questions, each of which is annotated with 3 reference derivations (i.e. 13.8k derivations). Experiments show that our automatic evaluation metrics using multiple reference derivations are reliable, and that R4C assesses different skills from an existing benchmark.
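The abstract mentions automatic metrics that use multiple reference derivations but does not define them here. The following is only a minimal sketch of one common way to score against several references: compare the predicted derivation with each reference and keep the best match. The triple representation, the exact-match F1, and the names `Triple`, `derivation_f1`, and `multi_reference_score` are assumptions made for illustration, not the paper's actual metric or data format.

```python
from typing import FrozenSet, Iterable, Tuple

# Hypothetical representation: one derivation step as a (head, relation, tail) triple.
Triple = Tuple[str, str, str]
Derivation = FrozenSet[Triple]


def derivation_f1(predicted: Derivation, reference: Derivation) -> float:
    """Exact-match F1 between two derivations, treated as sets of steps."""
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def multi_reference_score(predicted: Derivation,
                          references: Iterable[Derivation]) -> float:
    """Score a prediction by its best match among the reference derivations."""
    return max((derivation_f1(predicted, ref) for ref in references), default=0.0)


# Toy example with three references per question, mirroring the 3-per-question setup.
a = ("Film X", "is directed by", "Person Y")
b = ("Person Y", "was born in", "City Z")
c = ("City Z", "is in", "Country W")
d = ("Film X", "was released in", "1999")

predicted = frozenset({a, b})
references = [frozenset({a, b, c}), frozenset({a, d}), frozenset({c})]
score = multi_reference_score(predicted, references)
```

Taking the maximum means a prediction is not penalized for matching one valid explanation rather than another, which is one motivation for collecting several references per question.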
