CL, Apr 12, 2022

Quantified Reproducibility Assessment of NLP Results

arXiv:2204.05961v132.0641 citations, h-index: 23

Originality: Incremental advance

AI Analysis

This paper addresses reproducibility in NLP research by providing a standardized metric that researchers can use to evaluate and improve the reliability of their results. The advance is incremental, as it builds on existing metrology frameworks.

The paper proposes a quantified reproducibility assessment (QRA) method based on metrology concepts. QRA produces a single score that estimates the degree of reproducibility of a given system and evaluation measure. The method is tested on 18 system and evaluation measure combinations, each with the original results and one to seven reproductions.

This paper describes and tests a method for carrying out quantified reproducibility assessment (QRA) that is based on concepts and definitions from metrology. QRA produces a single score estimating the degree of reproducibility of a given system and evaluation measure, on the basis of the scores from, and differences between, different reproductions. We test QRA on 18 system and evaluation measure combinations (involving diverse NLP tasks and types of evaluation), for each of which we have the original results and one to seven reproduction results. The proposed QRA method produces degree-of-reproducibility scores that are comparable across multiple reproductions not only of the same, but of different original studies. We find that the proposed method facilitates insights into causes of variation between reproductions, and allows conclusions to be drawn about what changes to system and/or evaluation design might lead to improved reproducibility.
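
The abstract does not spell out how the single score is computed from the original and reproduction scores. As a rough illustration only, the sketch below assumes a precision measure common in metrology: the coefficient of variation with a small-sample correction, expressed as a percentage (lower means more reproducible). The function name `cv_star`, the correction factor `1 + 1/(4n)`, and the example numbers are assumptions made for this sketch, not details taken from this page.

```python
from statistics import mean, stdev


def cv_star(scores):
    """Small-sample-corrected coefficient of variation, as a percentage.

    `scores` holds the original score plus its reproductions for one
    system and evaluation measure. Assumes a ratio-scale measure with
    a non-zero mean (hypothetical helper, not the paper's code).
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need the original score plus at least one reproduction")
    m = mean(scores)
    if m == 0:
        raise ValueError("mean is zero; coefficient of variation is undefined")
    s = stdev(scores)  # sample standard deviation (n - 1 denominator)
    cv = 100.0 * s / abs(m)
    # Assumed small-sample correction factor: 1 + 1/(4n)
    return (1.0 + 1.0 / (4.0 * n)) * cv


# Illustrative, made-up scores: one original result and two reproductions.
example_scores = [27.1, 26.4, 27.9]
example_result = cv_star(example_scores)
```

Because the score is normalized by the mean, values computed this way can be compared across different systems and measures, which is the property the abstract highlights.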
