Evaluating Paraphrastic Robustness in Textual Entailment Models
This work addresses model reliability in natural language understanding, a concern for both researchers and practitioners. Its contribution is incremental in that it focuses on evaluation rather than on new methods.
The authors evaluate whether textual entailment models are robust to paraphrasing by creating PaRTE, a dataset of 1,126 pairs of examples, and find that contemporary models change their predictions on 8-16% of paraphrased examples, indicating room for improvement.
We present PaRTE, a collection of 1,126 pairs of Recognizing Textual Entailment (RTE) examples to evaluate whether models are robust to paraphrasing. We posit that if RTE models understand language, their predictions should be consistent across inputs that share the same meaning. We use the evaluation set to determine whether RTE models' predictions change when examples are paraphrased. In our experiments, contemporary models change their predictions on 8-16% of paraphrased examples, indicating that there is still room for improvement.
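The consistency check described above can be sketched as a simple rate: the fraction of (original, paraphrase) pairs on which a model's predicted label differs. The following is a minimal illustration under stated assumptions, not the authors' evaluation code. The names `prediction_change_rate`, `toy_model`, and `toy_pairs`, the label strings, and the example sentences are all hypothetical, and the exact metric definition used in the paper may differ.

```python
from typing import Callable, Sequence, Tuple

# Assumption: an RTE example is a (premise, hypothesis) pair of strings,
# and a model is any callable mapping such a pair to a label string.
Example = Tuple[str, str]
Model = Callable[[str, str], str]


def prediction_change_rate(
    model: Model,
    pairs: Sequence[Tuple[Example, Example]],
) -> float:
    """Fraction of (original, paraphrase) pairs whose predicted labels differ."""
    if not pairs:
        raise ValueError("need at least one (original, paraphrase) pair")
    changed = sum(
        model(*original) != model(*paraphrase)
        for original, paraphrase in pairs
    )
    return changed / len(pairs)


def toy_model(premise: str, hypothesis: str) -> str:
    # Hypothetical stand-in classifier, used only to make the sketch runnable:
    # it predicts entailment when every hypothesis token occurs in the premise.
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = hypothesis.lower().split()
    if all(token in premise_tokens for token in hypothesis_tokens):
        return "entailed"
    return "not-entailed"


# Illustrative pairs, invented for this sketch (not taken from PaRTE).
toy_pairs = [
    (
        ("a man is sleeping", "a man is sleeping"),
        ("a man is asleep", "a man is sleeping"),
    ),
    (
        ("the dog runs", "the dog runs"),
        ("the dog runs fast", "the dog runs"),
    ),
]

rate = prediction_change_rate(toy_model, toy_pairs)
print(f"prediction change rate: {rate:.2%}")
```

A model that is fully robust to paraphrasing would score 0 under this measure. The lexical-overlap toy model is deliberately brittle, so it flips its label on the first pair as soon as "sleeping" is reworded to "asleep".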