REFER: An End-to-end Rationale Extraction Framework for Explanation Regularization
This work addresses the need for better explanation methods in NLP, offering a framework that improves both model interpretability and task performance, though its contribution is incremental.
The paper tackles the problem of extracting faithful and plausible rationales in explainable NLP. It proposes REFER, an end-to-end framework that jointly trains the task model and the rationale extractor, and reports improvements in faithfulness, plausibility, and task accuracy, with composite normalized relative gains of 11% on e-SNLI and 3% on CoS-E over previous baselines.
Human-annotated textual explanations are becoming increasingly important in Explainable Natural Language Processing. Rationale extraction aims to provide faithful (i.e., reflective of the model's behavior) and plausible (i.e., convincing to humans) explanations by highlighting the inputs that had the largest impact on the prediction, without compromising the performance of the task model. In recent work, rationale extractors were trained primarily to optimize plausibility using human highlights, while the task model was trained to jointly optimize task predictive accuracy and faithfulness. We propose REFER, a framework that employs a differentiable rationale extractor, which allows back-propagation through the rationale extraction process. We analyze the impact of using human highlights during training by jointly training the task model and the rationale extractor. In our experiments, REFER yields significantly better results in terms of faithfulness, plausibility, and downstream task accuracy on both in-distribution and out-of-distribution data. On e-SNLI and CoS-E, our best setting improves composite normalized relative gain over the previous baselines by 11% and 3%, respectively.
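
The idea of back-propagating through the extraction step can be illustrated with a minimal sketch. It assumes a sigmoid-relaxed soft token mask and a binary cross-entropy plausibility loss against human highlights. The abstract does not describe REFER's actual extractor or loss, so this relaxation is a stand-in, and `soft_mask`, `plausibility_loss`, `plausibility_grad`, `joint_loss`, the temperature, and the loss weight are all hypothetical names and values chosen for illustration.

```python
import math

# ASSUMPTION: a sigmoid relaxation stands in for the differentiable extractor;
# the abstract does not say which relaxation REFER uses.
def soft_mask(scores, temperature=1.0):
    """Map each token score to a selection weight in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-s / temperature)) for s in scores]

# ASSUMPTION: plausibility is measured as mean binary cross-entropy between
# the soft mask and 0/1 human highlights.
def plausibility_loss(mask, human_highlights):
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for m, h in zip(mask, human_highlights):
        total -= h * math.log(m + eps) + (1 - h) * math.log(1.0 - m + eps)
    return total / len(mask)

def plausibility_grad(scores, human_highlights, temperature=1.0):
    """Analytic gradient of the plausibility loss w.r.t. the token scores.

    Because the mask is a smooth function of the scores, the gradient is
    non-zero and can flow back into whatever produced the scores.
    """
    mask = soft_mask(scores, temperature)
    n = len(scores)
    return [(m - h) / (temperature * n) for m, h in zip(mask, human_highlights)]

# ASSUMPTION: the joint objective is a weighted sum; the weight is illustrative.
def joint_loss(task_loss, plaus_loss, weight=0.5):
    return task_loss + weight * plaus_loss

# Toy usage: three tokens, of which only the first is highlighted by a human.
example_scores = [2.0, -1.0, 0.5]
example_highlights = [1, 0, 0]
example_grad = plausibility_grad(example_scores, example_highlights)
# One gradient-descent step on the scores moves the mask toward the highlights.
updated_scores = [s - g for s, g in zip(example_scores, example_grad)]
```

A hard top-k selection, by contrast, is piecewise constant in the scores, so its gradient is zero almost everywhere. That is why some form of relaxation is needed before the task model and the extractor can be trained jointly.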