WT5?! Training Text-to-Text Models to Explain their Predictions
This work addresses the challenge of interpretability in NLP for researchers and practitioners. It offers a practical method that requires no changes to training procedures, though it builds incrementally on existing text-to-text frameworks.
The paper tackles the problem of making neural network predictions interpretable by training text-to-text models to generate natural language explanations alongside their predictions, achieving state-of-the-art results on explainability benchmarks and enabling learning from limited labeled data and cross-dataset transfer.
Neural networks have recently achieved human-level performance on various challenging natural language processing (NLP) tasks, but it is notoriously difficult to understand why a neural network produced a particular prediction. In this paper, we leverage the text-to-text framework proposed by Raffel et al. (2019) to train language models to output a natural text explanation alongside their prediction. Crucially, this requires no modifications to the loss function or to the training and decoding procedures: we simply train the model to output the explanation after generating the (natural text) prediction. We show that this approach not only obtains state-of-the-art results on explainability benchmarks, but also permits learning from a limited set of labeled explanations and transferring rationalization abilities across datasets. To facilitate reproducibility and future work, we release the code used to train the models.
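To make the idea of emitting the explanation after the prediction concrete, the following is a minimal sketch of how input and target strings could be built for such a text-to-text model. The function names, the "explain" task prefix, and the " explanation: " separator are illustrative assumptions; the text above does not specify the exact format.

```python
from typing import Optional, Tuple

# Assumed separator between the predicted label and its explanation.
SEPARATOR = " explanation: "


def format_example(
    task: str, text: str, label: str, explanation: Optional[str] = None
) -> Tuple[str, str]:
    """Build an (input, target) string pair for a text-to-text model.

    When an explanation is available, the target is the label followed by
    the explanation, so the ordinary sequence loss covers both.
    """
    if explanation is None:
        # Label-only example: plain task prefix, target is just the label.
        return f"{task}: {text}", label
    # Hypothetical "explain" prefix asks the model to also produce an explanation.
    return f"explain {task}: {text}", f"{label}{SEPARATOR}{explanation}"


def parse_output(output: str) -> Tuple[str, Optional[str]]:
    """Split generated text back into (label, explanation or None)."""
    label, sep, explanation = output.partition(SEPARATOR)
    return label.strip(), (explanation.strip() if sep else None)


if __name__ == "__main__":
    source, target = format_example(
        "sentiment",
        "The acting was terrible.",
        "negative",
        "the acting was terrible",
    )
    print(source)
    print(target)
    print(parse_output(target))
```

Because both the label and the explanation live in a single target string, examples with and without explanations can be mixed in one training set, which is one plausible way to use a limited set of labeled explanations.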