Understanding and Mitigating Classification Errors Through Interpretable Token Patterns
This work addresses the need for interpretable error analysis in NLP as a route to more reliable classifiers. It offers practitioners a new approach, although the contribution is incremental in that it builds on existing interpretability methods.
The paper tackles the problem of characterizing systematic errors in NLP classifiers by discovering interpretable token patterns that distinguish correct from erroneous predictions. It proposes Premise, a method based on the Minimum Description Length principle, which recovers ground truth even on imbalanced data and provides actionable insights in visual question answering (VQA) and named entity recognition (NER) case studies.
State-of-the-art NLP methods achieve human-like performance on many tasks, but nevertheless make errors. Characterizing these errors in easily interpretable terms gives insight into whether a classifier is prone to making systematic errors, and also offers a way to act on them and improve the classifier. We propose to discover those patterns of tokens that distinguish correct from erroneous predictions, so as to obtain global and interpretable descriptions for arbitrary NLP classifiers. We formulate the problem of finding a succinct and non-redundant set of such patterns in terms of the Minimum Description Length principle. Through an extensive set of experiments, we show that our method, Premise, performs well in practice. Unlike existing solutions, it recovers ground truth, even on highly imbalanced data over large vocabularies. In VQA and NER case studies, we confirm that it gives clear and actionable insight into the systematic errors made by NLP classifiers.
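To make the idea of description-length-based pattern scoring concrete, the following is a minimal sketch and not the Premise algorithm itself. It assumes each input is a set of tokens with a binary label (1 for a misclassified instance, 0 for a correct one), scores a candidate pattern by how many bits it saves when the labels are encoded separately for instances that contain the pattern, and charges a fixed, arbitrarily chosen cost per token in the pattern. The function names (`entropy_bits`, `label_gain`, `best_patterns`), the `cost_per_token` parameter, the exhaustive search, and the toy data are all hypothetical illustrations; the actual encoding and search used by Premise are not specified in this summary.

```python
import math
from itertools import combinations


def entropy_bits(k, n):
    """Bits needed to encode n binary labels, k of which are 1, with an ideal code."""
    if n == 0 or k == 0 or k == n:
        return 0.0
    p = k / n
    return -n * (p * math.log2(p) + (1 - p) * math.log2(1 - p))


def label_gain(pattern, instances, labels):
    """Bits saved by encoding labels separately for instances containing the pattern."""
    covered = [y for x, y in zip(instances, labels) if pattern <= x]
    rest = [y for x, y in zip(instances, labels) if not pattern <= x]
    base = entropy_bits(sum(labels), len(labels))
    split = entropy_bits(sum(covered), len(covered)) + entropy_bits(sum(rest), len(rest))
    return base - split


def best_patterns(instances, labels, max_size=2, cost_per_token=2.0):
    """Exhaustively score all token sets up to max_size; keep those with positive net gain.

    cost_per_token is an assumed, simplistic stand-in for the model cost in MDL.
    """
    vocab = sorted(set().union(*instances))
    scored = []
    for size in range(1, max_size + 1):
        for combo in combinations(vocab, size):
            gain = label_gain(frozenset(combo), instances, labels) - cost_per_token * size
            if gain > 0:
                scored.append((gain, combo))
    return sorted(scored, reverse=True)


# Toy data (hypothetical): questions containing both "how" and "many" are misclassified.
toy_instances = [
    {"how", "many", "dogs"}, {"how", "many", "cats"},
    {"how", "many", "cars"}, {"how", "many", "birds"},
    {"how", "old", "dog"}, {"many", "people", "here"},
    {"what", "color", "car"}, {"what", "is", "this"},
    {"where", "is", "dog"}, {"how", "is", "this"},
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

for gain, pattern in best_patterns(toy_instances, toy_labels)[:3]:
    print(f"{pattern}: {gain:.2f} bits saved")
```

On this toy data the conjunction of "how" and "many" ranks above either token alone, because it separates the erroneous instances exactly and the bits saved outweigh the extra cost of a second token. Exhaustive enumeration is only feasible for tiny vocabularies; it is used here purely for illustration.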