LG AI HC · Nov 15, 2023

Wrapper Boxes: Faithful Attribution of Model Predictions to Training Data

Yiheng Su, Junyi Jessy Li, Matthew Lease

arXiv:2311.08644v33.81 citations · h-index: 3 · Has Code

Originality: Incremental advance

AI Analysis

Wrapper boxes make AI decisions contestable by identifying the training data responsible for each prediction, which addresses the transparency needs of language model users.

The authors tackled the problem of providing faithful explanations for neural model predictions by training interpretable 'wrapper box' models on learned neural features, achieving comparable predictive performance across seven language models while enabling direct attribution of decisions to specific training examples.

Can we preserve the accuracy of neural models while also providing faithful explanations of model decisions to training data? We propose a "wrapper box" pipeline: training a neural model as usual and then using its learned feature representation in classic, interpretable models to perform prediction. Across seven language models of varying sizes, including four large language models (LLMs), two datasets at different scales, three classic models, and four evaluation metrics, we first show that the predictive performance of wrapper classic models is largely comparable to the original neural models. Because classic models are transparent, each model decision is determined by a known set of training examples that can be directly shown to users. Our pipeline thus preserves the predictive performance of neural language models while faithfully attributing classic model decisions to training data. Among other use cases, such attribution enables model decisions to be contested based on responsible training instances. Compared to prior work, our approach achieves higher coverage and correctness in identifying which training data to remove to change a model decision. To reproduce findings, our source code is online at: https://github.com/SamSoup/WrapperBox.
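The pipeline described in the abstract can be illustrated with a minimal sketch. It assumes k-nearest neighbors as the classic model and uses small hand-written vectors as stand-ins for features extracted from a trained neural model. The function names `knn_wrapper_predict` and `greedy_removal_set`, the toy data, and the greedy removal heuristic are all hypothetical illustrations, not the authors' implementation (see the linked repository for that).

```python
import math
from collections import Counter


def knn_wrapper_predict(train_feats, train_labels, query_feat, k=3):
    """Predict a label and return the training indices that determined it."""
    # Rank training examples by Euclidean distance in the learned feature space.
    ranked = sorted(
        range(len(train_feats)),
        key=lambda i: math.dist(train_feats[i], query_feat),
    )
    neighbors = ranked[:k]
    votes = Counter(train_labels[i] for i in neighbors)
    # Majority vote; ties are broken by the smallest label for determinism.
    label = min(votes, key=lambda lab: (-votes[lab], lab))
    return label, neighbors


def greedy_removal_set(train_feats, train_labels, query_feat, k=3):
    """Greedy heuristic (an assumption, not the paper's algorithm): drop the
    nearest supporting neighbor until the prediction changes."""
    original, _ = knn_wrapper_predict(train_feats, train_labels, query_feat, k)
    active = list(range(len(train_feats)))
    removed = []
    while active:
        feats = [train_feats[i] for i in active]
        labels = [train_labels[i] for i in active]
        label, nbrs = knn_wrapper_predict(feats, labels, query_feat, k)
        if label != original:
            return removed
        # Map local neighbor positions back to original training indices.
        supporters = [active[j] for j in nbrs if labels[j] == original]
        active.remove(supporters[0])
        removed.append(supporters[0])
    return None  # the prediction could not be flipped by removal alone


# Hypothetical stand-ins for neural features of six training examples.
train_feats = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (1.0, 1.1), (0.9, 1.0), (1.2, 0.8)]
train_labels = ["neg", "neg", "neg", "pos", "pos", "pos"]
query = (0.15, 0.1)

label, evidence = knn_wrapper_predict(train_feats, train_labels, query, k=3)
to_remove = greedy_removal_set(train_feats, train_labels, query, k=3)
print(label, evidence, to_remove)
```

Because the prediction is a vote among the returned neighbors, the attribution is faithful by construction: the indices in `evidence` are exactly the training examples that decided the label. In a real setting, the feature vectors would come from the trained neural model rather than being written by hand.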
