LG AI · Jun 1, 2022

Interpretability Guarantees with Merlin-Arthur Classifiers

Stephan Wäldchen, Kartikey Sharma, Berkant Turan, Max Zimmer, Sebastian Pokutta

arXiv:2206.00759v39.68 citations · h-index: 29 · Has Code

Originality: Incremental advance

AI Analysis

The paper addresses the challenge of making AI systems interpretable for users who need trustworthy decisions. The advance is incremental: it builds on existing interactive setups and extends them with new concepts.

The paper tackles the problem of providing interpretability guarantees for complex classifiers such as neural networks. It proposes an interactive multi-agent classifier inspired by the Merlin-Arthur protocol and derives provable lower bounds on the mutual information between selected features and the classification decision, without relying on optimal agents or independently distributed features.

We propose an interactive multi-agent classifier that provides provable interpretability guarantees even for complex agents such as neural networks. These guarantees consist of lower bounds on the mutual information between selected features and the classification decision. Our results are inspired by the Merlin-Arthur protocol from Interactive Proof Systems and express these bounds in terms of measurable metrics such as soundness and completeness. Compared to existing interactive setups, we rely neither on optimal agents nor on the assumption that features are distributed independently. Instead, we use the relative strength of the agents as well as the new concept of Asymmetric Feature Correlation which captures the precise kind of correlations that make interpretability guarantees difficult. We evaluate our results on two small-scale datasets where high mutual information can be verified explicitly.
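The abstract states its bounds in terms of measurable completeness and soundness but does not define them. The sketch below shows one way such quantities could be estimated from logged runs, under assumed definitions that are not taken from the abstract: completeness is the worst-case per-class rate at which Arthur outputs the true class when a cooperative prover picks the features, and soundness is the worst-case per-class rate at which Arthur avoids a wrong class (answering correctly or abstaining) when an adversarial prover picks them. The names `ABSTAIN`, `per_class_rate`, `merlin_runs`, and `adversary_runs` are hypothetical, and the data is a toy illustration.

```python
from collections import defaultdict

ABSTAIN = None  # hypothetical marker for Arthur answering "don't know"


def per_class_rate(records, accept):
    """Minimum over true classes of the fraction of records that `accept` approves.

    `records` is a list of (true_label, arthur_output) pairs.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for true_label, arthur_output in records:
        totals[true_label] += 1
        if accept(true_label, arthur_output):
            hits[true_label] += 1
    return min(hits[c] / totals[c] for c in totals)


def completeness(records):
    # Assumed definition: with cooperative features, Arthur must output the true class.
    return per_class_rate(records, lambda y, out: out == y)


def soundness(records):
    # Assumed definition: with adversarial features, Arthur may abstain
    # but must not output a wrong class.
    return per_class_rate(records, lambda y, out: out == y or out is ABSTAIN)


# Toy logs of (true_label, arthur_output); not data from the paper.
merlin_runs = [(0, 0), (0, 0), (0, ABSTAIN), (1, 1), (1, 1), (1, 1)]
adversary_runs = [(0, ABSTAIN), (0, 0), (0, 1), (0, ABSTAIN), (1, ABSTAIN), (1, 1)]

print("completeness:", completeness(merlin_runs))
print("soundness:", soundness(adversary_runs))
```

Taking the minimum over classes rather than the average means a single poorly handled class lowers the score, which suits a guarantee that should hold for every class.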
