Understanding Task Representations in Neural Networks via Bayesian Ablation
This work addresses the problem of interpretability in neural networks for researchers in cognitive modeling and AI, offering a novel method for analyzing latent representations.
The authors tackled the challenge of interpreting learned task representations in neural networks by introducing a probabilistic framework based on Bayesian inference, which infers causal contributions of representational units to task performance and provides tools to analyze properties like distributedness and complexity.
Neural networks are powerful tools for cognitive modeling due to their flexibility and emergent properties. However, interpreting their learned representations remains challenging due to their sub-symbolic semantics. In this work, we introduce a novel probabilistic framework for interpreting latent task representations in neural networks. Inspired by Bayesian inference, our approach defines a distribution over representational units to infer their causal contributions to task performance. Using ideas from information theory, we propose a suite of tools and metrics to illuminate key model properties, including representational distributedness, manifold complexity, and polysemanticity.