Discovering Chunks in Neural Embeddings for Interpretability
This work offers a new interpretability method for neural networks, addressing a foundational challenge in AI, though it is incremental in that it applies cognitive principles to artificial systems.
The paper addresses neural network interpretability by proposing a chunking framework inspired by human cognition. It shows that hidden states in RNNs and LLMs reflect recurring input patterns as identifiable chunks, and that perturbing these states activates or inhibits the associated concepts.
Understanding neural networks is challenging due to their high-dimensional, interacting components. Inspired by human cognition, which processes complex sensory data by chunking it into recurring entities, we propose leveraging this principle to interpret artificial neural population activities. Biological and artificial intelligence share the challenge of learning from structured, naturalistic data, and we hypothesize that the cognitive mechanism of chunking can provide insights into artificial systems. We first demonstrate this concept in recurrent neural networks (RNNs) trained on artificial sequences with imposed regularities, observing that their hidden states reflect these patterns, which can be extracted as a dictionary of chunks that influence network responses. Extending this to large language models (LLMs) like LLaMA, we identify similar recurring embedding states corresponding to concepts in the input, with perturbations to these states activating or inhibiting the associated concepts. By exploring methods to extract dictionaries of identifiable chunks across neural embeddings of varying complexity, our findings introduce a new framework for interpreting neural networks, framing their population activity as structured reflections of the data they process.
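To make the idea of a "dictionary of chunks" concrete, the following is a minimal sketch and not the paper's actual procedure. It assumes that a chunk can be approximated as a recurring direction in hidden-state space, and it groups states greedily by cosine similarity. The function names `extract_chunk_dictionary` and `steer`, the `threshold` parameter, and the synthetic hidden states are all hypothetical and chosen only for illustration.

```python
import numpy as np


def extract_chunk_dictionary(hidden_states, threshold=0.9):
    """Greedily group hidden states into recurring prototypes.

    Assumption (not from the paper): two states belong to the same
    chunk when their cosine similarity is at least `threshold`.

    hidden_states: array of shape (T, d), one embedding per time step.
    Returns (prototypes, labels), where prototypes has shape (K, d)
    with unit-norm rows and labels[t] indexes the prototype for step t.
    """
    norms = np.linalg.norm(hidden_states, axis=1, keepdims=True)
    unit = hidden_states / np.clip(norms, 1e-12, None)

    prototypes = []
    labels = np.empty(len(unit), dtype=int)
    for t, h in enumerate(unit):
        if prototypes:
            sims = np.array(prototypes) @ h
            best = int(np.argmax(sims))
            if sims[best] >= threshold:
                # State matches an existing dictionary entry.
                labels[t] = best
                continue
        # No sufficiently similar entry: start a new one.
        prototypes.append(h)
        labels[t] = len(prototypes) - 1
    return np.array(prototypes), labels


def steer(hidden_state, prototype, strength):
    """Shift a hidden state along a chunk direction.

    Positive strength moves the state toward the chunk, negative
    strength moves it away. This is a hypothetical stand-in for the
    perturbation experiments described in the abstract.
    """
    return hidden_state + strength * prototype


# Synthetic "hidden states": three repeating patterns plus small noise.
rng = np.random.default_rng(0)
pattern_ids = np.tile([0, 1, 2], 10)
states = 3.0 * np.eye(16)[pattern_ids] + 0.01 * rng.standard_normal((30, 16))

prototypes, labels = extract_chunk_dictionary(states)
print("dictionary size:", len(prototypes))
print("first labels:", labels[:6])
```

In a real setting, `states` would be replaced by activations recorded from an RNN or from one layer of an LLM, and the greedy grouping could be swapped for any clustering method. The sketch only illustrates the framing of population activity as a sequence of recurring, identifiable states.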