Explanations of Large Language Models Explain Language Representations in the Brain
This work addresses the challenge of understanding neural language mechanisms for cognitive neuroscience and AI researchers, offering an incremental approach to strengthen the link between LLMs and brain activity.
The study tackled the problem of linking large language models (LLMs) to brain language processing by using explainable AI (XAI) attribution methods to predict fMRI data from narrative listening, finding that these methods robustly predict brain activity with a hierarchical alignment across layers.
Large language models (LLMs) not only exhibit human-like performance but also share computational principles with the brain's language processing mechanisms. While prior research has focused on mapping LLMs' internal representations to neural activity, we propose a novel approach using explainable AI (XAI) to strengthen this link. Applying attribution methods, we quantify the influence of preceding words on LLMs' next-word predictions and use these explanations to predict fMRI data from participants listening to narratives. We find that attribution methods robustly predict brain activity across the language network, revealing a hierarchical pattern: explanations from early layers align with the brain's initial language processing stages, while later layers correspond to more advanced stages. Additionally, layers with greater influence on next-word prediction (reflected in higher attribution scores) demonstrate stronger brain alignment. These results underscore XAI's potential for exploring the neural basis of language and suggest that brain alignment can serve as a measure for assessing the biological plausibility of explanation methods.
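To make the step "use these explanations to predict fMRI data" concrete, the following is a minimal sketch of a linear encoding model. It rests on assumptions not stated in the abstract: ridge regression as the mapping, Pearson correlation on held-out data as the alignment score, and synthetic random features standing in for real attribution vectors. The names `fit_ridge` and `voxelwise_correlation` are hypothetical helpers, not the authors' code.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^{-1} X^T Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)

def voxelwise_correlation(Y_true, Y_pred):
    """Pearson correlation between measured and predicted responses, per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / denom

# Synthetic stand-ins (assumption): each row of X plays the role of an
# attribution vector for one fMRI time point; Y holds simulated voxel responses.
rng = np.random.default_rng(0)
n_train, n_test, n_features, n_voxels = 200, 50, 10, 5
X = rng.standard_normal((n_train + n_test, n_features))
W_true = rng.standard_normal((n_features, n_voxels))
Y = X @ W_true + 0.5 * rng.standard_normal((n_train + n_test, n_voxels))

X_train, X_test = X[:n_train], X[n_train:]
Y_train, Y_test = Y[:n_train], Y[n_train:]

# Fit on the training split, then score predictions on the held-out split.
W = fit_ridge(X_train, Y_train, alpha=1.0)
scores = voxelwise_correlation(Y_test, X_test @ W)
print(scores.round(2))
```

Under these assumptions, repeating the fit with features derived from each LLM layer and comparing the resulting per-voxel scores would be one way to look for the layer-wise hierarchy the abstract describes.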