The Transfer Neurons Hypothesis: An Underlying Mechanism for Language Latent Space Transitions in Multilingual LLMs
This work addresses a fundamental mechanism in multilingual LLM processing and offers interpretability insights for researchers and practitioners, though its contribution is incremental because it builds on existing frameworks.
The paper examines the underexplored internal dynamics of how multilingual LLMs transform language-specific representations. It proposes and validates the Transfer Neurons Hypothesis, which identifies MLP neurons responsible for transferring representations between language-specific latent spaces and a shared semantic latent space, and it shows that these neurons are critical for reasoning.
Recent studies have suggested a processing framework for multilingual inputs in decoder-based LLMs: early layers convert inputs into English-centric and language-agnostic representations; middle layers perform reasoning within an English-centric latent space; and final layers generate outputs by transforming these representations back into language-specific latent spaces. However, the internal dynamics of this transformation and its underlying mechanism remain underexplored. Towards a deeper understanding of this framework, we propose and empirically validate the Transfer Neurons Hypothesis: certain neurons in the MLP module are responsible for transferring representations between language-specific latent spaces and a shared semantic latent space. Furthermore, we show that one function of language-specific neurons, as identified in recent studies, is to facilitate movement between latent spaces. Finally, we show that transfer neurons are critical for reasoning in multilingual LLMs.
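To make the hypothesis above more concrete, the following Python sketch scores the neurons of a toy residual MLP block by how much each one moves a hidden state toward a target latent-space centroid. Everything here is an assumption made for illustration and not the paper's setup or scoring method: the ReLU activation, the hand-picked weights, the centroid, and the function names `mlp_forward` and `transfer_scores`. Each neuron is ablated in turn (its activation set to zero), and the score is the resulting increase in distance to the centroid.

```python
import numpy as np

def mlp_forward(h, w_in, w_out, ablate=None):
    """Toy residual MLP block: h + w_out @ relu(w_in @ h).

    `ablate` is an optional list of neuron indices whose activations
    are set to zero before the output projection.
    """
    act = np.maximum(w_in @ h, 0.0)
    if ablate is not None:
        act = act.copy()
        act[list(ablate)] = 0.0
    return h + w_out @ act

def transfer_scores(h, w_in, w_out, centroid):
    """Score each neuron by how much ablating it increases the distance
    between the block output and `centroid`.

    A positive score means the neuron moves the hidden state toward the
    centroid; a negative score means it moves the state away from it.
    """
    base = np.linalg.norm(mlp_forward(h, w_in, w_out) - centroid)
    scores = np.empty(w_in.shape[0])
    for i in range(w_in.shape[0]):
        ablated = mlp_forward(h, w_in, w_out, ablate=[i])
        scores[i] = np.linalg.norm(ablated - centroid) - base
    return scores

# Hypothetical toy setup: hidden size 4, three MLP neurons.
h = np.array([1.0, 0.0, 0.0, 0.0])         # input hidden state
centroid = np.array([1.0, 1.0, 0.0, 0.0])  # assumed shared-space centroid

w_in = np.array([
    [1.0, 0.0, 0.0, 0.0],   # neuron 0 fires on h
    [1.0, 0.0, 0.0, 0.0],   # neuron 1 fires on h
    [-1.0, 0.0, 0.0, 0.0],  # neuron 2 stays inactive after ReLU
])
w_out = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],  # neuron 0 writes toward the centroid
    [0.0, 1.0, 0.0],  # neuron 1 writes in an unrelated direction
    [0.0, 0.0, 1.0],  # neuron 2 would write here if it fired
])

scores = transfer_scores(h, w_in, w_out, centroid)
top_neuron = int(np.argmax(scores))
print(scores, top_neuron)
```

In this toy case neuron 0 receives the highest score because it is the only neuron that writes along the direction separating the input from the centroid. In a real model the same idea would need actual MLP weights, hidden states from multilingual inputs, and centroids estimated from data, none of which are specified in the text above.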