LG, AI | Jan 12, 2023

Progress measures for grokking via mechanistic interpretability

Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt

arXiv:2301.05217v3 | 55.3 | 880 citations | h-index: 47 | Has Code

Originality: Incremental advance

AI Analysis

This work helps researchers in mechanistic interpretability interpret emergent behaviors in neural networks. It is rated incremental because it builds on existing studies of grokking.

The paper studies the emergent phenomenon of 'grokking' in neural networks by reverse-engineering small transformers trained on modular addition tasks. It finds that grokking arises from the gradual amplification of structured mechanisms, followed by the later removal of memorizing components, and that training splits into three continuous phases: memorization, circuit formation, and cleanup.

Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the amount of parameters, training data, or training steps. One approach to understanding emergence is to find continuous progress measures that underlie the seemingly discontinuous qualitative changes. We argue that progress measures can be found via mechanistic interpretability: reverse-engineering learned behaviors into their individual components. As a case study, we investigate the recently-discovered phenomenon of "grokking" exhibited by small transformers trained on modular addition tasks. We fully reverse engineer the algorithm learned by these networks, which uses discrete Fourier transforms and trigonometric identities to convert addition to rotation about a circle. We confirm the algorithm by analyzing the activations and weights and by performing ablations in Fourier space. Based on this understanding, we define progress measures that allow us to study the dynamics of training and split training into three continuous phases: memorization, circuit formation, and cleanup. Our results show that grokking, rather than being a sudden shift, arises from the gradual amplification of structured mechanisms encoded in the weights, followed by the later removal of memorizing components.
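To make the "addition as rotation about a circle" idea concrete, here is a minimal hand-written sketch in Python. It is not the trained network and not the authors' code. The modulus p = 113, the frequency set, and the function name fourier_mod_add are illustrative assumptions that are not stated in the text above. Each input is mapped to points on the unit circle at a few frequencies, the angle-addition identities combine them, and each candidate answer c is scored by how well it aligns with the combined rotation.

```python
import math

def fourier_mod_add(a, b, p=113, freqs=(14, 35, 41, 42, 52)):
    """Return (a + b) mod p using rotations on the unit circle.

    p and freqs are illustrative assumptions; any nonzero frequencies
    work when p is prime.
    """
    combined = []
    for k in freqs:
        w = 2 * math.pi * k / p
        cos_a, sin_a = math.cos(w * a), math.sin(w * a)
        cos_b, sin_b = math.cos(w * b), math.sin(w * b)
        # Angle-addition identities give cos(w(a+b)) and sin(w(a+b))
        # without ever computing a + b directly.
        cos_ab = cos_a * cos_b - sin_a * sin_b
        sin_ab = sin_a * cos_b + cos_a * sin_b
        combined.append((w, cos_ab, sin_ab))

    def logit(c):
        # cos(w(a+b-c)) = cos(w(a+b))cos(wc) + sin(w(a+b))sin(wc).
        # Every term equals 1 only when c == (a + b) mod p.
        return sum(cab * math.cos(w * c) + sab * math.sin(w * c)
                   for w, cab, sab in combined)

    # The highest-scoring candidate is the predicted answer.
    return max(range(p), key=logit)
```

For example, fourier_mod_add(100, 50) returns 37, which is (100 + 50) mod 113. Summing over several frequencies matters because a single cosine has wrong candidates that score very close to the maximum, while the sum separates the correct answer more clearly.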
