LGDec 24, 2025
The Affine Divergence: Aligning Activation Updates Beyond NormalisationGeorge Bird
A systematic mismatch exists between mathematically ideal and effective activation updates during gradient descent. As intended, parameters update in their direction of steepest descent. However, activations are argued to constitute a more directly impactful quantity to prioritise in optimisation, as they are closer to the loss in the computational graph and carry sample-dependent information through the network. Yet their propagated updates do not take the optimal steepest-descent step. These quantities exhibit non-ideal sample-wise scaling across affine, convolutional, and attention layers. Solutions to correct for this are trivial and, entirely incidentally, derive normalisation from first principles despite motivational independence. Consequently, such considerations offer a fresh and conceptual reframe of normalisation's action, with auxiliary experiments bolstering this mechanistically. Moreover, this analysis makes clear a second possibility: a solution that is functionally distinct from modern normalisations, without scale-invariance, yet remains empirically successful, outperforming conventional normalisers across several tests. This is presented as an alternative to the affine map. This generalises to convolution via a new functional form, "PatchNorm", a compositionally inseparable normaliser. Together, these provide an alternative mechanistic framework that adds to, and counters some of, the discussion of normalisation. Further, it is argued that normalisers are better decomposed into activation-function-like maps with parameterised scaling, thereby aiding the prioritisation of representations during optimisation. Overall, this constitutes a theoretical-principled approach that yields several new functions that are empirically validated and raises questions about the affine + nonlinear approach to model creation.
NEFeb 26
On De-Individuated Neurons: Continuous Symmetries Enable Dynamic TopologiesGeorge Bird
This paper introduces a novel methodology for dynamic networks by leveraging a new symmetry-principled class of primitives, isotropic activation functions. This approach enables real-time neuronal growth and shrinkage of the architectures in response to task demand. This is made possible by network structural changes that are invariant under symmetry reparameterisations, leaving the computation identical under neurogenesis and well approximated under neurodegeneration. This is undertaken by leveraging the isotropic primitives' property of basis independence, resulting in the loss of the individuated neurons implicit in the elementwise functional form. Isotropy thereby allows a freedom in the basis to which layers are decomposed and interpreted as individual artificial neurons. This enables a layer-wise diagonalisation procedure, in which typical interconnected layers, such as dense layers, convolutional kernels, and others, can be reexpressed so that neurons have one-to-one, ordered connectivity within alternating layers. This indicates which one-to-one neuron-to-neuron communications are strongly impactful on overall functionality and which are not. Inconsequential neurons can thus be removed (neurodegeneration), and new inactive scaffold neurons added (neurogenesis) whilst remaining analytically invariant in function. A new tunable model parameter, intrinsic length, is also introduced to ensure this analytical invariance. This approach mathematically equates connectivity pruning with neurodegeneration. The diagonalisation also offers new possibilities for mechanistic interpretability into isotropic networks, and it is demonstrated that isotropic dense networks can asymptotically reach a sparsity factor of 50% whilst retaining exact network functionality. Finally, the construction is generalised, demonstrating a nested functional class for this form of isotropic primitive architectures.
LGMay 9, 2025
The Spotlight Resonance Method: Resolving the Alignment of Embedded ActivationsGeorge Bird
Understanding how deep learning models represent data is currently difficult due to the limited number of methodologies available. This paper demonstrates a versatile and novel visualisation tool for determining the axis alignment of embedded data at any layer in any deep learning model. In particular, it evaluates the distribution around planes defined by the network's privileged basis vectors. This method provides both an atomistic and a holistic, intuitive metric for interpreting the distribution of activations across all planes. It ensures that both positive and negative signals contribute, treating the activation vector as a whole. Depending on the application, several variations of this technique are presented, with a resolution scale hyperparameter to probe different angular scales. Using this method, multiple examples are provided that demonstrate embedded representations tend to be axis-aligned with the privileged basis. This is not necessarily the standard basis, and it is found that activation functions directly result in privileged bases. Hence, it provides a direct causal link between functional form symmetry breaking and representational alignment, explaining why representations have a tendency to align with the neuron basis. Therefore, using this method, we begin to answer the fundamental question of what causes the observed tendency of representations to align with neurons. Finally, examples of so-called grandmother neurons are found in a variety of networks.
LGJul 16, 2025
Emergence of Quantised Representations Isolated to Anisotropic FunctionsGeorge Bird
This paper presents a novel methodology for determining representational structure, which builds upon the existing Spotlight Resonance method. This new tool is used to gain insight into how discrete representations can emerge and organise in autoencoder models, through a controlled ablation study in which only the activation function is altered. Using this technique, the validity of whether function-driven symmetries can act as implicit inductive biases on representations is determined. Representations are found to tend to discretise when the activation functions are defined through a discrete algebraic permutation-equivariant symmetry. In contrast, they remain continuous under a continuous algebraic orthogonal-equivariant definition. This confirms the hypothesis that the symmetries of network primitives can carry unintended inductive biases, which produce task-independent artefactual structures in representations. The discrete symmetry of contemporary forms is shown to be a strong predictor for the production of discrete representations emerging from otherwise continuous distributions -- a quantisation effect. This motivates further reassessment of functional forms in common usage due to such unintended consequences. Moreover, this supports a general causal model for one mode in which discrete representations may form, and could constitute a prerequisite for downstream interpretability phenomena, including grandmother neurons, discrete coding schemes, general linear features and possibly Superposition. Hence, this tool and proposed mechanism for the influence of functional form on representations may provide insights into interpretability research. Finally, preliminary results indicate that quantisation of representations appears to correlate with a measurable increase in reconstruction error, reinforcing previous conjectures that this collapse can be detrimental.
LGMar 26, 2021
Backpropagation Through Time For Networks With Long-Term DependenciesGeorge Bird, Maxim E. Polivoda
Backpropagation through time (BPTT) is a technique of updating tuned parameters within recurrent neural networks (RNNs). Several attempts at creating such an algorithm have been made including: Nth Ordered Approximations and Truncated-BPTT. These methods approximate the backpropagation gradients under the assumption that the RNN only utilises short-term dependencies. This is an acceptable assumption to make for the current state of artificial neural networks. As RNNs become more advanced, a shift towards influence by long-term dependencies is likely. Thus, a new method for backpropagation is required. We propose using the 'discrete forward sensitivity equation' and a variant of it for single and multiple interacting recurrent loops respectively. This solution is exact and also allows the network's parameters to vary between each subsequent step, however it does require the computation of a Jacobian.