LG · AI · Jan 10, 2025

Model Alignment Search

arXiv:2501.06164v7 · 3 citations · h-index: 2
Originality: Incremental advance
AI Analysis

The paper addresses the need for causal methods in neural similarity analysis, particularly for model alignment and for comparisons with biological networks, though its methodological advance appears incremental.

The authors tackle the problem of measuring functional similarity between neural systems by introducing Model Alignment Search (MAS), a method that bidirectionally transfers neural activity between networks and uses the resulting behavior as the similarity measure. They show that it can reveal misalignment in fine-tuned models and that the number of matrices needed to compare n models grows linearly in n.

When can we say that two neural systems perform a task in the same way? What nuances do we miss when we fail to causally probe the representations of the systems, and how do we establish bidirectional causal relationships? In this work, we introduce Model Alignment Search (MAS), a method that bidirectionally transfers neural activity between artificial neural networks and uses their resulting behavior as a measure of functional similarity. We first show that the method can transfer behavior from one frozen neural network (NN) to another in a manner similar to model stitching, and we show how it can differ from correlative similarity measures such as Representational Similarity Analysis. Next, we show empirically and theoretically that the method can be equivalent to model stitching when desired, or can take a form that focuses more restrictively on shared causal information; in both forms, the number of matrices required to compare n models is linear in n. We then present a case study on number-related tasks showing that the method can be used to examine specific subtypes of causal information, demonstrating that numbers can be encoded differently in recurrent models depending on the task, and a second case study showing that MAS can reveal misalignment in fine-tuned DeepSeek-r1-Qwen-1.5B models. Lastly, we augment the loss function with a counterfactual latent (CL) auxiliary objective to improve causal relevance when one of the two networks is causally inaccessible (as is often the case in comparisons with biological networks). We use our results to encourage the use of causal methods in neural similarity analyses and to suggest future explorations of network similarity methodology for model misalignment.
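
The linear-in-n claim in the abstract can be pictured with a small sketch. It is an illustration under stated assumptions, not the authors' implementation: it assumes each model gets one invertible matrix into a shared aligned space (here a toy 2D rotation, so the inverse is just the transpose), and that pairwise stitching would need one matrix per ordered pair of models. All function names are hypothetical.

```python
import math

def pairwise_stitching_maps(n: int) -> int:
    # Assumption: one directed matrix per ordered pair of models.
    return n * (n - 1)

def shared_space_maps(n: int) -> int:
    # Assumption: one matrix per model, mapping into a common aligned space.
    return n

def rotation(theta: float) -> list[list[float]]:
    # Toy stand-in for a learned alignment matrix (orthogonal, hence invertible).
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def matvec(m: list[list[float]], v: list[float]) -> list[float]:
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m: list[list[float]]) -> list[list[float]]:
    return [list(col) for col in zip(*m)]

def transfer(h_src: list[float], q_src, q_dst) -> list[float]:
    # Source activation -> shared space -> destination model's activation space.
    z = matvec(q_src, h_src)
    # For an orthogonal matrix the inverse equals the transpose.
    return matvec(transpose(q_dst), z)

q_a, q_b = rotation(0.3), rotation(1.1)  # hypothetical alignment matrices
h_a = [1.0, 2.0]                         # hypothetical activation from model A
h_b = transfer(h_a, q_a, q_b)            # A -> B
h_back = transfer(h_b, q_b, q_a)         # B -> A, the reverse direction

for n in (2, 5, 10):
    print(n, pairwise_stitching_maps(n), shared_space_maps(n))
```

Because the same per-model matrices serve both directions, the round trip A to B to A recovers the original activation in this toy setting. In the method described above, similarity would instead be judged by the behavior each network produces after receiving the transferred activity, which this sketch does not model.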