Yiding Song

LG
h-index22
3papers
19citations
Novelty50%
AI Score42

3 Papers

LGMay 10
Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds

Yiding Song, Hanming Ye

Existing accounts of grokking explain the phenomena in terms of mechanistic frameworks such as circuit efficiency or lazy-to-rich transitions. However, despite a known dependence between grokking and model size, how model capacity shapes grokking remains an open question. We give an information-theoretic account of this relationship on the task of modular arithmetic, showing that grokking does not immediately occur when a model becomes large enough to memorise the training set, but rather emerges as the outcome of a competition between two measurable timescales: a memorisation speed $T_{\text{mem}}(P)$ and a generalisation speed $T_{\text{gen}}(P)$, both of which are functions of model parameter count $P$. Adapting the information capacity framework of Morris et al. (2025), we estimate $T_{\text{mem}}(P)$ on random-label data of equivalent complexity and $T_{\text{gen}}(P)$ on the modular task itself, and show that grokking emerges close to the parameter scale where these timescales intersect. The framework also suggests an empirical model for predicting memorisation speed given model capacity and dataset complexity, recovering the previously reported empirical observation that larger models memorise faster. Overall, we motivate the formalisation of different learning timescales as important abstractions to study when explaining how model capacity shapes grokking on algorithmic tasks.

IMMar 13, 2024
PAPERCLIP: Associating Astronomical Observations and Natural Language with Multi-Modal Models

Siddharth Mishra-Sharma, Yiding Song, Jesse Thaler

We present PAPERCLIP (Proposal Abstracts Provide an Effective Representation for Contrastive Language-Image Pre-training), a method which associates astronomical observations imaged by telescopes with natural language using a neural network model. The model is fine-tuned from a pre-trained Contrastive Language-Image Pre-training (CLIP) model using successful observing proposal abstracts and corresponding downstream observations, with the abstracts optionally summarized via guided generation using large language models (LLMs). Using observations from the Hubble Space Telescope (HST) as an example, we show that the fine-tuned model embodies a meaningful joint representation between observations and natural language through tests targeting image retrieval (i.e., finding the most relevant observations using natural language queries) and description retrieval (i.e., querying for astrophysical object classes and use cases most relevant to a given observation). Our study demonstrates the potential for using generalist foundation models rather than task-specific models for interacting with astronomical data by leveraging text as an interface.

LGJul 19, 2025
Pruning Increases Orderedness in Recurrent Computation

Yiding Song

Inspired by the prevalence of recurrent circuits in biological brains, we investigate the degree to which directionality is a helpful inductive bias for artificial neural networks. Taking directionality as topologically-ordered information flow between neurons, we formalise a perceptron layer with all-to-all connections (mathematically equivalent to a weight-tied recurrent neural network) and demonstrate that directionality, a hallmark of modern feed-forward networks, can be induced rather than hard-wired by applying appropriate pruning techniques. Across different random seeds our pruning schemes successfully induce greater topological ordering in information flow between neurons without compromising performance, suggesting that directionality is not a prerequisite for learning, but may be an advantageous inductive bias discoverable by gradient descent and sparsification.