Riccardo Rende

DIS-NN
h-index55
4papers
71citations
Novelty55%
AI Score46

4 Papers

DIS-NNJun 1
Scaling Laws for Neural-Network Quantum States

Riccardo Rende, Alessandro Sinibaldi, Luciano Loris Viteritti et al.

Scaling laws, the power-law relations between loss, architecture size, and compute observed in modern neural networks, offer a quantitative way to characterize the complexity of a learning problem, with the exponent governing the decay of the loss reflecting how rapidly additional resources translate into improved accuracy, and thus how hard the target is to learn. Whether an analogous framework can characterize the complexity of physical problems remains open. We address this question for Neural-Network Quantum States, a leading variational approach for strongly correlated quantum many-body systems. Using transformer wave functions to approximate ground states of the $J_1$-$J_2$ Heisenberg model on triangular and square lattices with up to $20\times 20$ sites, we find that the $V$-score, a measure of accuracy of a variational state, decays as a power law in training compute. Under an appropriate rescaling of compute, results for different system sizes collapse onto a single curve, analogous to scaling collapse in critical phenomena. The resulting power law is, to a good approximation, independent of the number of sites, showing that the transformer Ansatz is size-consistent for the systems considered. The exponent decreases systematically with frustration, identifying it as a quantitative measure of representational difficulty of the ground state and establishing scaling laws as a general framework for benchmarking variational ansätze.

DIS-NNApr 14, 2023
Mapping of attention mechanisms to a generalized Potts model

Riccardo Rende, Federica Gerace, Alessandro Laio et al.

Transformers are neural networks that revolutionized natural language processing and machine learning. They process sequences of inputs, like words, using a mechanism called self-attention, which is trained via masked language modeling (MLM). In MLM, a word is randomly masked in an input sequence, and the network is trained to predict the missing word. Despite the practical success of transformers, it remains unclear what type of data distribution self-attention can learn efficiently. Here, we show analytically that if one decouples the treatment of word positions and embeddings, a single layer of self-attention learns the conditionals of a generalized Potts model with interactions between sites and Potts colors. Moreover, we show that training this neural network is exactly equivalent to solving the inverse Potts problem by the so-called pseudo-likelihood method, well known in statistical physics. Using this mapping, we compute the generalization error of self-attention in a model scenario analytically using the replica method.

CLOct 25, 2024
A distributional simplicity bias in the learning dynamics of transformers

Riccardo Rende, Federica Gerace, Alessandro Laio et al.

The remarkable capability of over-parameterised neural networks to generalise effectively has been explained by invoking a ``simplicity bias'': neural networks prevent overfitting by initially learning simple classifiers before progressing to more complex, non-linear functions. While simplicity biases have been described theoretically and experimentally in feed-forward networks for supervised learning, the extent to which they also explain the remarkable success of transformers trained with self-supervised techniques remains unclear. In our study, we demonstrate that transformers, trained on natural language data, also display a simplicity bias. Specifically, they sequentially learn many-body interactions among input tokens, reaching a saturation point in the prediction error for low-degree interactions while continuing to learn high-degree interactions. To conduct this analysis, we develop a procedure to generate \textit{clones} of a given natural language data set, which rigorously capture the interactions between tokens up to a specified order. This approach opens up the possibilities of studying how interactions of different orders in the data affect learning, in natural language processing and beyond.

CLMay 21, 2025
A quantitative analysis of semantic information in deep representations of text and images

Santiago Acevedo, Andrea Mascaretti, Riccardo Rende et al.

Deep neural networks are known to develop similar representations for semantically related data, even when they belong to different domains, such as an image and its description, or the same text in different languages. We present a method for quantitatively investigating this phenomenon by measuring the relative information content of the representations of semantically related data and probing how it is encoded into multiple tokens of large language models (LLMs) and vision transformers. Looking first at how LLMs process pairs of translated sentences, we identify inner ``semantic'' layers containing the most language-transferable information. We find moreover that, on these layers, a larger LLM (DeepSeek-V3) extracts significantly more general information than a smaller one (Llama3.1-8B). Semantic information of English text is spread across many tokens and it is characterized by long-distance correlations between tokens and by a causal left-to-right (i.e., past-future) asymmetry. We also identify layers encoding semantic information within visual transformers. We show that caption representations in the semantic layers of LLMs predict visual representations of the corresponding images. We observe significant and model-dependent information asymmetries between image and text representations.