ML · LG · Sep 18, 2025

Asymptotic Study of In-context Learning with Random Transformers through Equivalent Models

arXiv:2509.15152v1 · 3 citations · h-index: 3 · MLSP
Originality: Incremental advance
AI Analysis

The paper offers theoretical and empirical insight into how MLP layers enhance ICL and how nonlinearity and over-parameterization affect performance. This is an incremental advance for researchers studying Transformers and in-context learning.

The paper studies in-context learning (ICL) in pretrained Transformers for nonlinear regression. It analyzes a random Transformer whose first layer is fixed and whose second layer is trained, in an asymptotic regime, and shows that this model is equivalent to a finite-degree Hermite polynomial model in terms of ICL error. Simulations validate the equivalence across various settings and reveal a double-descent phenomenon.
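To make the phrase "finite-degree Hermite polynomial model" concrete, the sketch below expands a scalar activation in probabilists' Hermite polynomials under a standard Gaussian input and truncates the expansion at a chosen degree. It is only an illustration of the general idea. The function names `hermite_coefficients` and `truncated_activation`, the quadrature size, and the choice of degree are assumptions made here, and the paper's actual equivalent model may be constructed differently.

```python
import math

import numpy as np
from numpy.polynomial import hermite_e as herme


def hermite_coefficients(activation, degree, quad_points=60):
    """Return c_0..c_degree such that activation(x) ~ sum_k c_k He_k(x)
    for x ~ N(0, 1), using Gauss-Hermite quadrature (illustrative helper)."""
    # Nodes and weights for the weight function exp(-x^2 / 2).
    nodes, weights = herme.hermegauss(quad_points)
    # Rescale so that sum(weights * f(nodes)) approximates E[f(g)], g ~ N(0, 1).
    weights = weights / np.sqrt(2.0 * np.pi)
    values = activation(nodes)

    coeffs = np.zeros(degree + 1)
    for k in range(degree + 1):
        basis = np.zeros(k + 1)
        basis[k] = 1.0  # selects He_k
        he_k = herme.hermeval(nodes, basis)
        # E[He_k(g)^2] = k!, so dividing by k! gives the projection coefficient.
        coeffs[k] = np.sum(weights * values * he_k) / math.factorial(k)
    return coeffs


def truncated_activation(coeffs):
    """Finite-degree polynomial surrogate built from the coefficients."""
    return lambda x: herme.hermeval(x, coeffs)
```

As a sanity check, `hermite_coefficients(lambda x: x**3, 3)` recovers the identity x^3 = 3 He_1(x) + He_3(x). For an odd activation such as `np.tanh`, the even-order coefficients vanish up to quadrature error.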

We study the in-context learning (ICL) capabilities of pretrained Transformers in the setting of nonlinear regression. Specifically, we focus on a random Transformer with a nonlinear MLP head where the first layer is randomly initialized and fixed while the second layer is trained. Furthermore, we consider an asymptotic regime where the context length, input dimension, hidden dimension, number of training tasks, and number of training samples jointly grow. In this setting, we show that the random Transformer behaves equivalently to a finite-degree Hermite polynomial model in terms of ICL error. This equivalence is validated through simulations across varying activation functions, context lengths, hidden layer widths (revealing a double-descent phenomenon), and regularization settings. Our results offer theoretical and empirical insights into when and how MLP layers enhance ICL, and how nonlinearity and over-parameterization influence model performance.
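The "random first layer, trained second layer" setup in the abstract can be illustrated with a plain random-feature regression. The sketch below is a deliberately simplified stand-in, not the paper's Transformer: it has no attention and no in-context prompt structure. The function name `random_feature_ridge`, the Gaussian initialization scale, and the use of closed-form ridge regression for the second layer are all assumptions made for illustration.

```python
import numpy as np


def random_feature_ridge(X_train, y_train, X_test, hidden_dim, reg,
                         activation=np.tanh, seed=0):
    """Two-layer model: random fixed first layer, ridge-trained second layer.

    Hypothetical helper for illustration only; it omits the attention and
    in-context components of the model studied in the paper.
    """
    rng = np.random.default_rng(seed)
    input_dim = X_train.shape[1]

    # First layer: drawn once at random and never updated.
    W = rng.standard_normal((hidden_dim, input_dim)) / np.sqrt(input_dim)
    F_train = activation(X_train @ W.T)
    F_test = activation(X_test @ W.T)

    # Second layer: closed-form ridge regression on the fixed features.
    gram = F_train.T @ F_train + reg * np.eye(hidden_dim)
    second_layer = np.linalg.solve(gram, F_train.T @ y_train)
    return F_test @ second_layer
```

Sweeping `hidden_dim` across the number of training samples while keeping `reg` small is the kind of experiment in which one would look for a double-descent curve in test error. Whether and where the peak appears depends on the data and the activation.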

