LG | ST | ML | Jun 5, 2025

Transformers Meet In-Context Learning: A Universal Approximation Theory

arXiv:2506.05200v2 | 6 citations | h-index: 9
Originality: Highly original
AI Analysis

The paper provides a foundational theoretical framework for in-context learning in large language models, with approximation guarantees that extend beyond the convex problems (such as linear regression) covered by prior work.

The paper asks how transformers enable in-context learning and answers with a universal approximation theory: for a general class of target functions, a transformer can be constructed that predicts with vanishingly small risk from a few noisy in-context examples, without any weight updates.

Large language models are capable of in-context learning, the ability to perform new tasks at test time using a handful of input-output examples, without parameter updates. We develop a universal approximation theory to elucidate how transformers enable in-context learning. For a general class of functions (each representing a distinct task), we demonstrate how to construct a transformer that, without any further weight updates, can predict based on a few noisy in-context examples with vanishingly small risk. Unlike prior work that frames transformers as approximators of optimization algorithms (e.g., gradient descent) for statistical learning tasks, we integrate Barron's universal function approximation theory with the algorithm approximator viewpoint. Our approach yields approximation guarantees that are not constrained by the effectiveness of the optimization algorithms being mimicked, extending far beyond convex problems like linear regression. The key is to show that (i) any target function can be nearly linearly represented, with small $\ell_1$-norm, over a set of universal features, and (ii) a transformer can be constructed to find the linear representation -- akin to solving Lasso -- at test time.
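To make points (i) and (ii) concrete, the following is a minimal NumPy sketch, not the paper's transformer construction. It assumes a fixed dictionary of random cosine features as a stand-in for the "universal features", and it solves the Lasso problem with plain proximal gradient descent (ISTA) on the in-context examples. The function names (`features`, `lasso_ista`), the feature choice, the synthetic task, and every hyperparameter are hypothetical and chosen only for illustration.

```python
import numpy as np


def features(x, W, b):
    """Fixed feature dictionary cos(<w_j, x> + b_j); an assumed stand-in
    for the universal features mentioned in the abstract."""
    return np.cos(x @ W.T + b)


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def lasso_objective(Phi, y, c, lam):
    """(1 / (2n)) * ||Phi c - y||^2 + lam * ||c||_1."""
    return 0.5 * np.mean((Phi @ c - y) ** 2) + lam * np.sum(np.abs(c))


def lasso_ista(Phi, y, lam, n_iters=500):
    """Minimize the Lasso objective with ISTA, starting from c = 0."""
    n, m = Phi.shape
    # Step size 1/L, where L = sigma_max(Phi)^2 / n is the Lipschitz
    # constant of the gradient of the smooth least-squares term.
    step = n / (np.linalg.norm(Phi, 2) ** 2)
    c = np.zeros(m)
    for _ in range(n_iters):
        grad = Phi.T @ (Phi @ c - y) / n
        c = soft_threshold(c - step * grad, step * lam)
    return c


rng = np.random.default_rng(0)
d, m, n = 2, 200, 40  # input dim, number of features, in-context examples
lam = 0.01            # l1 penalty weight (illustrative value)

# The dictionary is fixed once and shared across all tasks.
W = rng.normal(size=(m, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=m)

# A synthetic "task": a function with a sparse, small-l1-norm representation.
c_true = np.zeros(m)
c_true[:3] = [1.0, -0.5, 0.8]

# A few noisy in-context examples (x_i, y_i) for this task.
X = rng.normal(size=(n, d))
Phi = features(X, W, b)
y = Phi @ c_true + 0.05 * rng.normal(size=n)

# "Test-time" fit: only the coefficients are estimated; the features stay fixed.
c_hat = lasso_ista(Phi, y, lam)

# Predict at a new query point using the fitted linear representation.
x_query = rng.normal(size=(1, d))
prediction = features(x_query, W, b) @ c_hat
```

The split mirrors the two steps in the abstract: the dictionary (`W`, `b`) never changes between tasks, and only the coefficient vector `c_hat` is recomputed from the examples in context. How the paper realizes this computation inside transformer layers is not shown here.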

