Nicholas Barnfield

ML
h-index5
4papers
10citations
Novelty74%
AI Score53

4 Papers

MLFeb 4
Multi-layer Cross-Attention is Provably Optimal for Multi-modal In-context Learning

Nicholas Barnfield, Subhabrata Sen, Pragya Sur

Recent progress has rapidly advanced our understanding of the mechanisms underlying in-context learning in modern attention-based neural networks. However, existing results focus exclusively on unimodal data; in contrast, the theoretical underpinnings of in-context learning for multi-modal data remain poorly understood. We introduce a mathematically tractable framework for studying multi-modal learning and explore when transformer-like architectures can recover Bayes-optimal performance in-context. To model multi-modal problems, we assume the observed data arises from a latent factor model. Our first result comprises a negative take on expressibility: we prove that single-layer, linear self-attention fails to recover the Bayes-optimal predictor uniformly over the task distribution. To address this limitation, we introduce a novel, linearized cross-attention mechanism, which we study in the regime where both the number of cross-attention layers and the context length are large. We show that this cross-attention mechanism is provably Bayes optimal when optimized using gradient flow. Our results underscore the benefits of depth for in-context learning and establish the provable utility of cross-attention for multi-modal distributions.

76.0MLMay 6
Sharp Capacity Thresholds in Linear Associative Memory: From Winner-Take-All to Listwise Retrieval

Nicholas Barnfield, Juno Kim, Eshaan Nichani et al.

How many key-value associations can a $d\times d$ linear memory store? We show that the answer depends not only on the $d^2$ degrees of freedom in the memory matrix, but also on the retrieval criterion. In an isotropic Gaussian model for the stored pairs, we show that top-1 retrieval, where every signal must beat its largest distractor, requires the logarithmic model-size scale $d^2\asymp n\log n$. We prove that the correlation matrix memory construction, which stores associations by superposing key-target outer products, achieves this scale through a sharp phase transition, and that the same scaling is necessary for any linear memory. Thus the logarithm is the intrinsic extreme-value price of winner-take-all decoding. We next consider listwise retrieval, where the correct target need not be the unique top-scoring item but should remain among the strongest candidates. To formalize this regime, we propose the Tail-Average Margin (TAM), a convex upper-tail criterion that certifies inclusion of the correct target in a controlled candidate list. Under this listwise retrieval criterion, the capacity follows the quadratic scale $d^2\asymp n$. At load $n/d^2\toα$, we develop an exact asymptotic theory for the TAM empirical-risk minimizer through a two-parameter scalar variational principle. The theory has a rich phenomenology: in the ridgeless limit it yields a closed-form critical load separating satisfiable and unsatisfiable phases, and it predicts the limiting laws of true scores, competitor scores, margins, and percentile profiles. Finally, a small-tail extrapolation further leads to the conjectural sharp top-1 threshold $d^2\sim 2n\log n$.

23.9OCApr 29
A Scale-Shape Dual Newton Method for Entropic Least Squares

Nicholas Barnfield, James V. Burke, Michael P. Friedlander et al.

We give a damped inexact Newton method for entropy-regularized least-squares on the nonnegative orthant that converges globally at a linear rate with $O(\logε^{-1})$ iteration complexity, locally at a superlinear-to-quadratic rate, and is immune to the finite-precision overflow that limits classical dual solvers. A scale-shape decomposition of the primal -- separating its scale from its direction -- produces a dual with a nonsingular Jacobian. Objectives and Jacobians are evaluated through stable log-sum-exp and softmax primitives. Lambert W bounds on the scale uniformly control the Jacobian's spectrum, from which both rates follow. The solution map is jointly Lipschitz in the data, regularization parameter, and reference measure, and extends continuously to the vanishing-regularization limit. Experiments on a problem from analytic continuation of quantum Monte Carlo data confirm the predicted overflow resilience and convergence behavior.

LGSep 29, 2025
High-Dimensional Analysis of Single-Layer Attention for Sparse-Token Classification

Nicholas Barnfield, Hugo Cui, Yue M. Lu

When and how can an attention mechanism learn to selectively attend to informative tokens, thereby enabling detection of weak, rare, and sparsely located features? We address these questions theoretically in a sparse-token classification model in which positive samples embed a weak signal vector in a randomly chosen subset of tokens, whereas negative samples are pure noise. In the long-sequence limit, we show that a simple single-layer attention classifier can in principle achieve vanishing test error when the signal strength grows only logarithmically in the sequence length $L$, whereas linear classifiers require $\sqrt{L}$ scaling. Moving from representational power to learnability, we study training at finite $L$ in a high-dimensional regime, where sample size and embedding dimension grow proportionally. We prove that just two gradient updates suffice for the query weight vector of the attention classifier to acquire a nontrivial alignment with the hidden signal, inducing an attention map that selectively amplifies informative tokens. We further derive an exact asymptotic expression for the test error and training loss of the trained attention-based classifier, and quantify its capacity -- the largest dataset size that is typically perfectly separable -- thereby explaining the advantage of adaptive token selection over nonadaptive linear baselines.