Kyle J. Ray

STAT-MECH
h-index14
3papers
6citations
Novelty67%
AI Score43

3 Papers

LGFeb 2
Transformers learn factored representations

Adam Shai, Loren Amdahl-Culleton, Casper L. Christensen et al.

Transformers pretrained via next token prediction learn to factor their world into parts, representing these factors in orthogonal subspaces of the residual stream. We formalize two representational hypotheses: (1) a representation in the product space of all factors, whose dimension grows exponentially with the number of parts, or (2) a factored representation in orthogonal subspaces, whose dimension grows linearly. The factored representation is lossless when factors are conditionally independent, but sacrifices predictive fidelity otherwise, creating a tradeoff between dimensional efficiency and accuracy. We derive precise predictions about the geometric structure of activations for each, including the number of subspaces, their dimensionality, and the arrangement of context embeddings within them. We test between these hypotheses on transformers trained on synthetic processes with known latent structure. Models learn factored representations when factors are conditionally independent, and continue to favor them early in training even when noise or hidden dependencies undermine conditional independence, reflecting an inductive bias toward factoring at the cost of fidelity. This provides a principled explanation for why transformers decompose the world into parts, and suggests that interpretable low dimensional structure may persist even in models trained on complex data.

STAT-MECHApr 26, 2025
Learning Stochastic Thermodynamics Directly from Correlation and Trajectory-Fluctuation Currents

Jinghao Lyu, Kyle J. Ray, James P. Crutchfield

Markedly increased computational power and data acquisition have led to growing interest in data-driven inverse dynamics problems. These seek to answer a fundamental question: What can we learn from time series measurements of a complex dynamical system? For small systems interacting with external environments, the effective dynamics are inherently stochastic, making it crucial to properly manage noise in data. Here, we explore this for systems obeying Langevin dynamics and, using currents, we construct a learning framework for stochastic modeling. Currents have recently gained increased attention for their role in bounding entropy production (EP) from thermodynamic uncertainty relations (TURs). We introduce a fundamental relationship between the cumulant currents there and standard machine-learning loss functions. Using this, we derive loss functions for several key thermodynamic functions directly from the system dynamics without the (common) intermediate step of deriving a TUR. These loss functions reproduce results derived both from TURs and other methods. More significantly, they open a path to discover new loss functions for previously inaccessible quantities. Notably, this includes access to per-trajectory entropy production, even if the observed system is driven far from its steady-state. We also consider higher order estimation. Our method is straightforward and unifies dynamic inference with recent approaches to entropy production estimation. Taken altogether, this reveals a deep connection between diffusion models in machine learning and entropy production estimation in stochastic thermodynamics.

STAT-MECHOct 4, 2025
Optimal Computation from Fluctuation Responses

Jinghao Lyu, Kyle J. Ray, James P. Crutchfield

The energy cost of computation has emerged as a central challenge at the intersection of physics and computer science. Recent advances in statistical physics -- particularly in stochastic thermodynamics -- enable precise characterizations of work, heat, and entropy production in information-processing systems driven far from equilibrium by time-dependent control protocols. A key open question is then how to design protocols that minimize thermodynamic cost while ensur- ing correct outcomes. To this end, we develop a unified framework to identify optimal protocols using fluctuation response relations (FRR) and machine learning. Unlike previous approaches that optimize either distributions or protocols separately, our method unifies both using FRR-derived gradients. Moreover, our method is based primarily on iteratively learning from sampled noisy trajectories, which is generally much easier than solving for the optimal protocol directly from a set of governing equations. We apply the framework to canonical examples -- bit erasure in a double-well potential and translating harmonic traps -- demonstrating how to construct loss functions that trade-off energy cost against task error. The framework extends trivially to underdamped systems, and we show this by optimizing a bit-flip in an underdamped system. In all computations we test, the framework achieves the theoretically optimal protocol or achieves work costs comparable to relevant finite time bounds. In short, the results provide principled strategies for designing thermodynamically efficient protocols in physical information-processing systems. Applications range from quantum gates robust under noise to energy-efficient control of chemical and synthetic biological networks.