LG AIJan 23

Process-Tensor Tomography of SGD: Measuring Non-Markovian Memory via Back-Flow of Distinguishability

arXiv:2601.16563v11.4h-index: 7

Originality Incremental advance

AI Analysis

This provides a principled diagnostic tool for understanding training memory in SGD, which is incremental but offers a testable framework for comparing optimizers and curricula.

The paper tackles the problem of measuring non-Markovian memory in stochastic gradient descent (SGD) training by introducing a model-agnostic witness based on back-flow of distinguishability, showing consistent positive effects with tight confidence intervals and amplification under conditions like higher momentum.

This work proposes neural training as a \emph{process tensor}: a multi-time map that takes a sequence of controllable instruments (batch choices, augmentations, optimizer micro-steps) and returns an observable of the trained model. Building on this operational lens, we introduce a simple, model-agnostic witness of training memory based on \emph{back-flow of distinguishability}. In a controlled two-step protocol, we compare outcome distributions after one intervention versus two; the increase $Δ_{\mathrm{BF}} = D_2 - D_1>0$ (with $D\in\{\mathrm{TV}, \mathrm{JS}, \mathrm{H}\}$ measured on softmax predictions over a fixed probe set) certifies non-Markovianity. We observe consistent positive back-flow with tight bootstrap confidence intervals, amplification under higher momentum, larger batch overlap, and more micro-steps, and collapse under a \emph{causal break} (resetting optimizer state), directly attributing the effect to optimizer/data-state memory. The witness is robust across TV/JS/Hellinger, inexpensive to compute, and requires no architectural changes. We position this as a \emph{measurement} contribution: a principled diagnostic and empirical evidence that practical SGD deviates from the Markov idealization. An exploratory case study illustrates how the micro-level signal can inform curriculum orderings. "Data order matters" turns into a testable operator with confidence bounds, our framework offers a common stage to compare optimizers, curricula, and schedules through their induced training memory.

View on arXiv PDF

Similar