Aditya Kudre

57.3 LG May 15

Transformer-like Inference from Optimal Control

Aditya Kudre, Heng-Sheng Chang, Prashant G. Mehta

Decoder-only transformers compute the conditional probability of the next token from a sequence of past observations. This paper derives, from first principles, inference architectures that solve the same prediction problem, and in doing so recovers transformer-like layer operations as a consequence of optimal control theory. The framework is developed for two model classes: a nonlinear model of discrete-valued processes, directly motivated by the transformer, and a linear Gaussian model as a tractable baseline. For both model classes, the prediction objective is reformulated as an optimal control problem whose solution yields an explicit inference algorithm, the dual filter, with a layer structure that mirrors that of a decoder-only transformer. Numerical experiments compare the optimal control with the attention weights of a trained transformer. These experiments reveal that when the embedding dimension is insufficient, the transformer implicitly exploits non-Markovian structure.

12.3 SY Apr 5

Duality Theory for Non-Markovian Linear Gaussian Models

Aditya Kudre, Heng-Sheng Chang, Prashant G. Mehta

This work develops a duality theory for partially observed linear Gaussian models in discrete time. The state process evolves according to a causal but non-Markovian (or higher-order Gauss-Markov) structure, captured by a lower-triangular transition operator, which is related to the transformer, with $T$ as the context length. The main contributions are: (i) a dual control system for the linear Gaussian model, formulated as a backward difference equation (B$\Delta$E); (ii) a duality principle establishing that a specific linear-quadratic optimal control problem for the B$\Delta$E is dual to the filtering problem for the partially observed model; and (iii) an explicit optimal control formula yielding a novel (transformer-like) linear predictor, referred to as the dual filter, whose computational complexity scales linearly in the time horizon $T$, in contrast to the $O(T^3)$ cost of classical smoothing and Wiener-Hopf approaches.
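For orientation, the classical baseline mentioned in the abstract can be sketched for a toy scalar model. This is not the dual filter and not the paper's formulation: the parameterization `x = A w` with a lower-triangular `A`, the scalar observation gain `h`, the noise variance `r`, and the function name `classical_predictor_gain` are all assumptions made for illustration. The sketch solves the normal (Wiener-Hopf-style) equations with a dense solve, which is the cubic-cost step the abstract contrasts against.

```python
import numpy as np

def classical_predictor_gain(A, h, r):
    """Gain k with E[z_T | z_0..z_{T-1}] = k @ z_past for the assumed toy model
    x = A w, z_t = h * x_t + v_t, where w ~ N(0, I) and v_t ~ N(0, r).

    A is lower-triangular, so each x_t depends on all noise terms up to
    time t (causal, but not Markov in general).
    """
    n = A.shape[0]
    cov_x = A @ A.T                          # covariance of the stacked state
    cov_z = h**2 * cov_x + r * np.eye(n)     # covariance of the stacked observations
    past = cov_z[:-1, :-1]                   # Cov(z_past, z_past)
    cross = cov_z[-1, :-1]                   # Cov(z_T, z_past)
    # Normal equations: past @ k = cross. The dense solve costs O(T^3).
    return np.linalg.solve(past, cross)

# Hypothetical example: random lower-triangular operator, horizon T = 8.
rng = np.random.default_rng(0)
T = 8
A = np.tril(rng.normal(size=(T + 1, T + 1)))
gain = classical_predictor_gain(A, h=1.0, r=0.5)
```

The resulting `gain` weights the `T` past observations to predict the next one. The abstract's claim is that the dual filter obtains a predictor for this kind of model at a cost linear in `T`, without forming or solving the dense system above.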

2 Papers