LG, Jul 17, 2025
Change of Thought: Adaptive Test-Time Computation
Mrinal Mathur, Mike Doan, Barak Pearlmutter et al.
Transformers evaluated in a single, fixed-depth pass are provably limited in expressive power to the constant-depth circuit class TC0. Running a Transformer autoregressively removes that ceiling, first in next-token prediction and, more recently, in chain-of-thought reasoning. Both regimes rely on feedback loops that decode internal states into tokens only to re-encode them in subsequent steps. While this "thinking aloud" mirrors human reasoning, biological brains iterate without externalising intermediate states as language. To boost the expressive power of encoder Transformers without resorting to token-level autoregression, we introduce the SELF-Transformer: an encoder layer that iteratively refines its own attention weights to a fixed point. Instead of producing the alignment matrix that remixes the input sequence in a single pass, the SELF-Transformer iteratively updates that matrix internally, scaling test-time computation with input difficulty. This adaptivity yields up to 20% accuracy gains on encoder-style benchmarks without increasing parameter count, demonstrating that input-adaptive alignment at test time offers substantial benefits for only a modest extra compute budget. SELF-Transformers thus recover much of the expressive power of iterative reasoning while preserving the simplicity of pure encoder architectures.
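The abstract does not spell out the update rule, so the following is only a toy sketch of the general idea of refining an attention matrix until it stops changing. The function name `refine_attention`, the identity query/key projections, the specific update (recompute scores from the currently remixed tokens), and the stopping tolerance are all assumptions made for illustration, not the SELF-Transformer layer itself.

```python
import math


def softmax_rows(scores):
    """Row-wise softmax with max-subtraction for numerical stability."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def refine_attention(x, tol=1e-6, max_iters=50):
    """Hypothetical fixed-point refinement of an attention matrix.

    Assumption: queries and keys are the current hidden states themselves
    (identity projections). Each step recomputes attention from the hidden
    states produced by the previous attention matrix.
    Returns the attention matrix and the number of steps actually used.
    """
    d = len(x[0])
    scale = 1.0 / math.sqrt(d)
    h = x
    attn = None
    for step in range(1, max_iters + 1):
        scores = [[scale * sum(p * q for p, q in zip(hi, hj)) for hj in h] for hi in h]
        new_attn = softmax_rows(scores)
        if attn is not None:
            delta = max(
                abs(a - b)
                for row_new, row_old in zip(new_attn, attn)
                for a, b in zip(row_new, row_old)
            )
            if delta < tol:
                return new_attn, step
        attn = new_attn
        h = matmul(attn, x)  # remix the original inputs with the current weights
    return attn, max_iters


# Three toy tokens with two features each (made-up values).
tokens = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.4]]
attn, steps = refine_attention(tokens)
print(steps, [round(sum(row), 6) for row in attn])
```

The step count returned by the loop depends on the input, which is the sense in which compute in this sketch adapts to the example; how the actual model measures or bounds that is not stated in the abstract.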
LG, Feb 18, 2021
Peering Beyond the Gradient Veil with Distributed Auto Differentiation
Bradley T. Baker, Aashis Khanal, Vince D. Calhoun et al.
Although distributed machine learning has opened up many new and exciting research frontiers, fragmentation of models and data across different machines, nodes, and sites still results in considerable communication overhead, impeding reliable training in real-world contexts. The focus on gradients as the primary shared statistic during training has spawned a number of intuitive algorithms for distributed deep learning; however, gradient-centric training of large deep neural networks (DNNs) tends to be communication-heavy, often requiring additional adaptations such as sparsity constraints, compression, and quantization to curtail bandwidth. We introduce an innovative, communication-friendly approach for training distributed DNNs, which capitalizes on the outer-product structure of the gradient as revealed by the mechanics of auto-differentiation. The exposed structure of the gradient evokes a new class of distributed learning algorithms, which is naturally more communication-efficient than full gradient sharing. Our approach, called distributed auto-differentiation (dAD), builds on a marriage of rank-based compression and the innate structure of the gradient as an outer product. We demonstrate that dAD trains more efficiently than other state-of-the-art distributed methods on modern architectures, such as transformers, when applied to large-scale text and imaging datasets. The future of distributed learning, we determine, need not be dominated by gradient-centric algorithms.
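To make the "outer-product structure of the gradient" concrete, here is a small sketch for a single linear layer with a squared-error loss. It is not the dAD algorithm; it only shows that the weight gradient for one sample factors into an output-side error vector and the input activation, so the two factors carry the same information as the full matrix. The layer sizes, values, and helper names are made up for illustration.

```python
def linear(w, x):
    """y = W x for a weight matrix given as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def grad_factors(w, x, t):
    """Factors of dL/dW for L = 0.5 * ||W x - t||^2 on one sample.

    Backpropagation gives dL/dW = delta x^T, with delta = W x - t.
    """
    delta = [y - ti for y, ti in zip(linear(w, x), t)]
    return delta, x


def outer(u, v):
    """Outer product u v^T."""
    return [[ui * vj for vj in v] for ui in u]


w = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]  # 2 outputs, 3 inputs
x = [1.0, 2.0, 3.0]
t = [0.0, 1.0]

delta, act = grad_factors(w, x, t)
grad = outer(delta, act)  # full gradient rebuilt from the two factors

factor_size = len(delta) + len(act)    # numbers needed to send the factors
full_size = len(grad) * len(grad[0])   # numbers needed to send the gradient
print(grad)
print(factor_size, full_size)
```

For one sample the factors cost `out + in` numbers against `out * in` for the full gradient. With a batch, the factors grow with the batch size, which is presumably where the rank-based compression mentioned in the abstract comes in.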
CP, Jun 7, 2017
Mini-symposium on automatic differentiation and its applications in the financial industry
Sébastien Geeraert, Charles-Albert Lehalle, Barak Pearlmutter et al.
Automatic differentiation has long been used in applied mathematics as an alternative to finite differences, improving the accuracy of numerically computed derivatives. It can be used whenever a numerical minimization is involved. Sitting between symbolic differentiation and standard numerical schemes, this approach relies on software that mechanically applies the chain rule to obtain an exact value for the desired derivative. It has a cost in memory and CPU consumption. Participants in financial markets (banks, insurers, financial intermediaries, etc.) need to compute derivatives to obtain the sensitivity of their exposure to well-defined potential market moves. It is a way to understand variations of their balance sheets in specific cases. Since the 2008 crisis, regulation has demanded that this kind of exposure be computed for many different cases, to ensure that market participants are aware of, and ready to face, a wide spectrum of configurations. This paper shows how automatic differentiation provides a partial answer to this recent explosion in the number of computations to perform. One part of the answer is a straightforward application of Adjoint Algorithmic Differentiation (AAD), but it is not enough. Since financial sensitivities involve specific functions and mix differentiation with Monte Carlo simulations, dedicated tools and associated theoretical results are needed. We give here short introductions to typical cases arising when one uses AAD on financial markets.
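As a rough illustration of the adjoint (reverse-mode) idea, the sketch below records each operation with its local partial derivatives and then sweeps backwards once to obtain all sensitivities. The `Var` class, the `forward_pv` function, and the contract values are hypothetical and chosen only to keep the example small; real AAD tooling for Monte Carlo pricing involves much more than this.

```python
import math


class Var:
    """A value plus links to its parents with local partial derivatives."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (parent Var, d(self)/d(parent))
        self.adjoint = 0.0

    def __sub__(self, other):
        other = lift(other)
        return Var(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        other = lift(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def lift(x):
    """Wrap plain numbers as constants on the tape."""
    return x if isinstance(x, Var) else Var(float(x))


def exp(x):
    v = math.exp(x.value)
    return Var(v, ((x, v),))


def backward(output):
    """Propagate adjoints from the output back to every input in one sweep."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)  # parents are appended before their children

    visit(output)
    output.adjoint = 1.0
    for node in reversed(order):
        for parent, partial in node.parents:
            parent.adjoint += node.adjoint * partial


def forward_pv(s, r, k=100.0, t=2.0):
    """Toy present value of a forward contract: s - k * exp(-r * t)."""
    return s - exp(r * (-t)) * k


spot, rate = Var(105.0), Var(0.05)
pv = forward_pv(spot, rate)
backward(pv)  # one backward pass yields both sensitivities

# Central finite difference on the rate, for comparison.
bump = 1e-6
fd_rate = (forward_pv(Var(105.0), Var(0.05 + bump)).value
           - forward_pv(Var(105.0), Var(0.05 - bump)).value) / (2 * bump)
print(spot.adjoint, rate.adjoint, fd_rate)
```

The backward sweep returns the sensitivity to every input at once, whereas the finite-difference estimate needs extra revaluations per input and carries truncation error. The memory used to store the recorded operations is the cost the abstract alludes to.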