Sergey Pankov

LG
4 papers
15 citations

4 Papers

RO Apr 5, 2022
Configuration Path Control

Sergey Pankov

Reinforcement learning methods often produce brittle policies -- policies that perform well during training but generalize poorly beyond their direct training experience, and thus become unstable under small disturbances. To address this issue, we propose a method for stabilizing a control policy in the space of configuration paths. It is applied post-training and relies purely on the data produced during training, as well as on an instantaneous control-matrix estimation. The approach is evaluated empirically on a planar bipedal walker subjected to a variety of perturbations. The control policies obtained via reinforcement learning are compared against their stabilized counterparts. Across different experiments, we find a two- to four-fold increase in stability, measured in terms of the perturbation amplitudes. We also provide a zero-dynamics interpretation of our approach.

LG May 21, 2025
SUS backprop: linear backpropagation algorithm for long inputs in transformers

Sergey Pankov, Georges Harik

It is straightforward to design an unbiased gradient estimator that stochastically cuts the backpropagation flow through any part of a computational graph. By cutting the parts that have little effect on the computation, one can, in some situations, save a significant amount of backpropagation computation in exchange for a minimal increase in the stochastic gradient variance. Such a situation occurs in the attention mechanism of the transformer architecture. For long sequences, attention becomes the limiting factor, as its compute requirements increase quadratically with sequence length $n$. At the same time, most attention weights become very small, as most attention heads tend to connect a given token with only a small fraction of other tokens in the sequence. These weights become promising targets for cutting backpropagation. We propose a simple probabilistic rule controlled by a single parameter $c$ that cuts backpropagation through most attention weights, leaving at most $c$ interactions per token per attention head. This brings a factor of $c/n$ reduction in the compute required for the attention backpropagation, turning it from quadratic $O(n^2)$ to linear $O(nc)$ complexity. We have empirically verified that, for a typical transformer model, cutting about $99\%$ of the attention gradient flow (i.e., choosing $c \sim 25$-$30$) results in a relative gradient variance increase of only about $1\%$ for $n \sim 2000$, and this increase diminishes with $n$. This approach is amenable to efficient sparse matrix implementation, which makes it promising for rendering the cost of a backward pass negligible relative to the cost of a forward pass when training a transformer model on long sequences.
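The general idea of an unbiased estimator that stochastically cuts gradient flow can be illustrated with a small sketch. The abstract does not state the actual rule, so everything specific here is an assumption made for illustration: the function names `keep_probabilities` and `sparsify_gradient` are hypothetical, and the keep probability $\min(1, c \cdot a_i)$ for an attention weight $a_i$ is one plausible choice, not necessarily the rule proposed in the paper. Because the weights in one attention row sum to 1, this choice keeps at most $c$ entries per row in expectation.

```python
import random


def keep_probabilities(weights, c):
    """Assumed rule (not taken from the paper): keep the gradient path
    through attention weight a_i with probability min(1, c * a_i).
    Since the weights of one row sum to 1, the expected number of kept
    entries is at most c."""
    return [min(1.0, c * a) for a in weights]


def sparsify_gradient(grad, probs, rng):
    """Drop each gradient entry with probability 1 - p_i and rescale the
    survivors by 1 / p_i. For p_i > 0 the expectation of each output
    entry equals the original entry, so the estimator is unbiased.
    Entries with p_i == 0 are always dropped; softmax weights are
    strictly positive, so this case does not arise in that setting."""
    out = []
    for g, p in zip(grad, probs):
        if p > 0.0 and rng.random() < p:
            out.append(g / p)
        else:
            out.append(0.0)
    return out


def monte_carlo_mean(grad, probs, n_trials, rng):
    """Average many sparsified gradients to check unbiasedness empirically."""
    totals = [0.0] * len(grad)
    for _ in range(n_trials):
        sample = sparsify_gradient(grad, probs, rng)
        totals = [t + s for t, s in zip(totals, sample)]
    return [t / n_trials for t in totals]
```

With this assumed rule, a row of length $n = 2000$ and $c = 25$ retains at most $c/n = 1.25\%$ of the entries in expectation, which is in line with the "about $99\%$" of cut gradient flow quoted in the abstract. A sparse implementation would only visit the surviving entries; the dense loop above is kept for readability.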

BIO-PH Jun 20, 2021
Three-dimensional bipedal model with zero-energy-cost walking

Sergey Pankov

We study a three-dimensional articulated rigid-body biped model that possesses walking gaits with zero cost of transport. Energy losses are avoided due to the complete elimination of the foot-ground collisions by the concerted oscillatory motion of the model's parts. The model consists of two parts connected via a universal joint. It does not rely on any geometry-altering mechanisms, massless parts or springs. Despite the model's simplicity, its collisionless gaits feature walking with finite speed, foot clearance and ground friction. The collisionless spectrum can be studied analytically in the small-movement limit, revealing infinitely many periodic modes. The modes differ in the number of sagittal and coronal plane oscillations at different stages of the walking cycle. We focus on the mode with the minimal number of such oscillations, presenting its complete analytical solution. We then numerically evolve it toward a general solution beyond the small-movement limit. A general collisionless mode can be tuned by adjusting a single model parameter. Some of the presented results display a surprising degree of generality and universality.

LG Nov 15, 2018
Reward-estimation variance elimination in sequential decision processes

Sergey Pankov

Policy gradient methods are very attractive in reinforcement learning due to their model-free nature and convergence guarantees. These methods, however, suffer from high variance in gradient estimation, resulting in poor sample efficiency. To mitigate this issue, a number of variance-reduction approaches have been proposed. Unfortunately, in challenging problems with delayed rewards, these approaches either bring a relatively modest improvement or reduce variance at the expense of introducing a bias and undermining convergence. The unbiased methods of gradient estimation, in general, only partially reduce variance, without eliminating it completely even in the limit of exact knowledge of the value functions and problem dynamics, as one might have wished. In this work we propose an unbiased method that does completely eliminate variance under some commonly encountered conditions. Of practical interest is the limit of deterministic dynamics and small policy stochasticity. In the case of a quadratic value function, as in linear quadratic Gaussian models, the policy randomness need not be small. We use such a model to analyze the performance of the proposed variance-elimination approach and compare it with standard variance-reduction methods. The core idea behind the approach is to use control variates at all future times down the trajectory. We present both model-based and model-free formulations.
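For readers unfamiliar with control variates, the following sketch shows the standard single-step version of the idea: subtracting a baseline from the reward in a score-function gradient estimator. This is not the variance-elimination method of the paper, which applies control variates at all future times along a trajectory; it only illustrates why such a subtraction leaves the estimate unbiased while lowering its variance. The one-step Gaussian policy, the quadratic `reward`, and all names and parameter values are assumptions chosen for illustration.

```python
import random
import statistics


def reward(action, target=1.0):
    """Hypothetical quadratic reward, peaked at `target`."""
    return -(action - target) ** 2


def score_function_samples(mu, sigma, baseline, n_samples, rng):
    """Per-sample estimates of d/dmu E[reward(a)] for a ~ N(mu, sigma^2).

    Each sample is (reward(a) - baseline) * d/dmu log p(a), where
    d/dmu log p(a) = (a - mu) / sigma^2. The score has zero mean, so
    subtracting any constant baseline does not bias the estimate."""
    samples = []
    for _ in range(n_samples):
        a = rng.gauss(mu, sigma)
        score = (a - mu) / sigma ** 2
        samples.append((reward(a) - baseline) * score)
    return samples


rng = random.Random(0)
mu, sigma = 0.0, 0.1  # small policy stochasticity

# No control variate.
plain = score_function_samples(mu, sigma, 0.0, 20000, rng)
# Control variate: the reward evaluated at the policy mean.
with_cv = score_function_samples(mu, sigma, reward(mu), 20000, rng)

# Exact gradient for this toy problem: -2 * (mu - target) = 2.
mean_plain = statistics.fmean(plain)
mean_cv = statistics.fmean(with_cv)
var_plain = statistics.pvariance(plain)
var_cv = statistics.pvariance(with_cv)
```

Both sample means estimate the same gradient, while the per-sample variance with the baseline is substantially smaller. Some variance remains in this simple version, which is consistent with the abstract's remark that standard unbiased methods reduce variance only partially.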