LGMay 19Code
LionMuon: Alternating Spectral and Sign Descent for Efficient TrainingArman Bolatov, Artem Riabinin, Nikita Kornilov et al.
In large-scale optimization, the cheapness and effectiveness of update steps are the most crucial factors for a successful optimizer. Sign-based optimizers like Lion or Signum produce cheap per-step updates, whereas Muon's spectral matrix-sign update gives a much stronger direction at a substantially higher per-step cost. In this work, we propose LionMuon, which retains the effectiveness of Muon steps while considerably cutting the averaged iteration cost, similar to sign-based methods. It alternates between Lion's and Muon's updates on a fixed period P, sharing a single dual-EMA momentum buffer between them. The optimizer state memory therefore matches Lion and is exactly half of AdamW's. A simpler single-EMA variant, SignMuon, by itself already outperforms pure Muon. At P = 2, LionMuon Pareto-dominates Muon, Lion, Signum, and AdamW on every dataset and architecture we tested at 124M model size, reaching lower validation loss at lower compute, and the same advantage persists at 355M and 720M scale. On the theory side, we prove sharp complexity bounds under heavy-tailed noise which are governed by period-averaged smoothness and noise that interpolate between Muon's and Lion's constants. These bounds predict the compute-optimal period and the conditions under which LionMuon outruns Muon and Lion. Code: https://github.com/brain-lab-research/lion-muon
MLOct 31, 2025
On the Equivalence of Optimal Transport Problem and Action Matching with Optimal Vector FieldsNikita Kornilov, Alexander Korotin
Flow Matching (FM) method in generative modeling maps arbitrary probability distributions by constructing an interpolation between them and then learning the vector field that defines ODE for this interpolation. Recently, it was shown that FM can be modified to map distributions optimally in terms of the quadratic cost function for any initial interpolation. To achieve this, only specific optimal vector fields, which are typical for solutions of Optimal Transport (OT) problems, need to be considered during FM loss minimization. In this note, we show that considering only optimal vector fields can lead to OT in another approach: Action Matching (AM). Unlike FM, which learns a vector field for a manually chosen interpolation between given distributions, AM learns the vector field that defines ODE for an entire given sequence of distributions.
OCFeb 11, 2025
Sign Operator for Coping with Heavy-Tailed Noise in Non-Convex Optimization: High Probability Bounds Under $(L_0, L_1)$-SmoothnessNikita Kornilov, Philip Zmushko, Andrei Semenov et al.
In recent years, non-convex optimization problems are more often described by generalized $(L_0, L_1)$-smoothness assumption rather than standard one. Meanwhile, severely corrupted data used in these problems has increased the demand for methods capable of handling heavy-tailed noises, i.e., noises with bounded $κ$-th moment. Motivated by these real-world trends and challenges, we explore sign-based methods in this setup and demonstrate their effectiveness in comparison with other popular solutions like clipping or normalization. In theory, we prove the first-known high probability convergence bounds under $(L_0, L_1)$-smoothness and heavy-tailed noises with mild parameter dependencies. In the case of standard smoothness, these bounds are novel for sign-based methods as well. In particular, SignSGD with batching achieves sample complexity $\tilde{O}\left(\left(\frac{ΔL_0d}{\varepsilon^2} + \frac{ΔL_1d^\frac{3}{2}}{\varepsilon}\right)\left[1 + \left(\fracσ{\varepsilon}\right)^\fracκ{κ-1}\right]\right), κ\in (1,2]$. Under the assumption of symmetric noises, SignSGD with Majority Voting can robustly work on the whole range of $κ\in (0,2]$ with complexity $\tilde{O}\left(\left(\frac{ΔL_0d}{\varepsilon^2} + \frac{ΔL_1d^\frac{3}{2}}{\varepsilon}\right)\left[\frac{1}{κ^2} + \frac{σ^2}{\varepsilon^2}\right]\right)$. We also obtain results for parameter-agnostic setups, Polyak-Lojasiewicz functions and momentum-based methods (in expectation). Our theoretical findings are supported by the superior performance of sign-based methods in training Large Language Models compared to clipping and normalization.
MLSep 26, 2025
Universal Inverse Distillation for Matching Models with Real-Data Supervision (No GANs)Nikita Kornilov, David Li, Tikhon Mavrin et al.
While achieving exceptional generative quality, modern diffusion, flow, and other matching models suffer from slow inference, as they require many steps of iterative generation. Recent distillation methods address this by training efficient one-step generators under the guidance of a pre-trained teacher model. However, these methods are often constrained to only one specific framework, e.g., only to diffusion or only to flow models. Furthermore, these methods are naturally data-free, and to benefit from the usage of real data, it is required to use an additional complex adversarial training with an extra discriminator model. In this paper, we present RealUID, a universal distillation framework for all matching models that seamlessly incorporates real data into the distillation procedure without GANs. Our RealUID approach offers a simple theoretical foundation that covers previous distillation methods for Flow Matching and Diffusion models, and is also extended to their modifications, such as Bridge Matching and Stochastic Interpolants.
MLMar 19, 2024
Optimal Flow Matching: Learning Straight Trajectories in Just One StepNikita Kornilov, Petr Mokrov, Alexander Gasnikov et al.
Over the several recent years, there has been a boom in development of Flow Matching (FM) methods for generative modeling. One intriguing property pursued by the community is the ability to learn flows with straight trajectories which realize the Optimal Transport (OT) displacements. Straightness is crucial for the fast integration (inference) of the learned flow's paths. Unfortunately, most existing flow straightening methods are based on non-trivial iterative FM procedures which accumulate the error during training or exploit heuristics based on minibatch OT. To address these issues, we develop and theoretically justify the novel \textbf{Optimal Flow Matching} (OFM) approach which allows recovering the straight OT displacement for the quadratic transport in just one FM step. The main idea of our approach is the employment of vector field for FM which are parameterized by convex functions.
LGMay 11, 2023
Implicitly normalized forecaster with clipping for linear and non-linear heavy-tailed multi-armed banditsYuriy Dorn, Nikita Kornilov, Nikolay Kutuzov et al.
The Implicitly Normalized Forecaster (INF) algorithm is considered to be an optimal solution for adversarial multi-armed bandit (MAB) problems. However, most of the existing complexity results for INF rely on restrictive assumptions, such as bounded rewards. Recently, a related algorithm was proposed that works for both adversarial and stochastic heavy-tailed MAB settings. However, this algorithm fails to fully exploit the available data. In this paper, we propose a new version of INF called the Implicitly Normalized Forecaster with clipping (INF-clip) for MAB problems with heavy-tailed reward distributions. We establish convergence results under mild assumptions on the rewards distribution and demonstrate that INF-clip is optimal for linear heavy-tailed stochastic MAB problems and works well for non-linear ones. Furthermore, we show that INF-clip outperforms the best-of-both-worlds algorithm in cases where it is difficult to distinguish between different arms.