Nikita Puchkin

LG
Semantic Scholar Profile
h-index5
16papers
128citations
Novelty56%
AI Score53

16 Papers

OCNov 7, 2023
Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems

Nikita Puchkin, Eduard Gorbunov, Nikolay Kutuzov et al.

We consider stochastic optimization problems with heavy-tailed noise with structured density. For such problems, we show that it is possible to get faster rates of convergence than $\mathcal{O}(K^{-2(α- 1)/α})$, when the stochastic gradients have finite moments of order $α\in (1, 2]$. In particular, our analysis allows the noise norm to have an unbounded expectation. To achieve these results, we stabilize stochastic gradients, using smoothed medians of means. We prove that the resulting estimates have negligible bias and controllable variance. This allows us to carefully incorporate them into clipped-SGD and clipped-SSTM and derive new high-probability complexity bounds in the considered setup.

LGFeb 21, 2023
Exploring Local Norms in Exp-concave Statistical Learning

Nikita Puchkin, Nikita Zhivotovskiy · eth-zurich

We consider the problem of stochastic convex optimization with exp-concave losses using Empirical Risk Minimization in a convex class. Answering a question raised in several prior works, we provide a $O( d / n + \log( 1 / δ) / n )$ excess risk bound valid for a wide class of bounded exp-concave losses, where $d$ is the dimension of the convex reference set, $n$ is the sample size, and $δ$ is the confidence level. Our result is based on a unified geometric assumption on the gradient of losses and the notion of local norms.

MLFeb 9
Schrödinger bridge problem via empirical risk minimization

Denis Belomestny, Alexey Naumov, Nikita Puchkin et al.

We study the Schrödinger bridge problem when the endpoint distributions are available only through samples. Classical computational approaches estimate Schrödinger potentials via Sinkhorn iterations on empirical measures and then construct a time-inhomogeneous drift by differentiating a kernel-smoothed dual solution. In contrast, we propose a learning-theoretic route: we rewrite the Schrödinger system in terms of a single positive transformed potential that satisfies a nonlinear fixed-point equation and estimate this potential by empirical risk minimization over a function class. We establish uniform concentration of the empirical risk around its population counterpart under sub-Gaussian assumptions on the reference kernel and terminal density. We plug the learned potential into a stochastic control representation of the bridge to generate samples. We illustrate performance of the suggested approach with numerical experiments.

MLJun 21, 2022
A Contrastive Approach to Online Change Point Detection

Artur Goldman, Nikita Puchkin, Valeriia Shcherbakova et al.

We suggest a novel procedure for online change point detection. Our approach expands an idea of maximizing a discrepancy measure between points from pre-change and post-change distributions. This leads to a flexible procedure suitable for both parametric and nonparametric scenarios. We prove non-asymptotic bounds on the average running length of the procedure and its expected detection delay. The efficiency of the algorithm is illustrated with numerical experiments on synthetic and real-world data sets.

LGDec 25, 2025
Approximation Capabilities of Feedforward Neural Networks with GELU Activations

Konstantin Yakovlev, Nikita Puchkin

We derive an approximation error bound that holds simultaneously for a function and all its derivatives up to any prescribed order. The bounds apply to elementary functions, including multivariate polynomials, the exponential function, and the reciprocal function, and are obtained using feedforward neural networks with the Gaussian Error Linear Unit (GELU) activation. In addition, we report the network size, weight magnitudes, and behavior at infinity. Our analysis begins with a constructive approximation of multiplication, where we prove the simultaneous validity of error bounds over domains of increasing size for a given approximator. Leveraging this result, we obtain approximation guarantees for division and the exponential function, ensuring that all higher-order derivatives of the resulting approximators remain globally bounded.

NADec 29, 2025
Simultaneous Approximation of the Score Function and Its Derivatives by Deep Neural Networks

Konstantin Yakovlev, Nikita Puchkin

We present a theory for simultaneous approximation of the score function and its derivatives, enabling the handling of data distributions with low-dimensional structure and unbounded support. Our approximation error bounds match those in the literature while relying on assumptions that relax the usual bounded support requirement. Crucially, our bounds are free from the curse of dimensionality. Moreover, we establish approximation guarantees for derivatives of any prescribed order, extending beyond the commonly considered first-order setting.

STDec 30, 2025
Implicit score matching meets denoising score matching: improved rates of convergence and log-density Hessian estimation

Konstantin Yakovlev, Anna Markovich, Nikita Puchkin

We study the problem of estimating the score function using both implicit score matching and denoising score matching. Assuming that the data distribution exhibiting a low-dimensional structure, we prove that implicit score matching is able not only to adapt to the intrinsic dimension, but also to achieve the same rates of convergence as denoising score matching in terms of the sample size. Furthermore, we demonstrate that both methods allow us to estimate log-density Hessians without the curse of dimensionality by simple differentiation. This justifies convergence of ODE-based samplers for generative diffusion models. Our approach is based on Gagliardo-Nirenberg-type inequalities relating weighted $L^2$-norms of smooth functions and their derivatives.

LGFeb 19, 2025
Generalization error bound for denoising score matching under relaxed manifold assumption

Konstantin Yakovlev, Nikita Puchkin

We examine theoretical properties of the denoising score matching estimate. We model the density of observations with a nonparametric Gaussian mixture. We significantly relax the standard manifold assumption allowing the samples step away from the manifold. At the same time, we are still able to leverage a nice distribution structure. We derive non-asymptotic bounds on the approximation and generalization errors of the denoising score matching estimate. The rates of convergence are determined by the intrinsic dimension. Furthermore, our bounds remain valid even if we allow the ambient dimension grow polynomially with the sample size.

LGAug 26, 2024
Score-based change point detection via tracking the best of infinitely many experts

Anna Markovich, Nikita Puchkin

We propose an algorithm for nonparametric online change point detection based on sequential score function estimation and the tracking the best expert approach. The core of the procedure is a version of the fixed share forecaster tailored to the case of infinite number of experts and quadratic loss functions. The algorithm shows promising results in numerical experiments on artificial and real-world data sets. Its performance is supported by rigorous high-probability bounds describing behaviour of the test statistic in the pre-change and post-change regimes.

STFeb 21, 2025
Dimension-free bounds in high-dimensional linear regression via error-in-operator approach

Fedor Noskov, Nikita Puchkin, Vladimir Spokoiny

We consider a problem of high-dimensional linear regression with random design. We suggest a novel approach referred to as error-in-operator which does not estimate the design covariance $Σ$ directly but incorporates it into empirical risk minimization. We provide an expansion of the excess prediction risk and derive non-asymptotic dimension-free bounds on the leading term and the remainder. This helps us to show that auxiliary variables do not increase the effective dimension of the problem, provided that parameters of the procedure are tuned properly. We also discuss computational aspects of our method and illustrate its performance with numerical experiments.

LGAug 10, 2025
Tight Bounds for Schrödinger Potential Estimation in Unpaired Data Translation

Nikita Puchkin, Denis Suchkov, Alexey Naumov et al.

Modern methods of generative modelling and unpaired data translation based on Schrödinger bridges and stochastic optimal control theory aim to transform an initial density to a target one in an optimal way. In the present paper, we assume that we only have access to i.i.d. samples from initial and final distributions. This makes our setup suitable for both generative modelling and unpaired data translation. Relying on the stochastic optimal control approach, we choose an Ornstein-Uhlenbeck process as the reference one and estimate the corresponding Schrödinger potential. Introducing a risk function as the Kullback-Leibler divergence between couplings, we derive tight bounds on generalization ability of an empirical risk minimizer in a class of Schrödinger potentials including Gaussian mixtures. Thanks to the mixing properties of the Ornstein-Uhlenbeck process, we almost achieve fast rates of convergence up to some logarithmic factors in favourable scenarios. We also illustrate performance of the suggested approach with numerical experiments.

LGJun 3, 2025
Sample complexity of Schrödinger potential estimation

Nikita Puchkin, Iurii Pustovalov, Yuri Sapronov et al.

We address the problem of Schrödinger potential estimation, which plays a crucial role in modern generative modelling approaches based on Schrödinger bridges and stochastic optimal control for SDEs. Given a simple prior diffusion process, these methods search for a path between two given distributions $ρ_0$ and $ρ_T^*$ requiring minimal efforts. The optimal drift in this case can be expressed through a Schrödinger potential. In the present paper, we study generalization ability of an empirical Kullback-Leibler (KL) risk minimizer over a class of admissible log-potentials aimed at fitting the marginal distribution at time $T$. Under reasonable assumptions on the target distribution $ρ_T^*$ and the prior process, we derive a non-asymptotic high-probability upper bound on the KL-divergence between $ρ_T^*$ and the terminal density corresponding to the estimated log-potential. In particular, we show that the excess KL-risk may decrease as fast as $O(\log^2 n / n)$ when the sample size $n$ tends to infinity even if both $ρ_0$ and $ρ_T^*$ have unbounded supports.

LGJan 31, 2021
Exponential Savings in Agnostic Active Learning through Abstention

Nikita Puchkin, Nikita Zhivotovskiy

We show that in pool-based active classification without assumptions on the underlying distribution, if the learner is given the power to abstain from some predictions by paying the price marginally smaller than the average loss $1/2$ of a random guess, exponential savings in the number of label requests are possible whenever they are possible in the corresponding realizable problem. We extend this result to provide a necessary and sufficient condition for exponential savings in pool-based active classification under the model misspecification.

STJan 30, 2021
Rates of convergence for density estimation with generative adversarial networks

Nikita Puchkin, Sergey Samsonov, Denis Belomestny et al.

In this work we undertake a thorough study of the non-asymptotic properties of the vanilla generative adversarial networks (GANs). We prove an oracle inequality for the Jensen-Shannon (JS) divergence between the underlying density $\mathsf{p}^*$ and the GAN estimate with a significantly better statistical error term compared to the previously known results. The advantage of our bound becomes clear in application to nonparametric density estimation. We show that the JS-divergence between the GAN estimate and $\mathsf{p}^*$ decays as fast as $(\log{n}/n)^{2β/(2β+ d)}$, where $n$ is the sample size and $β$ determines the smoothness of $\mathsf{p}^*$. This rate of convergence coincides (up to logarithmic factors) with minimax optimal for the considered class of densities.

STJun 12, 2019
Structure-adaptive manifold estimation

Nikita Puchkin, Vladimir Spokoiny

We consider a problem of manifold estimation from noisy observations. Many manifold learning procedures locally approximate a manifold by a weighted average over a small neighborhood. However, in the presence of large noise, the assigned weights become so corrupted that the averaged estimate shows very poor performance. We suggest a structure-adaptive procedure, which simultaneously reconstructs a smooth manifold and estimates projections of the point cloud onto this manifold. The proposed approach iteratively refines the weights on each step, using the structural information obtained at previous steps. After several iterations, we obtain nearly "oracle" weights, so that the final estimates are nearly efficient even in the presence of relatively large noise. In our theoretical study, we establish tight lower and upper bounds proving asymptotic optimality of the method for manifold estimation under the Hausdorff loss, provided that the noise degrades to zero fast enough.

MLApr 8, 2018
An adaptive multiclass nearest neighbor classifier

Nikita Puchkin, Vladimir Spokoiny

We consider a problem of multiclass classification, where the training sample $S_n = \{(X_i, Y_i)\}_{i=1}^n$ is generated from the model $\mathbb P(Y = m | X = x) = η_m(x)$, $1 \leq m \leq M$, and $η_1(x), \dots, η_M(x)$ are unknown $α$-Holder continuous functions.Given a test point $X$, our goal is to predict its label. A widely used $\mathsf k$-nearest-neighbors classifier constructs estimates of $η_1(X), \dots, η_M(X)$ and uses a plug-in rule for the prediction. However, it requires a proper choice of the smoothing parameter $\mathsf k$, which may become tricky in some situations. In our solution, we fix several integers $n_1, \dots, n_K$, compute corresponding $n_k$-nearest-neighbor estimates for each $m$ and each $n_k$ and apply an aggregation procedure. We study an algorithm, which constructs a convex combination of these estimates such that the aggregated estimate behaves approximately as well as an oracle choice. We also provide a non-asymptotic analysis of the procedure, prove its adaptation to the unknown smoothness parameter $α$ and to the margin and establish rates of convergence under mild assumptions.