Guy Kornowski

OC
h-index7
16papers
207citations
Novelty53%
AI Score41

16 Papers

LGFeb 16, 2023
Deterministic Nonsmooth Nonconvex Optimization

Michael I. Jordan, Guy Kornowski, Tianyi Lin et al.

We study the complexity of optimizing nonsmooth nonconvex Lipschitz functions by producing $(δ,ε)$-stationary points. Several recent works have presented randomized algorithms that produce such points using $\tilde O(δ^{-1}ε^{-3})$ first-order oracle calls, independent of the dimension $d$. It has been an open problem as to whether a similar result can be obtained via a deterministic algorithm. We resolve this open problem, showing that randomization is necessary to obtain a dimension-free rate. In particular, we prove a lower bound of $Ω(d)$ for any deterministic algorithm. Moreover, we show that unlike smooth or convex optimization, access to function values is required for any deterministic algorithm to halt within any finite time. On the other hand, we prove that if the function is even slightly smooth, then the dimension-free rate of $\tilde O(δ^{-1}ε^{-3})$ can be obtained by a deterministic algorithm with merely a logarithmic dependence on the smoothness parameter. Motivated by these findings, we turn to study the complexity of deterministically smoothing Lipschitz functions. Though there are efficient black-box randomized smoothings, we start by showing that no such deterministic procedure can smooth functions in a meaningful manner, resolving an open question. We then bypass this impossibility result for the structured case of ReLU neural networks. To that end, in a practical white-box setting in which the optimizer is granted access to the network's architecture, we propose a simple, dimension-free, deterministic smoothing that provably preserves $(δ,ε)$-stationary points. Our method applies to a variety of architectures of arbitrary depth, including ResNets and ConvNets. Combined with our algorithm, this yields the first deterministic dimension-free algorithm for optimizing ReLU networks, circumventing our lower bound.

OCJul 10, 2023
An Algorithm with Optimal Dimension-Dependence for Zero-Order Nonsmooth Nonconvex Stochastic Optimization

Guy Kornowski, Ohad Shamir

We study the complexity of producing $(δ,ε)$-stationary points of Lipschitz objectives which are possibly neither smooth nor convex, using only noisy function evaluations. Recent works proposed several stochastic zero-order algorithms that solve this task, all of which suffer from a dimension-dependence of $Ω(d^{3/2})$ where $d$ is the dimension of the problem, which was conjectured to be optimal. We refute this conjecture by providing a faster algorithm that has complexity $O(dδ^{-1}ε^{-3})$, which is optimal (up to numerical constants) with respect to $d$ and also optimal with respect to the accuracy parameters $δ,ε$, thus solving an open question due to Lin et al. (NeurIPS'22). Moreover, the convergence rate achieved by our algorithm is also optimal for smooth objectives, proving that in the nonconvex stochastic zero-order setting, nonsmooth optimization is as easy as smooth optimization. We provide algorithms that achieve the aforementioned convergence rate in expectation as well as with high probability. Our analysis is based on a simple yet powerful lemma regarding the Goldstein-subdifferential set, which allows utilizing recent advancements in first-order nonsmooth nonconvex optimization.

OCSep 21, 2022
On the Complexity of Finding Small Subgradients in Nonsmooth Optimization

Guy Kornowski, Ohad Shamir

We study the oracle complexity of producing $(δ,ε)$-stationary points of Lipschitz functions, in the sense proposed by Zhang et al. [2020]. While there exist dimension-free randomized algorithms for producing such points within $\widetilde{O}(1/δε^3)$ first-order oracle calls, we show that no dimension-free rate can be achieved by a deterministic algorithm. On the other hand, we point out that this rate can be derandomized for smooth functions with merely a logarithmic dependence on the smoothness parameter. Moreover, we establish several lower bounds for this task which hold for any randomized algorithm, with or without convexity. Finally, we show how the convergence rate of finding $(δ,ε)$-stationary points can be improved in case the function is convex, a setting which we motivate by proving that in general no finite time algorithm can produce points with small subgradients even for convex functions.

LGFeb 12, 2023
Near-optimal learning with average Hölder smoothness

Steve Hanneke, Aryeh Kontorovich, Guy Kornowski

We generalize the notion of average Lipschitz smoothness proposed by Ashlagi et al. (COLT 2021) by extending it to Hölder smoothness. This measure of the "effective smoothness" of a function is sensitive to the underlying distribution and can be dramatically smaller than its classic "worst-case" Hölder constant. We consider both the realizable and the agnostic (noisy) regression settings, proving upper and lower risk bounds in terms of the average Hölder smoothness; these rates improve upon both previously known rates even in the special case of average Lipschitz smoothness. Moreover, our lower bound is tight in the realizable setting up to log factors, thus we establish the minimax rate. From an algorithmic perspective, since our notion of average smoothness is defined with respect to the unknown underlying distribution, the learner does not have an explicit representation of the function class, hence is unable to execute ERM. Nevertheless, we provide distinct learning algorithms that achieve both (nearly) optimal learning rates. Our results hold in any totally bounded metric space, and are stated in terms of its intrinsic geometry. Overall, our results show that the classic worst-case notion of Hölder smoothness can be essentially replaced by its average, yielding considerably sharper guarantees.

LGSep 29, 2023
Efficient Agnostic Learning with Average Smoothness

Steve Hanneke, Aryeh Kontorovich, Guy Kornowski

We study distribution-free nonparametric regression following a notion of average smoothness initiated by Ashlagi et al. (2021), which measures the "effective" smoothness of a function with respect to an arbitrary unknown underlying distribution. While the recent work of Hanneke et al. (2023) established tight uniform convergence bounds for average-smooth functions in the realizable case and provided a computationally efficient realizable learning algorithm, both of these results currently lack analogs in the general agnostic (i.e. noisy) case. In this work, we fully close these gaps. First, we provide a distribution-free uniform convergence bound for average-smoothness classes in the agnostic setting. Second, we match the derived sample complexity with a computationally efficient agnostic learning algorithm. Our results, which are stated in terms of the intrinsic geometry of the data and hold over any totally bounded metric space, show that the guarantees recently obtained for realizable learning of average-smooth functions transfer to the agnostic setting. At the heart of our proof, we establish the uniform convergence rate of a function class in terms of its bracketing entropy, which may be of independent interest.

94.6OCApr 15
Gradient Descent's Last Iterate is Often (slightly) Suboptimal

Guy Kornowski, Ohad Shamir

We consider the well-studied setting of minimizing a convex Lipschitz function using either gradient descent (GD) or its stochastic variant (SGD), and examine the last iterate convergence. By now, it is known that standard stepsize choices lead to a last iterate convergence rate of $\log T/\sqrt{T}$ after $T$ steps. A breakthrough result of Jain et al. [2019] recovered the optimal $1/\sqrt{T}$ rate by constructing a non-standard stepsize sequence. However, this sequence requires choosing $T$ in advance, as opposed to common stepsize schedules which apply for any time horizon. Moreover, Jain et al. conjectured that without prior knowledge of $T$, no stepsize sequence can ensure the optimal error for SGD's last iterate, a claim which so far remained unproven. We prove this conjecture, and in fact show that even in the noiseless case of GD, it is impossible to avoid an excess poly-log factor in $T$ when considering an anytime last iterate guarantee. Our proof further suggests that such (slightly) suboptimal stopping times are unavoidably common.

LGSep 29, 2024
Differentially Private Bilevel Optimization

Guy Kornowski

We present differentially private (DP) algorithms for bilevel optimization, a problem class that received significant attention lately in various machine learning applications. These are the first algorithms for such problems under standard DP constraints, and are also the first to avoid Hessian computations which are prohibitive in large-scale settings. Under the well-studied setting in which the upper-level is not necessarily convex and the lower-level problem is strongly-convex, our proposed gradient-based $(ε,δ)$-DP algorithm returns a point with hypergradient norm at most $\widetilde{\mathcal{O}}\left((\sqrt{d_\mathrm{up}}/εn)^{1/2}+(\sqrt{d_\mathrm{low}}/εn)^{1/3}\right)$ where $n$ is the dataset size, and $d_\mathrm{up}/d_\mathrm{low}$ are the upper/lower level dimensions. Our analysis covers constrained and unconstrained problems alike, accounts for mini-batch gradients, and applies to both empirical and population losses. As an application, we specialize our analysis to derive a simple private rule for tuning a regularization hyperparameter.

OCSep 16, 2024
On the Hardness of Meaningful Local Guarantees in Nonsmooth Nonconvex Optimization

Guy Kornowski, Swati Padmanabhan, Ohad Shamir

We study the oracle complexity of nonsmooth nonconvex optimization, with the algorithm assumed to have access only to local function information. It has been shown by Davis, Drusvyatskiy, and Jiang (2023) that for nonsmooth Lipschitz functions satisfying certain regularity and strictness conditions, perturbed gradient descent converges to local minimizers asymptotically. Motivated by this result and by other recent algorithmic advances in nonconvex nonsmooth optimization concerning Goldstein stationarity, we consider the question of obtaining a non-asymptotic rate of convergence to local minima for this problem class. We provide the following negative answer to this question: Local algorithms acting on regular Lipschitz functions cannot, in the worst case, provide meaningful local guarantees in terms of function value in sub-exponential time, even when all near-stationary points are global minima. This sharply contrasts with the smooth setting, for which it is well-known that standard gradient methods can do so in a dimension-independent rate. Our result complements the rich body of work in the theoretical computer science literature that provide hardness results conditional on conjectures such as $\mathsf{P}\neq\mathsf{NP}$ or cryptographic assumptions, in that ours holds unconditional of any such assumptions.

GTDec 9, 2024
The Oracle Complexity of Simplex-based Matrix Games: Linear Separability and Nash Equilibria

Guy Kornowski, Ohad Shamir

We study the problem of solving matrix games of the form $\max_{\mathbf{w}\in\mathcal{W}}\min_{\mathbf{p}\inΔ}\mathbf{p}^{\top}A\mathbf{w}$, where $A$ is some matrix and $Δ$ is the probability simplex. This problem encapsulates canonical tasks such as finding a linear separator and computing Nash equilibria in zero-sum games. However, perhaps surprisingly, its inherent complexity (as formalized in the standard framework of oracle complexity [Nemirovski and Yudin, 1983]) is not well-understood. In this work, we first identify different oracle models which are implicitly used by prior algorithms, amounting to multiplying the matrix $A$ by a vector from either one or both sides. We then prove complexity lower bounds for algorithms under both access models, which in particular imply a separation between them. Specifically, we start by showing that algorithms for linear separability based on one-sided multiplications must require $Ω(γ_A^{-2})$ iterations, where $γ_A$ is the margin, as matched by the Perceptron algorithm. We then prove that accelerated algorithms for this task, which utilize multiplications from both sides, must require $\tildeΩ(γ_{A}^{-2/3})$ iterations, establishing the first oracle complexity barrier for such algorithms. Finally, by adapting our lower bound to $\ell_1$ geometry, we prove that computing an $ε$-approximate Nash equilibrium requires $\tildeΩ(ε^{-2/5})$ iterations, which is an exponential improvement over the previously best-known lower bound due to Hadiji et al. [2024].

LGJun 2, 2025
Trade-offs in Data Memorization via Strong Data Processing Inequalities

Vitaly Feldman, Guy Kornowski, Xin Lyu

Recent research demonstrated that training large language models involves memorization of a significant fraction of training data. Such memorization can lead to privacy violations when training on sensitive user data and thus motivates the study of data memorization's role in learning. In this work, we develop a general approach for proving lower bounds on excess data memorization, that relies on a new connection between strong data processing inequalities and data memorization. We then demonstrate that several simple and natural binary classification problems exhibit a trade-off between the number of samples available to a learning algorithm, and the amount of information about the training data that a learning algorithm needs to memorize to be accurate. In particular, $Ω(d)$ bits of information about the training data need to be memorized when $O(1)$ $d$-dimensional examples are available, which then decays as the number of examples grows at a problem-specific rate. Further, our lower bounds are generally matched (up to logarithmic factors) by simple learning algorithms. We also extend our lower bounds to more general mixture-of-clusters models. Our definitions and results build on the work of Brown et al. (2021) and address several limitations of the lower bounds in their work.

LGFeb 11, 2025
Beyond Benign Overfitting in Nadaraya-Watson Interpolators

Daniel Barzilai, Guy Kornowski, Ohad Shamir

In recent years, there has been much interest in understanding the generalization behavior of interpolating predictors, which overfit on noisy training data. Whereas standard analyses are concerned with whether a method is consistent or not, recent observations have shown that even inconsistent predictors can generalize well. In this work, we revisit the classic interpolating Nadaraya-Watson (NW) estimator (also known as Shepard's method), and study its generalization capabilities through this modern viewpoint. In particular, by varying a single bandwidth-like hyperparameter, we prove the existence of multiple overfitting behaviors, ranging non-monotonically from catastrophic, through benign, to tempered. Our results highlight how even classical interpolating methods can exhibit intricate generalization behaviors. In addition, for the purpose of tuning the hyperparameter, the results suggest that over-estimating the intrinsic dimension of the data is less harmful than under-estimating it. Numerical experiments complement our theory, demonstrating the same phenomena.

OCJun 19, 2024
Open Problem: Anytime Convergence Rate of Gradient Descent

Guy Kornowski, Ohad Shamir

Recent results show that vanilla gradient descent can be accelerated for smooth convex objectives, merely by changing the stepsize sequence. We show that this can lead to surprisingly large errors indefinitely, and therefore ask: Is there any stepsize schedule for gradient descent that accelerates the classic $\mathcal{O}(1/T)$ convergence rate, at \emph{any} stopping time $T$?

OCJun 18, 2024
First-Order Methods for Linearly Constrained Bilevel Optimization

Guy Kornowski, Swati Padmanabhan, Kai Wang et al.

Algorithms for bilevel optimization often encounter Hessian computations, which are prohibitive in high dimensions. While recent works offer first-order methods for unconstrained bilevel problems, the constrained setting remains relatively underexplored. We present first-order linearly constrained optimization methods with finite-time hypergradient stationarity guarantees. For linear equality constraints, we attain $ε$-stationarity in $\widetilde{O}(ε^{-2})$ gradient oracle calls, which is nearly-optimal. For linear inequality constraints, we attain $(δ,ε)$-Goldstein stationarity in $\widetilde{O}(d{δ^{-1} ε^{-3}})$ gradient oracle calls, where $d$ is the upper-level dimension. Finally, we obtain for the linear inequality setting dimension-free rates of $\widetilde{O}({δ^{-1} ε^{-4}})$ oracle complexity under the additional assumption of oracle access to the optimal dual variable. Along the way, we develop new nonsmooth nonconvex optimization methods with inexact oracles. We verify these guarantees with preliminary numerical experiments.

LGMay 24, 2023
From Tempered to Benign Overfitting in ReLU Neural Networks

Guy Kornowski, Gilad Yehudai, Ohad Shamir

Overparameterized neural networks (NNs) are observed to generalize well even when trained to perfectly fit noisy data. This phenomenon motivated a large body of work on "benign overfitting", where interpolating predictors achieve near-optimal performance. Recently, it was conjectured and empirically observed that the behavior of NNs is often better described as "tempered overfitting", where the performance is non-optimal yet also non-trivial, and degrades as a function of the noise level. However, a theoretical justification of this claim for non-linear NNs has been lacking so far. In this work, we provide several results that aim at bridging these complementing views. We study a simple classification setting with 2-layer ReLU NNs, and prove that under various assumptions, the type of overfitting transitions from tempered in the extreme case of one-dimensional data, to benign in high dimensions. Thus, we show that the input dimension has a crucial role on the type of overfitting in this setting, which we also validate empirically for intermediate dimensions. Overall, our results shed light on the intricate connections between the dimension, sample size, architecture and training algorithm on the one hand, and the type of resulting overfitting on the other hand.

OCApr 14, 2021
Oracle Complexity in Nonsmooth Nonconvex Optimization

Guy Kornowski, Ohad Shamir

It is well-known that given a smooth, bounded-from-below, and possibly nonconvex function, standard gradient-based methods can find $ε$-stationary points (with gradient norm less than $ε$) in $\mathcal{O}(1/ε^2)$ iterations. However, many important nonconvex optimization problems, such as those associated with training modern neural networks, are inherently not smooth, making these results inapplicable. In this paper, we study nonsmooth nonconvex optimization from an oracle complexity viewpoint, where the algorithm is assumed to be given access only to local information about the function at various points. We provide two main results: First, we consider the problem of getting near $ε$-stationary points. This is perhaps the most natural relaxation of finding $ε$-stationary points, which is impossible in the nonsmooth nonconvex case. We prove that this relaxed goal cannot be achieved efficiently, for any distance and $ε$ smaller than some constants. Our second result deals with the possibility of tackling nonsmooth nonconvex optimization by reduction to smooth optimization: Namely, applying smooth optimization methods on a smooth approximation of the objective function. For this approach, we prove under a mild assumption an inherent trade-off between oracle complexity and smoothness: On the one hand, smoothing a nonsmooth nonconvex function can be done very efficiently (e.g., by randomized smoothing), but with dimension-dependent factors in the smoothness parameter, which can strongly affect iteration complexity when plugging into standard smooth optimization methods. On the other hand, these dimension factors can be eliminated with suitable smoothing methods, but only by making the oracle complexity of the smoothing process exponentially large.

OCOct 13, 2020
High-Order Oracle Complexity of Smooth and Strongly Convex Optimization

Guy Kornowski, Ohad Shamir

In this note, we consider the complexity of optimizing a highly smooth (Lipschitz $k$-th order derivative) and strongly convex function, via calls to a $k$-th order oracle which returns the value and first $k$ derivatives of the function at a given point, and where the dimension is unrestricted. Extending the techniques introduced in Arjevani et al. [2019], we prove that the worst-case oracle complexity for any fixed $k$ to optimize the function up to accuracy $ε$ is on the order of $\left(\frac{μ_k D^{k-1}}λ\right)^{\frac{2}{3k+1}}+\log\log\left(\frac{1}ε\right)$ (in sufficiently high dimension, and up to log factors independent of $ε$), where $μ_k$ is the Lipschitz constant of the $k$-th derivative, $D$ is the initial distance to the optimum, and $λ$ is the strong convexity parameter.