Mark Sellke

LG
h-index 70
23 papers
793 citations
Novelty 69%
AI Score 61

23 Papers

SY Nov 23, 2021
Exact minimum number of bits to stabilize a linear system

Victoria Kostina, Yuval Peres, Gireeja Ranade et al.

We consider an unstable scalar linear stochastic system, $X_{n+1}=a X_n + Z_n - U_n$, where $a \geq 1$ is the system gain, $Z_n$'s are independent random variables with bounded $α$-th moments, and $U_n$'s are the control actions that are chosen by a controller who receives a single element of a finite set $\{1, \ldots, M\}$ as its only information about system state $X_i$. We show new proofs that $M > a$ is necessary and sufficient for $β$-moment stability, for any $β< α$. Our achievable scheme is a uniform quantizer of the zoom-in / zoom-out type that codes over multiple time instants for data rate efficiency; the controller uses its memory of the past to correctly interpret the received bits. We analyze its performance using probabilistic arguments. We show a simple proof of a matching converse using information-theoretic techniques. Our results generalize to vector systems, to systems with dependent Gaussian noise, and to the scenario in which a small fraction of transmitted messages is lost.
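To make the role of the condition $M > a$ concrete, here is a minimal sketch of a simplified special case. It assumes the noise is bounded by `noise_bound`, so a fixed uniform quantizer suffices and no zoom-in / zoom-out or coding across time is needed; this is not the scheme described in the abstract, which handles unbounded noise. The function name and all parameters are illustrative.

```python
import random


def simulate_quantized_control(a=2.5, M=3, noise_bound=1.0, steps=10_000, seed=0):
    """Simulate X_{n+1} = a*X_n + Z_n - U_n with an M-level uniform quantizer.

    Assumes |Z_n| <= noise_bound (a simplification; the abstract allows
    unbounded noise). Returns (L, largest |X_n| observed).
    """
    if M <= a:
        raise ValueError("this sketch needs M > a")
    rng = random.Random(seed)
    # Invariant interval [-L, L]: we need a*L/M + noise_bound <= L.
    L = noise_bound / (1.0 - a / M)
    bin_width = 2.0 * L / M
    x, max_abs = 0.0, 0.0
    for _ in range(steps):
        # Encoder: index of the bin containing x (one of M symbols).
        idx = min(int((x + L) / bin_width), M - 1)
        # Controller: cancel a times the centre of the reported bin.
        centre = -L + (idx + 0.5) * bin_width
        u = a * centre
        z = rng.uniform(-noise_bound, noise_bound)
        x = a * x + z - u
        max_abs = max(max_abs, abs(x))
    return L, max_abs
```

If $|X_n| \le L$, the quantization error is at most $L/M$, so $|X_{n+1}| \le aL/M + \text{noise\_bound} = L$; the interval is invariant exactly when $M > a$ allows a finite $L$.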

SY May 15, 2018
Stabilizing a system with an unbounded random gain using only a finite number of bits

Victoria Kostina, Yuval Peres, Gireeja Ranade et al.

We study the stabilization of an unpredictable linear control system where the controller must act based on a rate-limited observation of the state. More precisely, we consider the system $X_{n+1} = A_n X_n + W_n - U_n$, where the $A_n$'s are drawn independently at random at each time $n$ from a known distribution with unbounded support, and where the controller receives at most $R$ bits about the system state at each time from an encoder. We provide a time-varying achievable strategy to stabilize the system in a second-moment sense with fixed, finite $R$. While our previous result provided a strategy to stabilize this system using a variable-rate code, this work provides an achievable strategy using a fixed-rate code. The strategy we employ to achieve this is time-varying and takes different actions depending on the value of the state. It proceeds in two modes: a normal mode (or zoom-in), where the realization of $A_n$ is typical, and an emergency mode (or zoom-out), where the realization of $A_n$ is exceptionally large.

QUANT-PH Jun 10, 2022
When Does Adaptivity Help for Quantum State Learning?

Sitan Chen, Brice Huang, Jerry Li et al.

We consider the classic question of state tomography: given copies of an unknown quantum state $ρ\in\mathbb{C}^{d\times d}$, output $\widehat{ρ}$ which is close to $ρ$ in some sense, e.g. trace distance or fidelity. When one is allowed to make coherent measurements entangled across all copies, $Θ(d^2/ε^2)$ copies are necessary and sufficient to get trace distance $ε$. Unfortunately, the protocols achieving this rate incur large quantum memory overheads that preclude implementation on near-term devices. On the other hand, the best known protocol using incoherent (single-copy) measurements uses $O(d^3/ε^2)$ copies, and multiple papers have posed it as an open question to understand whether or not this rate is tight. In this work, we fully resolve this question, by showing that any protocol using incoherent measurements, even if they are chosen adaptively, requires $Ω(d^3/ε^2)$ copies, matching the best known upper bound. We do so by a new proof technique which directly bounds the "tilt" of the posterior distribution after measurements, which yields a surprisingly short proof of our lower bound, and which we believe may be of independent interest. While this implies that adaptivity does not help for tomography with respect to trace distance, we show that it actually does help for tomography with respect to infidelity. We give an adaptive algorithm that outputs a state which is $γ$-close in infidelity to $ρ$ using only $\tilde{O}(d^3/γ)$ copies, which is optimal for incoherent measurements. In contrast, it is known that any nonadaptive algorithm requires $Ω(d^3/γ^2)$ copies. While it is folklore that in $2$ dimensions, one can achieve a scaling of $O(1/γ)$, to the best of our knowledge, our algorithm is the first to achieve the optimal rate in all dimensions.

LG Jun 3, 2023
On Size-Independent Sample Complexity of ReLU Networks

Mark Sellke

We study the sample complexity of learning ReLU neural networks from the point of view of generalization. Given norm constraints on the weight matrices, a common approach is to estimate the Rademacher complexity of the associated function class. Previously Golowich-Rakhlin-Shamir (2020) obtained a bound independent of the network size (scaling with a product of Frobenius norms) except for a factor of the square-root depth. We give a refinement which often has no explicit depth-dependence at all.

GT Jun 3, 2023
Incentivizing Exploration with Linear Contexts and Combinatorial Actions

Mark Sellke

We advance the study of incentivized bandit exploration, in which arm choices are viewed as recommendations and are required to be Bayesian incentive compatible. Recent work has shown under certain independence assumptions that after collecting enough initial samples, the popular Thompson sampling algorithm becomes incentive compatible. We give an analog of this result for linear bandits, where the independence of the prior is replaced by a natural convexity condition. This opens up the possibility of efficient and regret-optimal incentivized exploration in high-dimensional action spaces. In the semibandit model, we also improve the sample complexity for the pre-Thompson sampling phase of initial data collection.

LG Jun 3, 2023
Asymptotically Optimal Pure Exploration for Infinite-Armed Bandits

Xiao-Yue Gong, Mark Sellke

We study pure exploration with infinitely many bandit arms generated i.i.d. from an unknown distribution. Our goal is to efficiently select a single high quality arm whose average reward is, with probability $1-δ$, within $\varepsilon$ of being among the top $η$-fraction of arms; this is a natural adaptation of the classical PAC guarantee for infinite action sets. We consider both the fixed confidence and fixed budget settings, aiming respectively for minimal expected and fixed sample complexity. For fixed confidence, we give an algorithm with expected sample complexity $O\left(\frac{\log (1/η)\log (1/δ)}{η\varepsilon^2}\right)$. This is optimal except for the $\log (1/η)$ factor, and the $δ$-dependence closes a quadratic gap in the literature. For fixed budget, we show the asymptotically optimal sample complexity as $δ\to 0$ is $c^{-1}\log(1/δ)\big(\log\log(1/δ)\big)^2$ to leading order. Equivalently, the optimal failure probability given exactly $N$ samples decays as $\exp\big(-cN/\log^2 N\big)$, up to a factor $1\pm o_N(1)$ inside the exponent. The constant $c$ depends explicitly on the problem parameters (including the unknown arm distribution) through a certain Fisher information distance. Even the strictly super-linear dependence on $\log(1/δ)$ was not known and resolves a question of Grossman and Moshkovitz (FOCS 2016, SIAM Journal on Computing 2020).

97.6 PR May 12
A Counterexample to the Gaussian Completely Monotone Conjecture

Yuzhou Gu, Mark Sellke

We provide an explicit probability measure on $\mathbb{R}$ for which the fifth time derivative of the entropy along the heat flow is positive at some time. This disproves the Gaussian completely monotone (GCM) conjecture (Cheng-Geng '15) and therefore also the Gaussian optimality conjecture (McKean '66) and the entropy power conjecture (Toscani '15). Our proof also implies the existence of a log-concave probability measure on $\mathbb{R}$ for which the GCM conjecture fails at some order. The explicit counterexample was found by GPT-5.5 Pro.

90.2 CC Mar 31
Stable algorithms cannot reliably find isolated perceptron solutions

Shuyang Gong, Brice Huang, Shuangping Li et al.

We study the binary perceptron, a random constraint satisfaction problem that asks to find a Boolean vector in the intersection of independently chosen random halfspaces. A striking feature of this model is that at every positive constraint density, it is expected that a $1-o_N(1)$ fraction of solutions are strongly isolated, i.e. separated from all others by Hamming distance $Ω(N)$. At the same time, efficient algorithms are known to find solutions at certain positive constraint densities. This raises a natural question: can any isolated solution be algorithmically visible? We answer this in the negative: no algorithm whose output is stable under a tiny Gaussian resampling of the disorder can reliably locate isolated solutions. We show that any stable algorithm has success probability at most $\frac{3\sqrt{17}-9}{4}+o_N(1)\leq 0.84233$. Furthermore, every stable algorithm that finds a solution with probability $1-o_N(1)$ finds an isolated solution with probability $o_N(1)$. The class of stable algorithms we consider includes degree-$D$ polynomials up to $D\leq o(N/\log N)$; under the low-degree heuristic [Hopkins, 2018], this suggests that locating strongly isolated solutions requires running time $\exp(\widetilde{Θ}(N))$. Our proof does not use the overlap gap property. Instead, we show via Pitt's correlation inequality that after a random perturbation of the disorder, the number of solutions located close to a pre-existing isolated solution cannot concentrate at $1$.

78.7 DIS-NN Mar 31
Strong Low Degree Hardness for Stable Local Optima in Spin Glasses

Brice Huang, Mark Sellke

It is a folklore belief in the theory of spin glasses and disordered systems that out-of-equilibrium dynamics fail to find stable local optima exhibiting e.g. local strict convexity on physical time-scales. In the context of the Sherrington--Kirkpatrick spin glass, Behrens-Arpino-Kivva-Zdeborová and Minzer-Sah-Sawhney have recently conjectured that this obstruction may be inherent to all efficient algorithms, despite the existence of exponentially many such optima throughout the landscape. We prove this search problem exhibits strong low degree hardness for polynomial algorithms of degree $D\leq o(N)$: any such algorithm has probability $o(1)$ to output a stable local optimum. To the best of our knowledge, this is the first result to prove that even constant-degree polynomials have probability $o(1)$ to solve a random search problem without planted structure. To prove this, we develop a general-purpose enhancement of the ensemble overlap gap property, and as a byproduct improve previous results on spin glass optimization, maximum independent set, random $k$-SAT, and the Ising perceptron to strong low degree hardness. Finally for spherical spin glasses with no external field, we prove that Langevin dynamics does not find stable local optima within dimension-free time.

ML Feb 2, 2024
No Free Prune: Information-Theoretic Barriers to Pruning at Initialization

Tanishq Kumar, Kevin Luo, Mark Sellke

The existence of "lottery tickets" (arXiv:1803.03635) at or near initialization raises the tantalizing question of whether large models are necessary in deep learning, or whether sparse networks can be quickly identified and trained without ever training the dense models that contain them. However, efforts to find these sparse subnetworks without training the dense model ("pruning at initialization") have been broadly unsuccessful (arXiv:2009.08576). We put forward a theoretical explanation for this, based on the model's effective parameter count, $p_\text{eff}$, given by the sum of the number of non-zero weights in the final network and the mutual information between the sparsity mask and the data. We show the Law of Robustness of arXiv:2105.12806 extends to sparse networks with the usual parameter count replaced by $p_\text{eff}$, meaning a sparse neural network which robustly interpolates noisy data requires a heavily data-dependent mask. We posit that pruning during and after training outputs masks with higher mutual information than those produced by pruning at initialization. Thus two networks may have the same sparsities, but differ in effective parameter count based on how they were trained. This suggests that pruning near initialization may be infeasible and explains why lottery tickets exist, but cannot be found fast (i.e. without training the full network). Experiments on neural networks confirm that information gained during training may indeed affect model capacity.

ST Dec 11, 2025
On Learning-Curve Monotonicity for Maximum Likelihood Estimators

Mark Sellke, Steven Yin

The property of learning-curve monotonicity, highlighted in a recent series of work by Loog, Mey and Viering, describes algorithms which only improve in average performance given more data, for any underlying data distribution within a given family. We establish the first nontrivial monotonicity guarantees for the maximum likelihood estimator in a variety of well-specified parametric settings. For sequential prediction with log loss, we show monotonicity (in fact complete monotonicity) of the forward KL divergence for Gaussian vectors with unknown covariance and either known or unknown mean, as well as for Gamma variables with unknown scale parameter. The Gaussian setting was explicitly highlighted as open in the aforementioned works, even in dimension 1. Finally we observe that for reverse KL divergence, a folklore trick yields monotonicity for very general exponential families. All results in this paper were derived by variants of GPT-5.2 Pro. Humans did not provide any proof strategies or intermediate arguments, but only prompted the model to continue developing additional results, and verified and transcribed its proofs.

CL Nov 20, 2025
Early science acceleration experiments with GPT-5

Sébastien Bubeck, Christian Coester, Ronen Eldan et al.

AI models like GPT-5 are an increasingly valuable tool for scientists, but many remain unaware of the capabilities of frontier AI. We present a collection of short case studies in which GPT-5 produced new, concrete steps in ongoing research across mathematics, physics, astronomy, computer science, biology, and materials science. In these examples, the authors highlight how AI accelerated their work, and where it fell short; where expert time was saved, and where human input was still key. We document the interactions of the human authors with GPT-5, as guiding examples of fruitful collaboration with AI. Of note, this paper includes four new results in mathematics (carefully verified by the human authors), underscoring how GPT-5 can help human mathematicians settle previously unsolved problems. These contributions are modest in scope but profound in implication, given the rate at which frontier AI is progressing.

GT Jun 2, 2025
Geometry Meets Incentives: Sample-Efficient Incentivized Exploration with Linear Contexts

Benjamin Schiffer, Mark Sellke

In the incentivized exploration model, a principal aims to explore and learn over time by interacting with a sequence of self-interested agents. It has been recently understood that the main challenge in designing incentive-compatible algorithms for this problem is to gather a moderate amount of initial data, after which one can obtain near-optimal regret via posterior sampling. With high-dimensional contexts, however, this initial exploration phase requires exponential sample complexity in some cases, which prevents efficient learning unless initial data can be acquired exogenously. We show that these barriers to exploration disappear under mild geometric conditions on the set of available actions, in which case incentive-compatibility does not preclude regret-optimality. Namely, we consider the linear bandit model with actions in the Euclidean unit ball, and give an incentive-compatible exploration algorithm with sample complexity that scales polynomially with the dimension and other parameters.

LG Feb 19, 2022
The Pareto Frontier of Instance-Dependent Guarantees in Multi-Player Multi-Armed Bandits with no Communication

Allen Liu, Mark Sellke

We study the stochastic multi-player multi-armed bandit problem. In this problem, $m$ players cooperate to maximize their total reward from $K > m$ arms. However, the players cannot communicate and are penalized (e.g. receive no reward) if they pull the same arm at the same time. We ask whether it is possible to obtain optimal instance-dependent regret $\tilde{O}(1/Δ)$ where $Δ$ is the gap between the $m$-th and $(m+1)$-st best arms. Such guarantees were recently achieved in a model allowing the players to implicitly communicate through intentional collisions. Surprisingly, we show that with no communication at all, such guarantees are not achievable. In fact, obtaining the optimal $\tilde{O}(1/Δ)$ regret for some values of $Δ$ necessarily implies strictly sub-optimal regret in other regimes. Our main result is a complete characterization of the Pareto optimal instance-dependent trade-offs that are possible with no communication. Our algorithm generalizes that of Bubeck, Budzinski, and the second author. As there, our algorithm succeeds even when feedback upon collision can be corrupted by an adaptive adversary, thanks to a strong no-collision property. Our lower bound is based on topological obstructions at multiple scales and is completely new.

LG Jun 18, 2021
Iterative Feature Matching: Toward Provable Domain Generalization with Logarithmic Environments

Yining Chen, Elan Rosenfeld, Mark Sellke et al.

Domain generalization aims at performing well on unseen test environments with data from a limited number of training environments. Despite a proliferation of proposed algorithms for this task, assessing their performance both theoretically and empirically is still very challenging. Distributional matching algorithms such as (Conditional) Domain Adversarial Networks [Ganin et al., 2016; Long et al., 2018] are popular and enjoy empirical success, but they lack formal guarantees. Other approaches such as Invariant Risk Minimization (IRM) require a prohibitively large number of training environments -- linear in the dimension of the spurious feature space $d_s$ -- even on simple data models like the one proposed by [Rosenfeld et al., 2021]. Under a variant of this model, we show that both ERM and IRM cannot generalize with $o(d_s)$ environments. We then present an iterative feature matching algorithm that is guaranteed with high probability to yield a predictor that generalizes after seeing only $O(\log d_s)$ environments. Our results provide the first theoretical justification for a family of distribution-matching algorithms widely used in practice under a concrete nontrivial data model.

LG May 26, 2021
A Universal Law of Robustness via Isoperimetry

Sébastien Bubeck, Mark Sellke

Classically, data interpolation with a parametrized model class is possible as long as the number of parameters is larger than the number of equations to be satisfied. A puzzling phenomenon in deep learning is that models are trained with many more parameters than what this classical theory would suggest. We propose a partial theoretical explanation for this phenomenon. We prove that for a broad class of data distributions and model classes, overparametrization is necessary if one wants to interpolate the data smoothly. Namely, we show that smooth interpolation requires $d$ times more parameters than mere interpolation, where $d$ is the ambient data dimension. We prove this universal law of robustness for any smoothly parametrized function class with polynomial size weights, and any covariate distribution verifying isoperimetry. In the case of two-layer neural networks and Gaussian covariates, this law was conjectured in prior work by Bubeck, Li and Nagaraj. We also give an interpretation of our result as an improved generalization bound for model classes consisting of smooth functions.

CG Nov 23, 2020
Metric Transforms and Low Rank Matrices via Representation Theory of the Real Hyperrectangle

Josh Alman, Timothy Chu, Gary Miller et al.

In this paper, we develop a new technique which we call representation theory of the real hyperrectangle, which describes how to compute the eigenvectors and eigenvalues of certain matrices arising from hyperrectangles. We show that these matrices arise naturally when analyzing a number of different algorithmic tasks such as kernel methods, neural network training, natural language processing, and the design of algorithms using the polynomial method. We then use our new technique along with these connections to prove several new structural results in these areas, including: (i) A function is a positive definite Manhattan kernel if and only if it is a completely monotone function. These kernels are widely used across machine learning; one example is the Laplace kernel which is widely used in machine learning for chemistry. (ii) A function transforms Manhattan distances to Manhattan distances if and only if it is a Bernstein function. This completes the theory of Manhattan to Manhattan metric transforms initiated by Assouad in 1980. (iii) A function applied entry-wise to any square matrix of rank $r$ always results in a matrix of rank $< 2^{r-1}$ if and only if it is a polynomial of sufficiently low degree. This gives a converse to a key lemma used by the polynomial method in algorithm design. Our work includes a sophisticated combination of techniques from different fields, including metric embeddings, the polynomial method, and group representation theory.
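As a small numerical illustration of the first result, the sketch below builds the Gram matrix of the Laplace kernel $e^{-\|x-y\|_1}$ (the function $t \mapsto e^{-t}$ is completely monotone) and checks positive definiteness through a hand-written Cholesky factorization. The function names, the `scale` parameter, and the choice of a Cholesky test are assumptions made for this example; none of it comes from the paper.

```python
import math


def laplace_manhattan_gram(points, scale=1.0):
    """Gram matrix K[i][j] = exp(-scale * ||x_i - x_j||_1)."""
    return [
        [math.exp(-scale * sum(abs(p - q) for p, q in zip(x, y))) for y in points]
        for x in points
    ]


def cholesky(matrix):
    """Return lower-triangular `lower` with lower * lower^T == matrix.

    Raises ValueError when a pivot is not positive, i.e. when the matrix
    is not (numerically) positive definite.
    """
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                lower[i][j] = math.sqrt(s)
            else:
                lower[i][j] = s / lower[j][j]
    return lower
```

Calling `cholesky(laplace_manhattan_gram(points))` on a list of distinct points should succeed; a single successful factorization is only a sanity check on one sample, not evidence for the general statement.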

LG Nov 8, 2020
Cooperative and Stochastic Multi-Player Multi-Armed Bandit: Optimal Regret With Neither Communication Nor Collisions

Sébastien Bubeck, Thomas Budzinski, Mark Sellke

We consider the cooperative multi-player version of the stochastic multi-armed bandit problem. We study the regime where the players cannot communicate but have access to shared randomness. In prior work by the first two authors, a strategy for this regime was constructed for two players and three arms, with regret $\tilde{O}(\sqrt{T})$, and with no collisions at all between the players (with very high probability). In this paper we show that these properties (near-optimal regret and no collisions at all) are achievable for any number of players and arms. At a high level, the previous strategy heavily relied on a $2$-dimensional geometric intuition that was difficult to generalize in higher dimensions, while here we take a more combinatorial route to build the new strategy.

DS Apr 15, 2020
Online Multiserver Convex Chasing and Optimization

Sébastien Bubeck, Yuval Rabani, Mark Sellke

We introduce the problem of $k$-chasing of convex functions, a simultaneous generalization of both the famous $k$-server problem in $\mathbb{R}^d$, and of the problem of chasing convex bodies and functions. Aside from fundamental interest in this general form, it has natural applications to online $k$-clustering problems with objectives such as $k$-median or $k$-means. We show that this problem exhibits a rich landscape of behavior. In general, if both $k > 1$ and $d > 1$ there does not exist any online algorithm with bounded competitiveness. By contrast, we exhibit a class of nicely behaved functions (which include in particular the above-mentioned clustering problems), for which we show that competitive online algorithms exist, and moreover with dimension-free competitive ratio. We also introduce a parallel question of top-$k$ action regret minimization in the realm of online convex optimization. There, too, a much rougher landscape emerges for $k > 1$. While it is possible to achieve vanishing regret, unlike the top-one action case the rate of vanishing does not speed up for strongly convex functions. Moreover, vanishing regret necessitates both intractable computations and randomness. Finally, we leave open whether almost dimension-free regret is achievable for $k > 1$ and general convex losses. As evidence that it might be possible, we prove dimension-free regret for linear losses via an information-theoretic argument.

GT Feb 3, 2020
The Price of Incentivizing Exploration: A Characterization via Thompson Sampling and Sample Complexity

Mark Sellke, Aleksandrs Slivkins

We consider incentivized exploration: a version of multi-armed bandits where the choice of arms is controlled by self-interested agents, and the algorithm can only issue recommendations. The algorithm controls the flow of information, and the information asymmetry can incentivize the agents to explore. Prior work achieves optimal regret rates up to multiplicative factors that become arbitrarily large depending on the Bayesian priors, and scale exponentially in the number of arms. A more basic problem of sampling each arm once runs into similar factors. We focus on the price of incentives: the loss in performance, broadly construed, incurred for the sake of incentive-compatibility. We prove that Thompson Sampling, a standard bandit algorithm, is incentive-compatible if initialized with sufficiently many data points. The performance loss due to incentives is therefore limited to the initial rounds when these data points are collected. The problem is largely reduced to that of sample complexity: how many rounds are needed? We address this question, providing matching upper and lower bounds and instantiating them in various corollaries. Typically, the optimal sample complexity is polynomial in the number of arms and exponential in the "strength of beliefs".
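The two-phase structure described above (collect initial data, then run Thompson Sampling) can be sketched as follows. This is a minimal illustration under stated assumptions: Bernoulli rewards, independent uniform Beta(1, 1) priors, and an initial phase that simply samples every arm `initial_samples` times. It does not check incentive-compatibility, and the function name and parameters are hypothetical.

```python
import random


def thompson_sampling(true_means, horizon, initial_samples=0, seed=0):
    """Beta-Bernoulli Thompson Sampling preceded by an initial sampling phase.

    Returns the number of pulls of each arm over both phases.
    """
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k

    def pull(arm):
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1

    # Phase 1 (assumed form): gather initial_samples observations per arm.
    for arm in range(k):
        for _ in range(initial_samples):
            pull(arm)

    # Phase 2: draw one posterior sample per arm and play the largest.
    for _ in range(horizon):
        draws = [
            rng.betavariate(1 + successes[i], 1 + failures[i]) for i in range(k)
        ]
        pull(max(range(k), key=draws.__getitem__))
    return pulls
```

For example, `thompson_sampling([0.2, 0.8], horizon=2000, initial_samples=5)` returns per-arm pull counts that sum to 2010.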

LG Apr 28, 2019
Non-Stochastic Multi-Player Multi-Armed Bandits: Optimal Rate With Collision Information, Sublinear Without

Sébastien Bubeck, Yuanzhi Li, Yuval Peres et al.

We consider the non-stochastic version of the (cooperative) multi-player multi-armed bandit problem. The model assumes no communication at all between the players, and furthermore when two (or more) players select the same action this results in a maximal loss. We prove the first $\sqrt{T}$-type regret guarantee for this problem, under the feedback model where collisions are announced to the colliding players. Such a bound was not known even for the simpler stochastic version. We also prove the first sublinear guarantee for the feedback model where collision information is not available, namely $T^{1-\frac{1}{2m}}$ where $m$ is the number of players.

LG Feb 2, 2019
First-Order Bayesian Regret Analysis of Thompson Sampling

Sébastien Bubeck, Mark Sellke

We address online combinatorial optimization when the player has a prior over the adversary's sequence of losses. In this framework, Russo and Van Roy proposed an information-theoretic analysis of Thompson Sampling based on the information ratio, resulting in optimal worst-case regret bounds. In this paper we introduce three novel ideas to this line of work. First we propose a new quantity, the scale-sensitive information ratio, which allows us to obtain more refined first-order regret bounds (i.e., bounds of the form $\sqrt{L^*}$ where $L^*$ is the loss of the best combinatorial action). Second we replace the entropy over combinatorial actions by a coordinate entropy, which allows us to obtain the first optimal worst-case bound for Thompson Sampling in the combinatorial setting. Finally, we introduce a novel link between Bayesian agents and frequentist confidence intervals. Combining these ideas we show that the classical multi-armed bandit first-order regret bound $\tilde{O}(\sqrt{d L^*})$ still holds true in the more challenging and more general semi-bandit scenario. This latter result improves the previous state-of-the-art bound $\tilde{O}(\sqrt{(d+m^3)L^*})$ by Lykouris, Sridharan and Tardos. Moreover we sharpen these results with two technical ingredients. The first leverages a recent insight of Zimmert and Lattimore to replace Shannon entropy with more refined potential functions in the analysis. The second is a Thresholded Thompson sampling algorithm, which slightly modifies the original algorithm by never playing low-probability actions. This thresholding results in fully $T$-independent regret bounds when $L^*$ is almost surely upper-bounded, which we show does not hold for ordinary Thompson sampling.

ML Oct 31, 2017
Approximating Continuous Functions by ReLU Nets of Minimal Width

Boris Hanin, Mark Sellke

This article concerns the expressive power of depth in deep feed-forward neural nets with ReLU activations. Specifically, we answer the following question: for a fixed $d_{in}\geq 1,$ what is the minimal width $w$ so that neural nets with ReLU activations, input dimension $d_{in}$, hidden layer widths at most $w,$ and arbitrary depth can approximate any continuous, real-valued function of $d_{in}$ variables arbitrarily well? It turns out that this minimal width is exactly equal to $d_{in}+1.$ That is, if all the hidden layer widths are bounded by $d_{in}$, then even in the infinite depth limit, ReLU nets can only express a very limited class of functions, and, on the other hand, any continuous function on the $d_{in}$-dimensional unit cube can be approximated to arbitrary precision by ReLU nets in which all hidden layers have width exactly $d_{in}+1.$ Our construction in fact shows that any continuous function $f:[0,1]^{d_{in}}\to\mathbb R^{d_{out}}$ can be approximated by a net of width $d_{in}+d_{out}$. We obtain quantitative depth estimates for such an approximation in terms of the modulus of continuity of $f$.
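To give a feel for how depth can compensate for narrow width, here is a minimal sketch for the special case $d_{in}=1$ and a convex target, $f(x)=x^2$ on $[0,1]$. It evaluates a maximum of tangent lines with a ReLU net whose hidden layers all have width $2 = d_{in}+1$, using the identity $\max(u, v) = v + \mathrm{ReLU}(u - v)$. This illustrates the flavor of narrow-net constructions only; it is not the general construction from the paper, and the function names are hypothetical.

```python
def relu(v):
    return max(v, 0.0)


def narrow_relu_net(x, lines):
    """Evaluate max_k (a_k * x + b_k) with a ReLU net of hidden width 2.

    Every hidden layer has two neurons: one copies x (valid because
    x >= 0, so relu(x) == x) and one holds a running value. Assumes the
    running maximum is nonnegative, which holds when lines[0] == (0.0, 0.0).
    """
    carry = relu(x)
    a0, b0 = lines[0]
    acc = relu(a0 * x + b0)
    for a, b in lines[1:]:
        # Layer A: neurons (x, relu(acc - line(x))).
        gap = relu(acc - (a * carry + b))
        carry = relu(carry)
        # Layer B: neurons (x, relu(gap + line(x))) == (x, max(acc, line(x))).
        acc = relu(gap + a * carry + b)
        carry = relu(carry)
    return acc


def tangent_lines(num_pieces):
    """Tangents to x^2 at t = k / num_pieces: slope 2t, intercept -t^2."""
    return [
        (2.0 * k / num_pieces, -((k / num_pieces) ** 2))
        for k in range(num_pieces + 1)
    ]
```

With `num_pieces = K`, the net has about $2K$ hidden layers and its output is within $1/(4K^2)$ of $x^2$ on $[0,1]$, since the tangent at the nearest grid point $t$ misses $x^2$ by $(x-t)^2$.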