PR · Jun 3
A remark on the majorizing measures theorem for general processes
Reese Pathak, Nikita Zhivotovskiy · eth-zurich
We show that the lower bound in the majorizing measures theorem holds for a large class of random vectors. Specifically, suppose $X \sim μ$ is a centered random vector in $\mathbf{R}^n$ with \[ C_{\mathrm{KL}}(μ) = \sup_{\substack{θ\neq η\\ θ, η\in \mathbf{R}^n}} \frac{\mathrm{KL}(μ_θ\| μ_η)}{\|θ- η\|_2^2} < \infty, \] where $μ_θ$ denotes the law of the translate $θ+ X$. Then, for every nonempty, bounded $T \subset \mathbf{R}^n$, \[ \sqrt{C_{\mathrm{KL}}(μ)}\, \mathbf{E}_μ\Big[\sup_{t \in T} \, \langle X, t \rangle \Big] \gtrsim γ_2(T), \] where the right-hand side denotes Talagrand's generic chaining functional. This result recovers, as a special case, the lower bound in the majorizing measures theorem for centered Gaussian processes. Our argument relies critically on the rate-distortion integral, recently introduced by J. Liu.
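As a quick sanity check on the Gaussian special case (a standard computation, not quoted from the paper): if $μ = N(0, I_n)$, then $μ_θ = N(θ, I_n)$ and \[ \mathrm{KL}(μ_θ \| μ_η) = \tfrac{1}{2}\|θ - η\|_2^2 \quad\text{for all } θ, η \in \mathbf{R}^n, \qquad\text{so}\qquad C_{\mathrm{KL}}(μ) = \tfrac{1}{2}. \] The displayed inequality then reads $\mathbf{E}\big[\sup_{t \in T} \langle X, t \rangle\big] \gtrsim γ_2(T)$, with the factor $\sqrt{1/2}$ absorbed into the implicit constant.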
PR · Jun 3
Gaussian Width of Convex Sets via Integral Decompositions, Projections, and the Distribution of Intrinsic Volumes
Reese Pathak, Nikita Zhivotovskiy
We revisit the problem of bounding the expected supremum of a canonical Gaussian process indexed by a convex set $T \subset \mathbf{R}^d$. We develop two decompositions for the Gaussian width, based on the geometry of the index set. The first decomposition involves metric projections of Gaussians onto rescaled copies of $T$. The second involves fixed points arising from a quadratically penalized variant of the local width. Neither decomposition directly invokes generic chaining constructions. Our results make use of recent work in geometric analysis and Gaussian processes. First, the work of Chatterjee [Ann. Statist., 2014] characterizes the behavior of the metric projection of a Gaussian random vector onto rescaled copies of $T$ via a variational problem involving localized Gaussian widths. We use these bounds to develop decompositions of the Gaussian width using the local metric structure of $T$. Second, we leverage the work of Vitale [Ann. Probab., 1996] to form a connection between the Wills functional (and hence the intrinsic volumes of $T$) and the first terms that appear in our decompositions. Finally, invoking recent work by Mourtada [J. Eur. Math. Soc., 2025] on the logarithm of the Wills functional, we show that the width is controlled by a single "peak index" of the intrinsic volumes. In the worst case, our bound recovers a local form of the classical Dudley integral.
ST · May 6, 2022
Optimally tackling covariate shift in RKHS-based nonparametric regression
Cong Ma, Reese Pathak, Martin J. Wainwright
We study the covariate shift problem in the context of nonparametric regression over a reproducing kernel Hilbert space (RKHS). We focus on two natural families of covariate shift problems defined using the likelihood ratios between the source and target distributions. When the likelihood ratios are uniformly bounded, we prove that the kernel ridge regression (KRR) estimator with a carefully chosen regularization parameter is minimax rate-optimal (up to a log factor) for a large family of RKHSs with regular kernel eigenvalues. Interestingly, KRR does not require full knowledge of likelihood ratios apart from an upper bound on them. In striking contrast to the standard statistical setting without covariate shift, we also demonstrate that a naive estimator, which minimizes the empirical risk over the function class, is strictly sub-optimal under covariate shift as compared to KRR. We then address the larger class of covariate shift problems where the likelihood ratio is possibly unbounded yet has a finite second moment. Here, we propose a reweighted KRR estimator that weights samples based on a careful truncation of the likelihood ratios. Again, we are able to show that this estimator is minimax rate-optimal, up to logarithmic factors.
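To make the two estimators in this abstract concrete, here is a minimal NumPy sketch of (optionally reweighted) kernel ridge regression. It is an illustration under stated assumptions, not the paper's implementation: the Gaussian kernel, the function names, the regularization level `lam`, and the clipping rule `min(ratio, tau)` used for truncation are all hypothetical choices made for the example.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    """Gaussian kernel matrix between rows of A and rows of B (assumed kernel)."""
    sq = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def truncated_weights(ratios, tau):
    """Clip likelihood ratios at level tau (one plausible truncation; an assumption)."""
    return np.minimum(ratios, tau)

def fit_weighted_krr(X, y, lam, weights=None, bandwidth=1.0):
    """Minimize (1/n) * sum_i w_i (y_i - f(x_i))^2 + lam * ||f||_H^2.

    With f = sum_j alpha_j k(x_j, .), the first-order condition is
    (W K + n * lam * I) alpha = W y. Passing weights=None gives plain KRR.
    """
    n = X.shape[0]
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    K = gaussian_kernel(X, X, bandwidth)
    return np.linalg.solve(w[:, None] * K + n * lam * np.eye(n), w * y)

def predict(alpha, X_train, X_new, bandwidth=1.0):
    """Evaluate the fitted function at new points."""
    return gaussian_kernel(X_new, X_train, bandwidth) @ alpha
```

In the bounded-ratio regime one would call `fit_weighted_krr(X, y, lam)` with no weights. In the unbounded regime one would pass `weights=truncated_weights(ratios, tau)`, where `ratios` holds the likelihood ratios evaluated at the source samples.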
LG · Nov 14, 2023
Transformers can optimally learn regression mixture models
Reese Pathak, Rajat Sen, Weihao Kong et al.
Mixture models arise in many regression problems, but most methods have seen limited adoption, partly due to these algorithms' highly tailored and model-specific nature. On the other hand, transformers are flexible neural sequence models that present the intriguing possibility of providing general-purpose prediction methods, even in this mixture setting. In this work, we investigate the hypothesis that transformers can learn an optimal predictor for mixtures of regressions. We construct a generative process for a mixture of linear regressions for which the decision-theoretic optimal procedure is given by data-driven exponential weights on a finite set of parameters. We observe that transformers achieve low mean-squared error on data generated via this process. By probing the transformer's output at inference time, we also show that transformers typically make predictions that are close to the optimal predictor. Our experiments also demonstrate that transformers can learn mixtures of regressions in a sample-efficient fashion and are somewhat robust to distribution shifts. We complement our experimental observations by proving constructively that the decision-theoretic optimal procedure is indeed implementable by a transformer.
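The "data-driven exponential weights" predictor can be sketched as a posterior mean over a finite parameter set. The sketch below assumes a uniform prior over the candidate parameters and Gaussian noise with known standard deviation `sigma`; these modeling details, and the function name, are assumptions made for illustration rather than a transcription of the paper's setup.

```python
import numpy as np

def exp_weights_predict(X, y, x_new, params, sigma=1.0):
    """Posterior-mean prediction for a mixture of linear regressions.

    X: (k, d) observed covariates, y: (k,) observed responses,
    x_new: (d,) query point, params: (m, d) candidate regression vectors.
    Assumes a uniform prior over params and N(0, sigma^2) noise.
    """
    residuals = y[None, :] - params @ X.T               # (m, k)
    log_w = -0.5 * np.sum(residuals**2, axis=1) / sigma**2
    log_w -= log_w.max()                                 # stabilize the softmax
    w = np.exp(log_w)
    w /= w.sum()                                         # exponential weights
    return float(w @ (params @ x_new)), w
```

With no observed examples the weights are uniform, and as more examples arrive they concentrate on the candidate that best explains the data.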
ME · Mar 28, 2024
Data-Adaptive Tradeoffs among Multiple Risks in Distribution-Free Prediction
Drew T. Nguyen, Reese Pathak, Anastasios N. Angelopoulos et al. · berkeley
Decision-making pipelines are generally characterized by tradeoffs among various risk functions. It is often desirable to manage such tradeoffs in a data-adaptive manner. As we demonstrate, if this is done naively, state-of-the-art uncertainty quantification methods can lead to significant violations of putative risk guarantees. To address this issue, we develop methods that permit valid control of risk when threshold and tradeoff parameters are chosen adaptively. Our methodology supports monotone and nearly-monotone risks, but otherwise makes no distributional assumptions. To illustrate the benefits of our approach, we carry out numerical experiments on synthetic data and the large-scale vision dataset MS-COCO.
ST · Feb 6, 2022
A new similarity measure for covariate shift with applications to nonparametric regression
Reese Pathak, Cong Ma, Martin J. Wainwright
We study covariate shift in the context of nonparametric regression. We introduce a new measure of distribution mismatch between the source and target distributions that is based on the integrated ratio of probabilities of balls at a given radius. We use the scaling of this measure with respect to the radius to characterize the minimax rate of estimation over a family of Hölder continuous functions under covariate shift. In comparison to the recently proposed notion of transfer exponent, this measure leads to a sharper rate of convergence and is more fine-grained. We accompany our theory with concrete instances of covariate shift that illustrate this sharp difference.
LG · Oct 26, 2021
Cluster-and-Conquer: A Framework For Time-Series Forecasting
Reese Pathak, Rajat Sen, Nikhil Rao et al.
We propose a three-stage framework for forecasting high-dimensional time-series data. Our method first estimates parameters for each univariate time series. Next, we use these parameters to cluster the time series. These clusters can be viewed as multivariate time series, for which we then compute parameters. The forecasted values of a single time series can depend on the history of other time series in the same cluster, accounting for intra-cluster similarity while minimizing potential noise in predictions by ignoring inter-cluster effects. Our framework -- which we refer to as "cluster-and-conquer" -- is highly general, allowing for any time-series forecasting and clustering method to be used in each step. It is computationally efficient and embarrassingly parallel. We motivate our framework with a theoretical analysis in an idealized mixed linear regression setting, where we provide guarantees on the quality of the estimates. We accompany these guarantees with experimental results that demonstrate the advantages of our framework: when instantiated with simple linear autoregressive models, we are able to achieve state-of-the-art results on several benchmark datasets, sometimes outperforming deep-learning-based approaches.
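The three stages described above can be sketched as follows. This is a toy instantiation under assumptions: least-squares AR(p) fits for stage one, a plain Lloyd's k-means on the fitted coefficients for stage two, and a least-squares VAR(p) per cluster for stage three. The function names and these particular choices are hypothetical; the abstract states that any forecasting and clustering method can be plugged into each step.

```python
import numpy as np

def fit_ar(series, p):
    """Stage 1: least-squares AR(p) coefficients for one univariate series."""
    T = len(series)
    lags = np.column_stack([series[p - j - 1 : T - j - 1] for j in range(p)])
    coef, *_ = np.linalg.lstsq(lags, series[p:], rcond=None)
    return coef

def kmeans(points, k, n_iter=50, seed=0):
    """Stage 2: plain Lloyd's algorithm (a stand-in for any clustering method)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels

def fit_var(block, p):
    """Stage 3: least-squares VAR(p) for the (T, m) block of one cluster."""
    T = block.shape[0]
    lags = np.hstack([block[p - j - 1 : T - j - 1] for j in range(p)])
    coef, *_ = np.linalg.lstsq(lags, block[p:], rcond=None)
    return coef  # shape (m * p, m)

def cluster_and_conquer(Y, p, k):
    """Run the three stages on Y of shape (T, N): T time steps, N series."""
    ar_params = np.stack([fit_ar(Y[:, i], p) for i in range(Y.shape[1])])
    labels = kmeans(ar_params, k)
    models = {c: fit_var(Y[:, labels == c], p)
              for c in range(k) if np.any(labels == c)}
    return labels, models
```

Stage one and stage three are independent across series and across clusters respectively, which is what makes the scheme easy to parallelize.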
LG · May 11, 2020
FedSplit: An algorithmic framework for fast federated optimization
Reese Pathak, Martin J. Wainwright
Motivated by federated learning, we consider the hub-and-spoke model of distributed optimization in which a central authority coordinates the computation of a solution among many agents while limiting communication. We first study some past procedures for federated optimization, and show that their fixed points need not correspond to stationary points of the original optimization problem, even in simple convex settings with deterministic updates. In order to remedy these issues, we introduce FedSplit, a class of algorithms based on operator splitting procedures for solving distributed convex minimization with additive structure. We prove that these procedures have the correct fixed points, corresponding to optima of the original optimization problem, and we characterize their convergence rates under different settings. Our theory shows that these methods are provably robust to inexact computation of intermediate local quantities. We complement our theory with some simple experiments that demonstrate the benefits of our methods in practice.
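To illustrate what an operator-splitting scheme with the correct fixed points can look like, here is a small sketch of a Peaceman-Rachford-style splitting on the consensus reformulation, specialized to least-squares local losses $f_j(x) = \tfrac{1}{2}\|A_j x - b_j\|_2^2$. The update rules, the step size `s`, and the function names are assumptions made for this example and should not be read as a transcription of the paper's algorithm.

```python
import numpy as np

def prox_least_squares(v, A, b, s):
    """Exact prox of s * 0.5 * ||A x - b||^2 at v: solve (I + s A^T A) x = v + s A^T b."""
    d = A.shape[1]
    return np.linalg.solve(np.eye(d) + s * A.T @ A, v + s * A.T @ b)

def splitting_consensus(As, bs, s=0.1, n_rounds=500):
    """Peaceman-Rachford-style splitting for min_x sum_j 0.5 * ||A_j x - b_j||^2.

    Each agent j keeps a local variable z[j]; the hub only averages them.
    """
    m, d = len(As), As[0].shape[1]
    z = np.zeros((m, d))
    x = z.mean(axis=0)
    for _ in range(n_rounds):
        for j in range(m):
            # Local prox step at the reflected point 2x - z_j.
            z_half = prox_least_squares(2 * x - z[j], As[j], bs[j], s)
            # Local centering step.
            z[j] = z[j] + 2 * (z_half - x)
        # Hub step: average the local variables.
        x = z.mean(axis=0)
    return x
```

At a fixed point every local prox output equals the average `x`, which forces `x - z[j] = s * grad f_j(x)` for each agent. Averaging over agents gives a zero sum of gradients, so `x` minimizes the original objective.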