MLMay 28
Improved Distribution Estimation in $\ell_\infty$Doron Cohen, Aryeh Kontorovich, Yonatan Livshitz
We present improved bounds for estimating discrete probability distributions under the $\ell_\infty$ norm. These include minimax bounds in expectation and high-probability tail bounds. We resolve some of the open questions posed in Kontorovich and Painsky (JMLR, 2025) -- including a fully empirical version of the tightest risk bound they presented and identifying the form of the worst-case extremal distribution. Encouraging empirical results are reported as well.
LGDec 8, 2022
Differentially-Private Bayes ConsistencyOlivier Bousquet, Haim Kaplan, Aryeh Kontorovich et al.
We construct a universally Bayes consistent learning rule that satisfies differential privacy (DP). We first handle the setting of binary classification and then extend our rule to the more general setting of density estimation (with respect to the total variation metric). The existence of a universally consistent DP learner reveals a stark difference with the distribution-free PAC model. Indeed, in the latter DP learning is extremely limited: even one-dimensional linear classifiers are not privately learnable in this stringent model. Our result thus demonstrates that by allowing the learning rate to depend on the target distribution, one can circumvent the above-mentioned impossibility result and in fact, learn \emph{arbitrary} distributions by a single DP algorithm. As an application, we prove that any VC class can be privately learned in a semi-supervised setting with a near-optimal \emph{labeled} sample complexity of $\tilde{O}(d/\varepsilon)$ labeled examples (and with an unlabeled sample complexity that can depend on the target distribution).
LGOct 3, 2023
Splitting the Difference on Adversarial TrainingMatan Levi, Aryeh Kontorovich
The existence of adversarial examples points to a basic weakness of deep neural networks. One of the most effective defenses against such examples, adversarial training, entails training models with some degree of robustness, usually at the expense of a degraded natural accuracy. Most adversarial training methods aim to learn a model that finds, for each class, a common decision boundary encompassing both the clean and perturbed examples. In this work, we take a fundamentally different approach by treating the perturbed examples of each class as a separate class to be learned, effectively splitting each class into two classes: "clean" and "adversarial." This split doubles the number of classes to be learned, but at the same time considerably simplifies the decision boundaries. We provide a theoretical plausibility argument that sheds some light on the conditions under which our approach can be expected to be beneficial. Likewise, we empirically demonstrate that our method learns robust models while attaining optimal or near-optimal natural accuracy, e.g., on CIFAR-10 we obtain near-optimal natural accuracy of $95.01\%$ alongside significant robustness across multiple tasks. The ability to achieve such near-optimal natural accuracy, while maintaining a significant level of robustness, makes our method applicable to real-world applications where natural accuracy is at a premium. As a whole, our main contribution is a general method that confers a significant level of robustness upon classifiers with only minor or negligible degradation of their natural accuracy.
LGFeb 12, 2023
Near-optimal learning with average Hölder smoothnessSteve Hanneke, Aryeh Kontorovich, Guy Kornowski
We generalize the notion of average Lipschitz smoothness proposed by Ashlagi et al. (COLT 2021) by extending it to Hölder smoothness. This measure of the "effective smoothness" of a function is sensitive to the underlying distribution and can be dramatically smaller than its classic "worst-case" Hölder constant. We consider both the realizable and the agnostic (noisy) regression settings, proving upper and lower risk bounds in terms of the average Hölder smoothness; these rates improve upon both previously known rates even in the special case of average Lipschitz smoothness. Moreover, our lower bound is tight in the realizable setting up to log factors, thus we establish the minimax rate. From an algorithmic perspective, since our notion of average smoothness is defined with respect to the unknown underlying distribution, the learner does not have an explicit representation of the function class, hence is unable to execute ERM. Nevertheless, we provide distinct learning algorithms that achieve both (nearly) optimal learning rates. Our results hold in any totally bounded metric space, and are stated in terms of its intrinsic geometry. Overall, our results show that the classic worst-case notion of Hölder smoothness can be essentially replaced by its average, yielding considerably sharper guarantees.
PRJul 23, 2024
Sharp bounds on aggregate expert errorAryeh Kontorovich, Ariel Avital
We revisit the classic problem of aggregating binary advice from conditionally independent experts, also known as the Naive Bayes setting. Our quantity of interest is the error probability of the optimal decision rule. In the case of symmetric errors (sensitivity = specificity), reasonably tight bounds on the optimal error probability are known. In the general asymmetric case, we are not aware of any nontrivial estimates on this quantity. Our contribution consists of sharp upper and lower bounds on the optimal error probability in the general case, which recover and sharpen the best known results in the symmetric special case. Since this turns out to be equivalent to estimating the total variation distance between two product distributions, our results also have bearing on this important and challenging problem.
LGSep 29, 2023
Efficient Agnostic Learning with Average SmoothnessSteve Hanneke, Aryeh Kontorovich, Guy Kornowski
We study distribution-free nonparametric regression following a notion of average smoothness initiated by Ashlagi et al. (2021), which measures the "effective" smoothness of a function with respect to an arbitrary unknown underlying distribution. While the recent work of Hanneke et al. (2023) established tight uniform convergence bounds for average-smooth functions in the realizable case and provided a computationally efficient realizable learning algorithm, both of these results currently lack analogs in the general agnostic (i.e. noisy) case. In this work, we fully close these gaps. First, we provide a distribution-free uniform convergence bound for average-smoothness classes in the agnostic setting. Second, we match the derived sample complexity with a computationally efficient agnostic learning algorithm. Our results, which are stated in terms of the intrinsic geometry of the data and hold over any totally bounded metric space, show that the guarantees recently obtained for realizable learning of average-smooth functions transfer to the agnostic setting. At the heart of our proof, we establish the uniform convergence rate of a function class in terms of its bracketing entropy, which may be of independent interest.
LGJul 18, 2025Code
ParallelTime: Dynamically Weighting the Balance of Short- and Long-Term Temporal DependenciesItay Katav, Aryeh Kontorovich
Modern multivariate time series forecasting primarily relies on two architectures: the Transformer with attention mechanism and Mamba. In natural language processing, an approach has been used that combines local window attention for capturing short-term dependencies and Mamba for capturing long-term dependencies, with their outputs averaged to assign equal weight to both. We find that for time-series forecasting tasks, assigning equal weight to long-term and short-term dependencies is not optimal. To mitigate this, we propose a dynamic weighting mechanism, ParallelTime Weighter, which calculates interdependent weights for long-term and short-term dependencies for each token based on the input and the model's knowledge. Furthermore, we introduce the ParallelTime architecture, which incorporates the ParallelTime Weighter mechanism to deliver state-of-the-art performance across diverse benchmarks. Our architecture demonstrates robustness, achieves lower FLOPs, requires fewer parameters, scales effectively to longer prediction horizons, and significantly outperforms existing methods. These advances highlight a promising path for future developments of parallel Attention-Mamba in time series forecasting. The implementation is readily available at: \href{https://github.com/itay1551/ParallelTime}{GitHub}.
CRMay 29, 2017Code
Temporal anomaly detection: calibrating the surpriseEyal Gutflaish, Aryeh Kontorovich, Sivan Sabato et al.
We propose a hybrid approach to temporal anomaly detection in access data of users to databases --- or more generally, any kind of subject-object co-occurrence data. We consider a high-dimensional setting that also requires fast computation at test time. Our methodology identifies anomalies based on a single stationary model, instead of requiring a full temporal one, which would be prohibitive in this setting. We learn a low-rank stationary model from the training data, and then fit a regression model for predicting the expected likelihood score of normal access patterns in the future. The disparity between the predicted likelihood score and the observed one is used to assess the `surprise' at test time. This approach enables calibration of the anomaly score, so that time-varying normal behavior patterns are not considered anomalous. We provide a detailed description of the algorithm, including a convergence analysis, and report encouraging empirical results. One of the data sets that we tested, TDA, is new for the public domain. It consists of two months' worth of database access records from a live system. Our code is publicly available at https://github.com/eyalgut/TLR_anomaly_detection.git. The TDA data set is available at https://www.kaggle.com/eyalgut/binary-traffic-matrices.
LGMay 7
A Fine-Grained Understanding of Uniform Convergence for HalfspacesAryeh Kontorovich, Kasper Green Larsen
We study the fine-grained uniform convergence behavior of halfspaces beyond worst-case VC bounds. For inhomogeneous halfspaces in $\mathbb{R}^d$ with $d\ge 2$, we show that standard first-order VC bounds are essentially tight: even consistent hypotheses can incur population error $Θ(d\ln(n/d)/n)$, and in the agnostic setting the deviation scales as $\sqrt{τ\ln(1/τ)}$ at true error $τ$. In contrast, homogeneous halfspaces in $\mathbb{R}^2$ exhibit a markedly different behavior. In the realizable case, every hypothesis consistent with the sample has error $O(1/n)$. In the agnostic case, we prove a bandwise, log-free deviation bound on each dyadic risk band via a critical-wedge localization argument. Unioning over bands incurs only a $\ln\ln n$ overhead, and we establish a matching lower bound showing this overhead is unavoidable. Together, these results give a fine-grained and nearly complete picture of uniform convergence for halfspaces, revealing sharp dimensional and structural thresholds.
LGMay 5
Realizable Bayes-Consistency for General Metric LossesDan Tsir Cohen, Steve Hanneke, Aryeh Kontorovich
We study strong universal Bayes-consistency in the realizable setting for learning with general metric losses, extending classical characterizations beyond $0$-$1$ classification \citep{bousquet_theory_2021, hanneke2021universalbayesconsistencymetric} and real-valued regression \citep{attias_universal_2024}. Given an instance space $(\mathcal X,ρ)$, a label space $(\mathcal Y,\ell)$ with possibly unbounded loss, and a hypothesis class $\mathcal H \subseteq \mathcal Y^{\mathcal X}$, we resolve the realizable case of an open problem presented in \citet{pmlr-v178-cohen22a}. Specifically, we find the necessary and sufficient conditions on the hypothesis class $\mathcal H$ under which there exists a distribution-free learning rule whose risk converges almost surely to the best-in-class risk (which is zero) for every realizable data-generating distribution. Our main contribution is this sharp characterization in terms of a combinatorial obstruction: Similarly to \citet{attias2024optimallearnersrealizableregression}, we introduce the notion of an infinite non-decreasing $(γ_k)$-Littlestone tree, where $γ_k \to \infty$. This extends the Littlestone tree structure used in \citet{bousquet_theory_2021} to the metric loss setting.
STFeb 13, 2024
Distribution Estimation under the Infinity NormAryeh Kontorovich, Amichai Painsky
We present novel bounds for estimating discrete probability distributions under the $\ell_\infty$ norm. These are nearly optimal in various precise senses, including a kind of instance-optimality. Our data-dependent convergence guarantees for the maximum likelihood estimator significantly improve upon the currently known results. A variety of techniques are utilized and innovated upon, including Chernoff-type inequalities and empirical Bernstein bounds. We illustrate our results in synthetic and real-world experiments. Finally, we apply our proposed framework to a basic selective inference problem, where we estimate the most frequent probabilities in a sample.
LGFeb 7, 2022
Metric-valued regressionDan Tsir Cohen, Aryeh Kontorovich
We propose an efficient algorithm for learning mappings between two metric spaces, $\X$ and $\Y$. Our procedure is strongly Bayes-consistent whenever $\X$ and $\Y$ are topologically separable and $\Y$ is "bounded in expectation" (our term; the separability assumption can be somewhat weakened). At this level of generality, ours is the first such learnability result for unbounded loss in the agnostic setting. Our technique is based on metric medoids (a variant of Fréchet means) and presents a significant departure from existing methods, which, as we demonstrate, fail to achieve Bayes-consistency on general instance- and label-space metrics. Our proofs introduce the technique of {\em semi-stable compression}, which may be of independent interest.
LGJan 21, 2022
Adaptive Data Analysis with Correlated ObservationsAryeh Kontorovich, Menachem Sadigurschi, Uri Stemmer
The vast majority of the work on adaptive data analysis focuses on the case where the samples in the dataset are independent. Several approaches and tools have been successfully applied in this context, such as differential privacy, max-information, compression arguments, and more. The situation is far less well-understood without the independence assumption. We embark on a systematic study of the possibilities of adaptive data analysis with correlated observations. First, we show that, in some cases, differential privacy guarantees generalization even when there are dependencies within the sample, which we quantify using a notion we call Gibbs-dependence. We complement this result with a tight negative example. Second, we show that the connection between transcript-compression and adaptive data analysis can be extended to the non-iid setting.
STNov 23, 2021
Tree density estimationLászló Györfi, Aryeh Kontorovich, Roi Weiss
We study the problem of estimating the density $f(\boldsymbol x)$ of a random vector ${\boldsymbol X}$ in $\mathbb R^d$. For a spanning tree $T$ defined on the vertex set $\{1,\dots ,d\}$, the tree density $f_{T}$ is a product of bivariate conditional densities. An optimal spanning tree minimizes the Kullback-Leibler divergence between $f$ and $f_{T}$. From i.i.d. data we identify an optimal tree $T^*$ and efficiently construct a tree density estimate $f_n$ such that, without any regularity conditions on the density $f$, one has $\lim_{n\to \infty} \int |f_n(\boldsymbol x)-f_{T^*}(\boldsymbol x)|d\boldsymbol x=0$ a.s. For Lipschitz $f$ with bounded support, $\mathbb E \left\{ \int |f_n(\boldsymbol x)-f_{T^*}(\boldsymbol x)|d\boldsymbol x\right\}=O\big(n^{-1/4}\big)$, a dimension-free rate.
FAOct 10, 2021
Fat-Shattering Dimension of $k$-fold AggregationsIdan Attias, Aryeh Kontorovich
We provide estimates on the fat-shattering dimension of aggregation rules of real-valued function classes. The latter consists of all ways of choosing $k$ functions, one from each of the $k$ classes, and computing a pointwise function of them, such as the median, mean, and maximum. The bound is stated in terms of the fat-shattering dimensions of the component classes. For linear and affine function classes, we provide a considerably sharper upper bound and a matching lower bound, achieving, in particular, an optimal dependence on $k$. Along the way, we improve several known results in addition to pointing out and correcting a number of erroneous claims in the literature.
LGApr 1, 2021
Domain Invariant Adversarial LearningMatan Levi, Idan Attias, Aryeh Kontorovich
The phenomenon of adversarial examples illustrates one of the most basic vulnerabilities of deep neural networks. Among the variety of techniques introduced to surmount this inherent weakness, adversarial training has emerged as the most effective strategy for learning robust models. Typically, this is achieved by balancing robust and natural objectives. In this work, we aim to further optimize the trade-off between robust and standard accuracy by enforcing a domain-invariant feature representation. We present a new adversarial training method, Domain Invariant Adversarial Learning (DIAL), which learns a feature representation that is both robust and domain invariant. DIAL uses a variant of Domain Adversarial Neural Network (DANN) on the natural domain and its corresponding adversarial domain. In the case where the source domain consists of natural examples and the target domain is the adversarially perturbed examples, our method learns a feature representation constrained not to discriminate between the natural and adversarial examples, and can therefore achieve a more robust representation. DIAL is a generic and modular technique that can be easily incorporated into any adversarial training method. Our experiments indicate that incorporating DIAL in the adversarial training process improves both robustness and standard accuracy.
LGNov 9, 2020
Stable Sample Compression Schemes: New Applications and an Optimal SVM Margin BoundSteve Hanneke, Aryeh Kontorovich
We analyze a family of supervised learning algorithms based on sample compression schemes that are stable, in the sense that removing points from the training set which were not selected for the compression set does not alter the resulting classifier. We use this technique to derive a variety of novel or improved data-dependent generalization bounds for several learning algorithms. In particular, we prove a new margin bound for SVM, removing a log factor. The new bound is provably optimal. This resolves a long-standing open question about the PAC margin bounds achievable by SVM.
LGOct 19, 2020
Non-parametric Binary regression in metric spaces with KL lossAriel Avital, Klim Efremenko, Aryeh Kontorovich et al.
We propose a non-parametric variant of binary regression, where the hypothesis is regularized to be a Lipschitz function taking a metric space to [0,1] and the loss is logarithmic. This setting presents novel computational and statistical challenges. On the computational front, we derive a novel efficient optimization algorithm based on interior point methods; an attractive feature is that it is parameter-free (i.e., does not require tuning an update step size). On the statistical front, the unbounded loss function presents a problem for classic generalization bounds, based on covering-number and Rademacher techniques. We get around this challenge via an adaptive truncation approach, and also present a lower bound indicating that the truncation is, in some sense, necessary.
STJul 13, 2020
Functions with average smoothness: structure, algorithms, and learningYair Ashlagi, Lee-Ad Gottlieb, Aryeh Kontorovich
We initiate a program of average smoothness analysis for efficiently learning real-valued functions on metric spaces. Rather than using the Lipschitz constant as the regularizer, we define a local slope at each point and gauge the function complexity as the average of these values. Since the mean can be dramatically smaller than the maximum, this complexity measure can yield considerably sharper generalization bounds -- assuming that these admit a refinement where the Lipschitz constant is replaced by our average of local slopes. Our first major contribution is to obtain just such distribution-sensitive bounds. This required overcoming a number of technical challenges, perhaps the most formidable of which was bounding the {\em empirical} covering numbers, which can be much worse-behaved than the ambient ones. Our combinatorial results are accompanied by efficient algorithms for smoothing the labels of the random sample, as well as guarantees that the extension from the sample to the whole space will continue to be, with high probability, smooth on average. Along the way we discover a surprisingly rich combinatorial and analytic structure in the function class we define.
LGMar 30, 2020
On Biased Random Walks, Corrupted Intervals, and Learning Under Adversarial DesignDaniel Berend, Aryeh Kontorovich, Lev Reyzin et al.
We tackle some fundamental problems in probability theory on corrupted random processes on the integer line. We analyze when a biased random walk is expected to reach its bottommost point and when intervals of integer points can be detected under a natural model of noise. We apply these results to problems in learning thresholds and intervals under a new model for learning under adversarial design.
LGFeb 5, 2020
Nested Barycentric Coordinate System as an Explicit Feature MapLee-Ad Gottlieb, Eran Kaufman, Aryeh Kontorovich et al.
We propose a new embedding method which is particularly well-suited for settings where the sample size greatly exceeds the ambient dimension. Our technique consists of partitioning the space into simplices and then embedding the data points into features corresponding to the simplices' barycentric coordinates. We then train a linear classifier in the rich feature space obtained from the simplices. The decision boundary may be highly non-linear, though it is linear within each simplex (and hence piecewise-linear overall). Further, our method can approximate any convex body. We give generalization bounds based on empirical margin and a novel hybrid sample compression technique. An extensive empirical evaluation shows that our method consistently outperforms a range of popular kernel embedding methods.
LGFeb 4, 2020
Apportioned Margin Approach for Cost Sensitive Large Margin ClassifiersLee-Ad Gottlieb, Eran Kaufman, Aryeh Kontorovich
We consider the problem of cost sensitive multiclass classification, where we would like to increase the sensitivity of an important class at the expense of a less important one. We adopt an {\em apportioned margin} framework to address this problem, which enables an efficient margin shift between classes that share the same boundary. The decision boundary between all pairs of classes divides the margin between them in accordance to a given prioritization vector, which yields a tighter error bound for the important classes while also reducing the overall out-of-sample error. In addition to demonstrating an efficient implementation of our framework, we derive generalization bounds, demonstrate Fisher consistency, adapt the framework to Mercer's kernel and to neural networks, and report promising empirical results on all accounts.
STOct 7, 2019
Fast and Bayes-consistent nearest neighborsKlim Efremenko, Aryeh Kontorovich, Moshe Noivirt
Research on nearest-neighbor methods tends to focus somewhat dichotomously either on the statistical or the computational aspects -- either on, say, Bayes consistency and rates of convergence or on techniques for speeding up the proximity search. This paper aims at bridging these realms: to reap the advantages of fast evaluation time while maintaining Bayes consistency, and further without sacrificing too much in the risk decay rate. We combine the locality-sensitive hashing (LSH) technique with a novel missing-mass argument to obtain a fast and Bayes-consistent classifier. Our algorithm's prediction runtime compares favorably against state of the art approximate NN methods, while maintaining Bayes-consistency and attaining rates comparable to minimax. On samples of size $n$ in $\R^d$, our pre-processing phase has runtime $O(d n \log n)$, while the evaluation phase has runtime $O(d\log n)$ per query point.
LGJun 24, 2019
Universal Bayes consistency in metric spacesSteve Hanneke, Aryeh Kontorovich, Sivan Sabato et al.
We extend a recently proposed 1-nearest-neighbor based multiclass learning algorithm and prove that our modification is universally strongly Bayes-consistent in all metric spaces admitting any such learner, making it an "optimistically universal" Bayes-consistent learner. This is the first learning algorithm known to enjoy this property; by comparison, the $k$-NN classifier and its variants are not generally universally Bayes-consistent, except under additional structural assumptions, such as an inner product, a norm, finite dimension, or a Besicovitch-type property. The metric spaces in which universal Bayes consistency is possible are the "essentially separable" ones -- a notion that we define, which is more general than standard separability. The existence of metric spaces that are not essentially separable is widely believed to be independent of the ZFC axioms of set theory. We prove that essential separability exactly characterizes the existence of a universal Bayes-consistent learner for the given metric space. In particular, this yields the first impossibility result for universal Bayes consistency. Taken together, our results completely characterize strong and weak universal Bayes consistency in metric spaces.
LGMay 28, 2019
Efficient Kirszbraun Extension with Applications to RegressionHanan Zaichyk, Armin Biess, Aryeh Kontorovich et al.
We introduce a framework for performing regression between two Hilbert spaces. This is done based on Kirszbraun's extension theorem, to the best of our knowledge, the first application of this technique to supervised learning. We analyze the statistical and computational aspects of this method. We decompose this task into two stages: training (which corresponds operationally to smoothing/regularization) and prediction (which is achieved via Kirszbraun extension). Both are solved algorithmically via a novel multiplicative weight updates (MWU) scheme, which, for our problem formulation, achieves a quadratic runtime improvement over the state of the art. Our empirical results indicate a dramatic improvement over standard off-the-shelf solvers in our setting.
STFeb 1, 2019
Estimating the Mixing Time of Ergodic Markov ChainsGeoffrey Wolfer, Aryeh Kontorovich
We address the problem of estimating the mixing time $t_{\mathsf{mix}}$ of an arbitrary ergodic finite-state Markov chain from a single trajectory of length $m$. The reversible case was addressed by Hsu et al. [2019], who left the general case as an open problem. In the reversible case, the analysis is greatly facilitated by the fact that the Markov operator is self-adjoint, and Weyl's inequality allows for a dimension-free perturbation analysis of the empirical eigenvalues. As Hsu et al. point out, in the absence of reversibility (which induces asymmetric pair probabilities matrices), the existing perturbation analysis has a worst-case exponential dependence on the number of states $d$. Furthermore, even if an eigenvalue perturbation analysis with better dependence on $d$ were available, in the non-reversible case the connection between the spectral gap and the mixing time is not nearly as straightforward as in the reversible case. Our key insight is to estimate the pseudo-spectral gap $γ_{\mathsf{ps}}$ instead, which allows us to overcome the loss of symmetry and to achieve a polynomial dependence on the minimal stationary probability $π_\star$ and $γ_{\mathsf{ps}}$. Additionally, in the reversible case, we obtain simultaneous nearly (up to logarithmic factors) minimax rates in $t_{\mathsf{mix}}$ and precision $\varepsilon$, closing a gap in Hsu et al., who treated $\varepsilon$ as constant in the lower bounds. Finally, we construct fully empirical confidence intervals for $γ_{\mathsf{ps}}$, which shrink to zero at a rate of roughly $1/\sqrt{m}$, and improve the state of the art in even the reversible case.
MLJan 31, 2019
Minimax Testing of Identity to a Reference Ergodic Markov ChainGeoffrey Wolfer, Aryeh Kontorovich
We exhibit an efficient procedure for testing, based on a single long state sequence, whether an unknown Markov chain is identical to or $\varepsilon$-far from a given reference chain. We obtain nearly matching (up to logarithmic factors) upper and lower sample complexity bounds for our notion of distance, which is based on total variation. Perhaps surprisingly, we discover that the sample complexity depends solely on the properties of the known reference chain and does not involve the unknown chain at all, which is not even assumed to be ergodic.
LGOct 4, 2018
Improved Generalization Bounds for Adversarially Robust LearningIdan Attias, Aryeh Kontorovich, Yishay Mansour
We consider a model of robust learning in an adversarial environment. The learner gets uncorrupted training data with access to possible corruptions that may be affected by the adversary during testing. The learner's goal is to build a robust classifier, which will be tested on future adversarial examples. The adversary is limited to $k$ possible corruptions for each input. We model the learner-adversary interaction as a zero-sum game. This model is closely related to the adversarial examples model of Schmidt et al. (2018); Madry et al. (2017). Our main results consist of generalization bounds for the binary and multiclass classification, as well as the real-valued case (regression). For the binary classification setting, we both tighten the generalization bound of Feige et al. (2015), and are also able to handle infinite hypothesis classes. The sample complexity is improved from $O(\frac{1}{ε^4}\log(\frac{|H|}δ))$ to $O\big(\frac{1}{ε^2}(kVC(H)\log^{\frac{3}{2}+α}(kVC(H))+\log(\frac{1}δ)\big)$ for any $α> 0$. Additionally, we extend the algorithm and generalization bound from the binary to the multiclass and real-valued cases. Along the way, we obtain results on fat-shattering dimension and Rademacher complexity of $k$-fold maxima over function classes; these may be of independent interest. For binary classification, the algorithm of Feige et al. (2015) uses a regret minimization algorithm and an ERM oracle as a black box; we adapt it for the multiclass and regression settings. The algorithm provides us with near-optimal policies for the players on a given training sample.
LGOct 3, 2018
Agnostic Sample Compression Schemes for RegressionIdan Attias, Steve Hanneke, Aryeh Kontorovich et al.
We obtain the first positive results for bounded sample compression in the agnostic regression setting with the $\ell_p$ loss, where $p\in [1,\infty]$. We construct a generic approximate sample compression scheme for real-valued function classes exhibiting exponential size in the fat-shattering dimension but independent of the sample size. Notably, for linear regression, an approximate compression of size linear in the dimension is constructed. Moreover, for $\ell_1$ and $\ell_\infty$ losses, we can even exhibit an efficient exact sample compression scheme of size linear in the dimension. We further show that for every other $\ell_p$ loss, $p\in (1,\infty)$, there does not exist an exact agnostic compression scheme of bounded size. This refines and generalizes a negative result of David, Moran, and Yehudayoff for the $\ell_2$ loss. We close by posing general open questions: for agnostic regression with $\ell_1$ loss, does every function class admits an exact compression scheme of size equal to its pseudo-dimension? For the $\ell_2$ loss, does every function class admit an approximate compression scheme of polynomial size in the fat-shattering dimension? These questions generalize Warmuth's classic sample compression conjecture for realizable-case classification.
MLSep 13, 2018
Statistical Estimation of Ergodic Markov Chain Kernel over Discrete State SpaceGeoffrey Wolfer, Aryeh Kontorovich
We investigate the statistical complexity of estimating the parameters of a discrete-state Markov chain kernel from a single long sequence of state observations. In the finite case, we characterize (modulo logarithmic factors) the minimax sample complexity of estimation with respect to the operator infinity norm, while in the countably infinite case, we analyze the problem with respect to a natural entry-wise norm derived from total variation. We show that in both cases, the sample complexity is governed by the mixing properties of the unknown chain, for which, in the finite-state case, there are known finite-sample estimators with fully empirical confidence intervals.
LGMay 24, 2018
Learning convex polyhedra with marginLee-Ad Gottlieb, Eran Kaufman, Aryeh Kontorovich et al.
We present an improved algorithm for {\em quasi-properly} learning convex polyhedra in the realizable PAC setting from data with a margin. Our learning algorithm constructs a consistent polyhedron as an intersection of about $t \log t$ halfspaces with constant-size margins in time polynomial in $t$ (where $t$ is the number of halfspaces forming an optimal polyhedron). We also identify distinct generalizations of the notion of margin from hyperplanes to polyhedra and investigate how they relate geometrically; this result may have ramifications beyond the learning setting.
LGMay 21, 2018
Sample Compression for Real-Valued LearnersSteve Hanneke, Aryeh Kontorovich, Menachem Sadigurschi
We give an algorithmically efficient version of the learner-to-compression scheme conversion in Moran and Yehudayoff (2016). In extending this technique to real-valued hypotheses, we also obtain an efficient regression-to-bounded sample compression converter. To our knowledge, this is the first general compressed regression result (regardless of efficiency or boundedness) guaranteeing uniform approximate reconstruction. Along the way, we develop a generic procedure for constructing weak real-valued learners out of abstract regressors; this may be of independent interest. In particular, this result sheds new light on an open question of H. Simon (1997). We show applications to two regression problems: learning Lipschitz and bounded-variation functions.
LGMay 21, 2018
A New Lower Bound for Agnostic Learning with Sample Compression SchemesSteve Hanneke, Aryeh Kontorovich
We establish a tight characterization of the worst-case rates for the excess risk of agnostic learning with sample compression schemes and for uniform convergence for agnostic sample compression schemes. In particular, we find that the optimal rates of convergence for size-$k$ agnostic sample compression schemes are of the form $\sqrt{\frac{k \log(n/k)}{n}}$, which contrasts with agnostic learning with classes of VC dimension $k$, where the optimal rates are of the form $\sqrt{\frac{k}{n}}$.
CRNov 6, 2017
Advanced Analytics for Connected Cars Cyber SecurityMatan Levi, Yair Allouche, Aryeh Kontorovich
The vehicular connectivity revolution is fueling the automotive industry's most significant transformation seen in decades. However, as modern vehicles become more connected, they also become much more vulnerable to cyber-attacks. In this paper, a fully working machine learning approach is proposed to protect connected vehicles (fleets and individuals) against such attacks. We present a system that monitors different vehicle's interfaces (Network, CAN and OS), extracts relevant information based on configurable rules and sends it to a trained generative model to detect deviations from normal behavior. Using configurable data collector, we provide a higher level of data abstraction as the model is trained based on events instead of raw data, which has a noise-filtering effect and eliminates the need to retrain the model whenever a protocol changes. We present a new approach for detecting anomalies, tailored to the temporal nature of our domain. Adapting the hybrid approach of Gutflaish et al. (2017) to the fully temporal setting, we first train a Hidden Markov Model to learn normal vehicle behavior, and then a regression model to calibrate the likelihood threshold for anomaly. Using this architecture, our method detects sophisticated and realistic anomalies, which are missed by other existing methods monitoring the CAN bus only. We also demonstrate the superiority of adaptive thresholds over static ones. Furthermore, our approach scales efficiently from monitoring individual cars to serving large fleets. We demonstrate the competitive advantage of our model via encouraging empirical results.
STAug 24, 2017
Mixing time estimation in reversible Markov chains from a single sample pathDaniel Hsu, Aryeh Kontorovich, David A. Levin et al.
The spectral gap $γ$ of a finite, ergodic, and reversible Markov chain is an important parameter measuring the asymptotic rate of convergence. In applications, the transition matrix $P$ may be unknown, yet one sample of the chain up to a fixed time $n$ may be observed. We consider here the problem of estimating $γ$ from this data. Let $π$ be the stationary distribution of $P$, and $π_\star = \min_x π(x)$. We show that if $n = \tilde{O}\bigl(\frac{1}{γπ_\star}\bigr)$, then $γ$ can be estimated to within multiplicative constants with high probability. When $π$ is uniform on $d$ states, this matches (up to logarithmic correction) a lower bound of $\tildeΩ\bigl(\frac{d}γ\bigr)$ steps required for precise estimation of $γ$. Moreover, we provide the first procedure for computing a fully data-dependent interval, from a single finite-length trajectory of the chain, that traps the mixing time $t_{\text{mix}}$ of the chain at a prescribed confidence level. The interval does not require the knowledge of any parameters of the chain. This stands in contrast to previous approaches, which either only provide point estimates, or require a reset mechanism, or additional prior knowledge. The interval is constructed around the relaxation time $t_{\text{relax}} = 1/γ$, which is strongly related to the mixing time, and the width of the interval converges to zero roughly at a $1/\sqrt{n}$ rate, where $n$ is the length of the sample path.
LGMay 23, 2017
Nearest-Neighbor Sample Compression: Efficiency, Consistency, Infinite DimensionsAryeh Kontorovich, Sivan Sabato, Roi Weiss
We examine the Bayes-consistency of a recently proposed 1-nearest-neighbor-based multiclass learning algorithm. This algorithm is derived from sample compression bounds and enjoys the statistical advantages of tight, fully empirical generalization bounds, as well as the algorithmic advantages of a faster runtime and memory savings. We prove that this algorithm is strongly Bayes-consistent in metric spaces with finite doubling dimension --- the first consistency result for an efficient nearest-neighbor sample compression scheme. Rather surprisingly, we discover that this algorithm continues to be Bayes-consistent even in a certain infinite-dimensional setting, in which the basic measure-theoretic conditions on which classic consistency proofs hinge are violated. This is all the more surprising, since it is known that $k$-NN is not Bayes-consistent in this setting. We pose several challenging open problems for future research.
LGJun 29, 2016
Exact Lower Bounds for the Agnostic Probably-Approximately-Correct (PAC) Machine Learning ModelAryeh Kontorovich, Iosif Pinelis
We provide an exact non-asymptotic lower bound on the minimax expected excess risk (EER) in the agnostic probably-ap\-proximately-correct (PAC) machine learning classification model and identify minimax learning algorithms as certain maximally symmetric and minimally randomized "voting" procedures. Based on this result, an exact asymptotic lower bound on the minimax EER is provided. This bound is of the simple form $c_\infty/\sqrtν$ as $ν\to\infty$, where $c_\infty=0.16997\dots$ is a universal constant, $ν=m/d$, $m$ is the size of the training sample, and $d$ is the Vapnik--Chervonenkis dimension of the hypothesis class. It is shown that the differences between these asymptotic and non-asymptotic bounds, as well as the differences between these two bounds and the maximum EER of any learning algorithms that minimize the empirical risk, are asymptotically negligible, and all these differences are due to ties in the mentioned "voting" procedures. A few easy to compute non-asymptotic lower bounds on the minimax EER are also obtained, which are shown to be close to the exact asymptotic lower bound $c_\infty/\sqrtν$ even for rather small values of the ratio $ν=m/d$. As an application of these results, we substantially improve existing lower bounds on the tail probability of the excess risk. Among the tools used are Bayes estimation and apparently new identities and inequalities for binomial distributions.
LGMay 22, 2016
Active Nearest-Neighbor Learning in Metric SpacesAryeh Kontorovich, Sivan Sabato, Ruth Urner
We propose a pool-based non-parametric active learning algorithm for general metric spaces, called MArgin Regularized Metric Active Nearest Neighbor (MARMANN), which outputs a nearest-neighbor classifier. We give prediction error guarantees that depend on the noisy-margin properties of the input sample, and are competitive with those obtained by previously proposed passive learners. We prove that the label complexity of MARMANN is significantly lower than that of any passive learner with similar error guarantees. MARMANN is based on a generalized sample compression scheme, and a new label-efficient active model-selection procedure.
LGJun 9, 2015
Mixing Time Estimation in Reversible Markov Chains from a Single Sample PathDaniel Hsu, Aryeh Kontorovich, Csaba Szepesvári
This article provides the first procedure for computing a fully data-dependent interval that traps the mixing time $t_{\text{mix}}$ of a finite reversible ergodic Markov chain at a prescribed confidence level. The interval is computed from a single finite-length sample path from the Markov chain, and does not require the knowledge of any parameters of the chain. This stands in contrast to previous approaches, which either only provide point estimates, or require a reset mechanism, or additional prior knowledge. The interval is constructed around the relaxation time $t_{\text{relax}}$, which is strongly related to the mixing time, and the width of the interval converges to zero roughly at a $\sqrt{n}$ rate, where $n$ is the length of the sample path. Upper and lower bounds are given on the number of samples required to achieve constant-factor multiplicative accuracy. The lower bounds indicate that, unless further restrictions are placed on the chain, no procedure can achieve this accuracy level before seeing each state at least $Ω(t_{\text{relax}})$ times on the average. Finally, future directions of research are identified.
LGFeb 22, 2015
Nearly optimal classification for semimetricsLee-Ad Gottlieb, Aryeh Kontorovich
We initiate the rigorous study of classification in semimetric spaces, which are point sets with a distance function that is non-negative and symmetric, but need not satisfy the triangle inequality. For metric spaces, the doubling dimension essentially characterizes both the runtime and sample complexity of classification algorithms --- yet we show that this is not the case for semimetrics. Instead, we define the {\em density dimension} and discover that it plays a central role in the statistical and algorithmic feasibility of learning in semimetric spaces. We present nearly optimal sample compression algorithms and use these to obtain generalization guarantees, including fast rates. The latter hold for general sample compression schemes and may be of independent interest.
LGJul 1, 2014
A Bayes consistent 1-NN classifierAryeh Kontorovich, Roi Weiss
We show that a simple modification of the 1-nearest neighbor classifier yields a strongly Bayes consistent learner. Prior to this work, the only strongly Bayes consistent proximity-based method was the k-nearest neighbor classifier, for k growing appropriately with sample size. We will argue that a margin-regularized 1-NN enjoys considerable statistical and algorithmic advantages over the k-NN classifier. These include user-friendly finite-sample error bounds, as well as time- and memory-efficient learning and test-point evaluation algorithms with a principled speed-accuracy tradeoff. Encouraging empirical results are reported.
LGApr 13, 2014
Near-optimal sample compression for nearest neighborsLee-Ad Gottlieb, Aryeh Kontorovich, Pinhas Nisnevitch
We present the first sample compression algorithm for nearest neighbors with non-trivial performance guarantees. We complement these guarantees by demonstrating almost matching hardness lower bounds, which show that our bound is nearly optimal. Our result yields new insight into margin-based nearest neighbor classification in metric spaces and allows us to significantly sharpen and simplify existing bounds. Some encouraging empirical results are also presented.
LGJan 30, 2014
Maximum Margin Multiclass Nearest NeighborsAryeh Kontorovich, Roi Weiss
We develop a general framework for margin-based multicategory classification in metric spaces. The basic work-horse is a margin-regularized version of the nearest-neighbor classifier. We prove generalization bounds that match the state of the art in sample size $n$ and significantly improve the dependence on the number of classes $k$. Our point of departure is a nearly Bayes-optimal finite-sample risk bound independent of $k$. Although $k$-free, this bound is unregularized and non-adaptive, which motivates our main result: Rademacher and scale-sensitive margin bounds with a logarithmic dependence on $k$. As the best previous risk estimates in this setting were of order $\sqrt k$, our bound is exponentially sharper. From the algorithmic standpoint, in doubling metric spaces our classifier may be trained on $n$ examples in $O(n^2\log n)$ time and evaluated on new points in $O(\log n)$ time.
PRDec 2, 2013
Consistency of weighted majority votesDaniel Berend, Aryeh Kontorovich
We revisit the classical decision-theoretic problem of weighted expert voting from a statistical learning perspective. In particular, we examine the consistency (both asymptotic and finitary) of the optimal Nitzan-Paroush weighted majority and related rules. In the case of known expert competence levels, we give sharp error estimates for the optimal rule. When the competence levels are unknown, they must be empirically estimated. We provide frequentist and Bayesian analyses for this situation. Some of our proof techniques are non-standard and may be of independent interest. The bounds we derive are nearly optimal, and several challenging open problems are posed. Experimental results are provided to illustrate the theory.
MLSep 19, 2013
Predictive PAC Learning and Process DecompositionsCosma Rohilla Shalizi, Aryeh Kontorovich
We informally call a stochastic process learnable if it admits a generalization error approaching zero in probability for any concept class with finite VC-dimension (IID processes are the simplest example). A mixture of learnable processes need not be learnable itself, and certainly its generalization error need not decay at the same rate. In this paper, we argue that it is natural in predictive PAC to condition not on the past observations but on the mixture component of the sample path. This definition not only matches what a realistic learner might demand, but also allows us to sidestep several otherwise grave problems in learning from dependent data. In particular, we give a novel PAC generalization bound for mixtures of learnable processes with a generalization error that is not worse than that of each mixture component. We also provide a characterization of mixtures of absolutely regular ($β$-mixing) processes, of independent probability-theoretic interest.
PRSep 4, 2013
Concentration in unbounded metric spaces and algorithmic stabilityAryeh Kontorovich
We prove an extension of McDiarmid's inequality for metric spaces with unbounded diameter. To this end, we introduce the notion of the {\em subgaussian diameter}, which is a distribution-dependent refinement of the metric diameter. Our technique provides an alternative approach to that of Kutin and Niyogi's method of weakly difference-bounded functions, and yields nontrivial, dimension-free results in some interesting cases where the former does not. As an application, we give apparently the first generalization bound in the algorithmic stability setting that holds for unbounded loss functions. We furthermore extend our concentration inequality to strongly mixing processes.
LGJun 11, 2013
Efficient Classification for Metric DataLee-Ad Gottlieb, Aryeh Kontorovich, Robert Krauthgamer
Recent advances in large-margin classification of data residing in general metric spaces (rather than Hilbert spaces) enable classification under various natural metrics, such as string edit and earthmover distance. A general framework developed for this purpose by von Luxburg and Bousquet [JMLR, 2004] left open the questions of computational efficiency and of providing direct bounds on generalization error. We design a new algorithm for classification in general metric spaces, whose runtime and accuracy depend on the doubling dimension of the data points, and can thus achieve superior classification performance in many common scenarios. The algorithmic core of our approach is an approximate (rather than exact) solution to the classical problems of Lipschitz extension and of Nearest Neighbor Search. The algorithm's generalization performance is guaranteed via the fat-shattering dimension of Lipschitz classifiers, and we present experimental evidence of its superiority to some common kernel methods. As a by-product, we offer a new perspective on the nearest neighbor classifier, which yields significantly sharper risk asymptotics than the classic analysis of Cover and Hart [IEEE Trans. Info. Theory, 1967].
LGFeb 25, 2013
On learning parametric-output HMMsAryeh Kontorovich, Boaz Nadler, Roi Weiss
We present a novel approach for learning an HMM whose outputs are distributed according to a parametric family. This is done by {\em decoupling} the learning task into two steps: first estimating the output parameters, and then estimating the hidden states transition probabilities. The first step is accomplished by fitting a mixture model to the output stationary distribution. Given the parameters of this mixture model, the second step is formulated as the solution of an easily solvable convex quadratic program. We provide an error analysis for the estimated transition probabilities and show they are robust to small perturbations in the estimates of the mixture parameters. Finally, we support our analysis with some encouraging empirical results.
LGFeb 12, 2013
Adaptive Metric Dimensionality ReductionLee-Ad Gottlieb, Aryeh Kontorovich, Robert Krauthgamer
We study adaptive data-dependent dimensionality reduction in the context of supervised learning in general metric spaces. Our main statistical contribution is a generalization bound for Lipschitz functions in metric spaces that are doubling, or nearly doubling. On the algorithmic front, we describe an analogue of PCA for metric spaces: namely an efficient procedure that approximates the data's intrinsic dimension, which is often much lower than the ambient dimension. Our approach thus leverages the dual benefits of low dimensionality: (1) more efficient algorithms, e.g., for proximity search, and (2) more optimistic generalization bounds.