LGAug 31, 2022
Fine-Grained Distribution-Dependent Learning CurvesOlivier Bousquet, Steve Hanneke, Shay Moran et al. · mit
Learning curves plot the expected error of a learning algorithm as a function of the number of labeled samples it receives from a target distribution. They are widely used as a measure of an algorithm's performance, but classic PAC learning theory cannot explain their behavior. As observed by Antos and Lugosi (1996 , 1998), the classic `No Free Lunch' lower bounds only trace the upper envelope above all learning curves of specific target distributions. For a concept class with VC dimension $d$ the classic bound decays like $d/n$, yet it is possible that the learning curve for \emph{every} specific distribution decays exponentially. In this case, for each $n$ there exists a different `hard' distribution requiring $d/n$ samples. Antos and Lugosi asked which concept classes admit a `strong minimax lower bound' -- a lower bound of $d'/n$ that holds for a fixed distribution for infinitely many $n$. We solve this problem in a principled manner, by introducing a combinatorial dimension called VCL that characterizes the best $d'$ for which $d'/n$ is a strong minimax lower bound. Our characterization strengthens the lower bounds of Bousquet, Hanneke, Moran, van Handel, and Yehudayoff (2021), and it refines their theory of learning curves, by showing that for classes with finite VCL the learning rate can be decomposed into a linear component that depends only on the hypothesis class and an exponential component that depends also on the target distribution. As a corollary, we recover the lower bound of Antos and Lugosi (1996 , 1998) for half-spaces in $\mathbb{R}^d$. Finally, to provide another viewpoint on our work and how it compares to traditional PAC learning bounds, we also present an alternative formulation of our results in a language that is closer to the PAC setting.
LGJul 3, 2021
A Generalized Lottery Ticket HypothesisIbrahim Alabdulmohsin, Larisa Markeeva, Daniel Keysers et al.
We introduce a generalization to the lottery ticket hypothesis in which the notion of "sparsity" is relaxed by choosing an arbitrary basis in the space of parameters. We present evidence that the original results reported for the canonical basis continue to hold in this broader setting. We describe how structured pruning methods, including pruning units or factorizing fully-connected layers into products of low-rank matrices, can be cast as particular instances of this "generalized" lottery ticket hypothesis. The investigations reported here are preliminary and are provided to encourage further research along this direction.
CVMay 4, 2021
MLP-Mixer: An all-MLP Architecture for VisionIlya Tolstikhin, Neil Houlsby, Alexander Kolesnikov et al.
Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them are necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. "mixing" the per-location features), and one with MLPs applied across patches (i.e. "mixing" spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models. We hope that these results spark further research beyond the realms of well established CNNs and Transformers.
MLJun 18, 2020
What Do Neural Networks Learn When Trained With Random Labels?Hartmut Maennel, Ibrahim Alabdulmohsin, Ilya Tolstikhin et al.
We study deep neural networks (DNNs) trained on natural image data with entirely random labels. Despite its popularity in the literature, where it is often used to study memorization, generalization, and other phenomena, little is known about what DNNs learn in this setting. In this paper, we show analytically for convolutional and fully connected networks that an alignment between the principal components of network parameters and data takes place when training with random labels. We study this alignment effect by investigating neural networks pre-trained on randomly labelled image data and subsequently fine-tuned on disjoint datasets with random or real labels. We show how this alignment produces a positive transfer: networks pre-trained with random labels train faster downstream compared to training from scratch even after accounting for simple effects, such as weight scaling. We analyze how competing effects, such as specialization at later layers, may hide the positive transfer. These effects are studied in several network architectures, including VGG16 and ResNet18, on CIFAR10 and ImageNet.
MLFeb 26, 2020
Predicting Neural Network Accuracy from WeightsThomas Unterthiner, Daniel Keysers, Sylvain Gelly et al.
We show experimentally that the accuracy of a trained neural network can be predicted surprisingly well by looking only at its weights, without evaluating it on input data. We motivate this task and introduce a formal setting for it. Even when using simple statistics of the weights, the predictors are able to rank neural networks by their performance with very high accuracy (R2 score more than 0.98). Furthermore, the predictors are able to rank networks trained on different, unobserved datasets and with different architectures. We release a collection of 120k convolutional neural networks trained on four different datasets to encourage further research in this area, with the goal of understanding network training and performance better.
LGMay 28, 2019
When can unlabeled data improve the learning rate?Christina Göpfert, Shai Ben-David, Olivier Bousquet et al.
In semi-supervised classification, one is given access both to labeled and unlabeled data. As unlabeled data is typically cheaper to acquire than labeled data, this setup becomes advantageous as soon as one can exploit the unlabeled data in order to produce a better classifier than with labeled data alone. However, the conditions under which such an improvement is possible are not fully understood yet. Our analysis focuses on improvements in the minimax learning rate in terms of the number of labeled examples (with the number of unlabeled examples being allowed to depend on the number of labeled ones). We argue that for such improvements to be realistic and indisputable, certain specific conditions should be satisfied and previous analyses have failed to meet those conditions. We then demonstrate examples where these conditions can be met, in particular showing rate changes from $1/\sqrt{\ell}$ to $e^{-c\ell}$ and from $1/\sqrt{\ell}$ to $1/\ell$. These results improve our understanding of what is and isn't possible in semi-supervised learning.
MLMay 27, 2019
Practical and Consistent Estimation of f-DivergencesPaul K. Rubenstein, Olivier Bousquet, Josip Djolonga et al.
The estimation of an f-divergence between two probability distributions based on samples is a fundamental problem in statistics and machine learning. Most works study this problem under very weak assumptions, in which case it is provably hard. We consider the case of stronger structural assumptions that are commonly satisfied in modern machine learning, including representation learning and generative modelling with autoencoder architectures. Under these assumptions we propose and study an estimator that can be easily implemented, works well in high dimensions, and enjoys faster rates of convergence. We verify the behavior of our estimator empirically in both synthetic and real-data experiments, and discuss its direct implications for total correlation, entropy, and mutual information estimation.
GNJan 30, 2019
GeNet: Deep Representations for MetagenomicsMateo Rojas-Carulla, Ilya Tolstikhin, Guillermo Luque et al.
We introduce GeNet, a method for shotgun metagenomic classification from raw DNA sequences that exploits the known hierarchical structure between labels for training. We provide a comparison with state-of-the-art methods Kraken and Centrifuge on datasets obtained from several sequencing technologies, in which dataset shift occurs. We show that GeNet obtains competitive precision and good recall, with orders of magnitude less memory requirements. Moreover, we show that a linear model trained on top of representations learned by GeNet achieves recall comparable to state-of-the-art methods on the aforementioned datasets, and achieves over 90% accuracy in a challenging pathogen detection problem. This provides evidence of the usefulness of the representations learned by GeNet for downstream biological tasks.
LGApr 30, 2018
Competitive Training of Mixtures of Independent Deep Generative ModelsFrancesco Locatello, Damien Vincent, Ilya Tolstikhin et al.
A common assumption in causal modeling posits that the data is generated by a set of independent mechanisms, and algorithms should aim to recover this structure. Standard unsupervised learning, however, is often concerned with training a single model to capture the overall distribution or aspects thereof. Inspired by clustering approaches, we consider mixtures of implicit generative models that ``disentangle'' the independent generative mechanisms underlying the data. Relying on an additional set of discriminators, we propose a competitive training procedure in which the models only need to capture the portion of the data distribution from which they can produce realistic samples. As a by-product, each model is simpler and faster to train. We empirically show that our approach splits the training distribution in a sensible way and increases the quality of the generated samples.
MLFeb 11, 2018
On the Latent Space of Wasserstein Auto-EncodersPaul K. Rubenstein, Bernhard Schoelkopf, Ilya Tolstikhin
We study the role of latent space dimensionality in Wasserstein auto-encoders (WAEs). Through experimentation on synthetic and real datasets, we argue that random encoders should be preferred over deterministic encoders. We highlight the potential of WAEs for representation learning with promising results on a benchmark disentanglement task.
MLNov 5, 2017
Wasserstein Auto-EncodersIlya Tolstikhin, Olivier Bousquet, Sylvain Gelly et al.
We propose the Wasserstein Auto-Encoder (WAE)---a new algorithm for building a generative model of the data distribution. WAE minimizes a penalized form of the Wasserstein distance between the model distribution and the target distribution, which leads to a different regularizer than the one used by the Variational Auto-Encoder (VAE). This regularizer encourages the encoded training distribution to match the prior. We compare our algorithm with several other techniques and show that it is a generalization of adversarial auto-encoders (AAE). Our experiments show that WAE shares many of the properties of VAEs (stable training, encoder-decoder architecture, nice latent manifold structure) while generating samples of better quality, as measured by the FID score.
MLOct 4, 2017
Differentially Private Database Release via Kernel Mean EmbeddingsMatej Balog, Ilya Tolstikhin, Bernhard Schölkopf
We lay theoretical foundations for new database release mechanisms that allow third-parties to construct consistent estimators of population statistics, while ensuring that the privacy of each individual contributing to the database is protected. The proposed framework rests on two main ideas. First, releasing (an estimate of) the kernel mean embedding of the data generating random variable instead of the database itself still allows third-parties to construct consistent estimators of a wide class of population statistics. Second, the algorithm can satisfy the definition of differential privacy by basing the released kernel mean embedding on entirely synthetic data points, while controlling accuracy through the metric available in a Reproducing Kernel Hilbert Space. We describe two instantiations of the proposed framework, suitable under different scenarios, and prove theoretical results guaranteeing differential privacy of the resulting algorithms and the consistency of estimators constructed from their outputs.
MLJun 30, 2017
Probabilistic Active Learning of Functions in Structural Causal ModelsPaul K. Rubenstein, Ilya Tolstikhin, Philipp Hennig et al.
We consider the problem of learning the functions computing children from parents in a Structural Causal Model once the underlying causal graph has been identified. This is in some sense the second step after causal discovery. Taking a probabilistic approach to estimating these functions, we derive a natural myopic active learning scheme that identifies the intervention which is optimally informative about all of the unknown functions jointly, given previously observed data. We test the derived algorithms on simple examples, to demonstrate that they produce a structured exploration policy that significantly improves on unstructured base-lines.
MLMay 22, 2017
From optimal transport to generative modeling: the VEGAN cookbookOlivier Bousquet, Sylvain Gelly, Ilya Tolstikhin et al.
We study unsupervised generative modeling in terms of the optimal transport (OT) problem between true (but unknown) data distribution $P_X$ and the latent variable model distribution $P_G$. We show that the OT problem can be equivalently written in terms of probabilistic encoders, which are constrained to match the posterior and prior distributions over the latent space. When relaxed, this constrained optimization problem leads to a penalized optimal transport (POT) objective, which can be efficiently minimized using stochastic gradient descent by sampling from $P_X$ and $P_G$. We show that POT for the 2-Wasserstein distance coincides with the objective heuristically employed in adversarial auto-encoders (AAE) (Makhzani et al., 2016), which provides the first theoretical justification for AAEs known to the authors. We also compare POT to other popular techniques like variational auto-encoders (VAE) (Kingma and Welling, 2014). Our theoretical results include (a) a better understanding of the commonly observed blurriness of images generated by VAEs, and (b) establishing duality between Wasserstein GAN (Arjovsky and Bottou, 2017) and POT for the 1-Wasserstein distance.
MLJan 9, 2017
AdaGAN: Boosting Generative ModelsIlya Tolstikhin, Sylvain Gelly, Olivier Bousquet et al.
Generative Adversarial Networks (GAN) (Goodfellow et al., 2014) are an effective method for training generative models of complex data such as natural images. However, they are notoriously hard to train and can suffer from the problem of missing modes where the model is not able to produce examples in certain regions of the space. We propose an iterative procedure, called AdaGAN, where at every step we add a new component into a mixture model by running a GAN algorithm on a reweighted sample. This is inspired by boosting algorithms, where many potentially weak individual predictors are greedily aggregated to form a strong composite predictor. We prove that such an incremental procedure leads to convergence to the true distribution in a finite number of steps if each step is optimal, and convergence at an exponential rate otherwise. We also illustrate experimentally that this procedure addresses the problem of missing modes.
MLOct 19, 2016
Consistent Kernel Mean Estimation for Functions of Random VariablesCarl-Johann Simon-Gabriel, Adam Ścibior, Ilya Tolstikhin et al.
We provide a theoretical foundation for non-parametric estimation of functions of random variables using kernel mean embeddings. We show that for any continuous function $f$, consistent estimators of the mean embedding of a random variable $X$ lead to consistent estimators of the mean embedding of $f(X)$. For Matérn kernels and sufficiently smooth functions we also provide rates of convergence. Our results extend to functions of multiple random variables. If the variables are dependent, we require an estimator of the mean embedding of their joint distribution as a starting point; if they are independent, it is sufficient to have separate estimators of the mean embeddings of their marginal distributions. In either case, our results cover both mean embeddings based on i.i.d. samples as well as "reduced set" expansions in terms of dependent expansion points. The latter serves as a justification for using such expansions to limit memory resources when applying the approach as a basis for probabilistic programming.
MLFeb 9, 2016
Minimax Lower Bounds for Realizable Transductive ClassificationIlya Tolstikhin, David Lopez-Paz
Transductive learning considers a training set of $m$ labeled samples and a test set of $u$ unlabeled samples, with the goal of best labeling that particular test set. Conversely, inductive learning considers a training set of $m$ labeled samples drawn iid from $P(X,Y)$, with the goal of best labeling any future samples drawn iid from $P(X)$. This comparison suggests that transduction is a much easier type of inference than induction, but is this really the case? This paper provides a negative answer to this question, by proving the first known minimax lower bounds for transductive, realizable, binary classification. Our lower bounds show that $m$ should be at least $Ω(d/ε+ \log(1/δ)/ε)$ when $ε$-learning a concept class $\mathcal{H}$ of finite VC-dimension $d<\infty$ with confidence $1-δ$, for all $m \leq u$. This result draws three important conclusions. First, general transduction is as hard as general induction, since both problems have $Ω(d/m)$ minimax values. Second, the use of unlabeled data does not help general transduction, since supervised learning algorithms such as ERM and (Hanneke, 2015) match our transductive lower bounds while ignoring the unlabeled test set. Third, our transductive lower bounds imply lower bounds for semi-supervised learning, which add to the important discussion about the role of unlabeled data in machine learning.
MLMay 12, 2015
Permutational Rademacher Complexity: a New Complexity Measure for Transductive LearningIlya Tolstikhin, Nikita Zhivotovskiy, Gilles Blanchard
Transductive learning considers situations when a learner observes $m$ labelled training points and $u$ unlabelled test points with the final goal of giving correct answers for the test points. This paper introduces a new complexity measure for transductive learning called Permutational Rademacher Complexity (PRC) and studies its properties. A novel symmetrization inequality is proved, which shows that PRC provides a tighter control over expected suprema of empirical processes compared to what happens in the standard i.i.d. setting. A number of comparison results are also provided, which show the relation between PRC and other popular complexity measures used in statistical learning theory, including Rademacher complexity and Transductive Rademacher Complexity (TRC). We argue that PRC is a more suitable complexity measure for transductive learning. Finally, these results are combined with a standard concentration argument to provide novel data-dependent risk bounds for transductive learning.
MLFeb 9, 2015
Towards a Learning Theory of Cause-Effect InferenceDavid Lopez-Paz, Krikamol Muandet, Bernhard Schölkopf et al.
We pose causal inference as the problem of learning to classify probability distributions. In particular, we assume access to a collection $\{(S_i,l_i)\}_{i=1}^n$, where each $S_i$ is a sample drawn from the probability distribution of $X_i \times Y_i$, and $l_i$ is a binary label indicating whether "$X_i \to Y_i$" or "$X_i \leftarrow Y_i$". Given these data, we build a causal inference rule in two steps. First, we featurize each $S_i$ using the kernel mean embedding associated with some characteristic kernel. Second, we train a binary classifier on such embeddings to distinguish between causal directions. We present generalization bounds showing the statistical consistency and learning rates of the proposed approach, and provide a simple implementation that achieves state-of-the-art cause-effect inference. Furthermore, we extend our ideas to infer causal relationships between more than two variables.
MLNov 26, 2014
Localized Complexities for Transductive LearningIlya Tolstikhin, Gilles Blanchard, Marius Kloft
We show two novel concentration inequalities for suprema of empirical processes when sampling without replacement, which both take the variance of the functions into account. While these inequalities may potentially have broad applications in learning theory in general, we exemplify their significance by studying the transductive setting of learning theory. For which we provide the first excess risk bounds based on the localized complexity of the hypothesis class, which can yield fast rates of convergence also in the transductive learning setting. We give a preliminary analysis of the localized complexities for the prominent case of kernel classes.