Robert C. Williamson

LG
h-index10
38papers
933citations
Novelty50%
AI Score43

38 Papers

LGJun 7, 2022
Risk Measures and Upper Probabilities: Coherence and Stratification

Christian Fröhlich, Robert C. Williamson

Machine learning typically presupposes classical probability theory which implies that aggregation is built upon expectation. There are now multiple reasons to motivate looking at richer alternatives to classical probability theory as a mathematical foundation for machine learning. We systematically examine a powerful and rich class of alternative aggregation functionals, known variously as spectral risk measures, Choquet integrals or Lorentz norms. We present a range of characterization results, and demonstrate what makes this spectral family so special. In doing so we arrive at a natural stratification of all coherent risk measures in terms of the upper probabilities that they induce by exploiting results from the theory of rearrangement invariant Banach spaces. We empirically demonstrate how this new approach to uncertainty helps tackling practical machine learning problems.

LGJun 26, 2023
Insights From Insurance for Fair Machine Learning

Christian Fröhlich, Robert C. Williamson

We argue that insurance can act as an analogon for the social situatedness of machine learning systems, hence allowing machine learning scholars to take insights from the rich and interdisciplinary insurance literature. Tracing the interaction of uncertainty, fairness and responsibility in insurance provides a fresh perspective on fairness in machine learning. We link insurance fairness conceptions to their machine learning relatives, and use this bridge to problematize fairness as calibration. In this process, we bring to the forefront two themes that have been largely overlooked in the machine learning literature: responsibility and aggregate-individual tensions.

LGSep 1, 2022
The Geometry and Calculus of Losses

Robert C. Williamson, Zac Cranko

Statistical decision problems lie at the heart of statistical machine learning. The simplest problems are binary and multiclass classification and class probability estimation. Central to their definition is the choice of loss function, which is the means by which the quality of a solution is evaluated. In this paper we systematically develop the theory of loss functions for such problems from a novel perspective whose basic ingredients are convex sets with a particular structure. The loss function is defined as the subgradient of the support function of the convex set. It is consequently automatically proper (calibrated for probability estimation). This perspective provides three novel opportunities. It enables the development of a fundamental relationship between losses and (anti)-norms that appears to have not been noticed before. Second, it enables the development of a calculus of losses induced by the calculus of convex sets which allows the interpolation between different losses, and thus is a potential useful design tool for tailoring losses to particular problems. In doing this we build upon, and considerably extend existing results on $M$-sums of convex sets. Third, the perspective leads to a natural theory of ``polar'' loss functions, which are derived from the polar dual of the convex set defining the loss, and which form a natural universal substitution function for Vovk's aggregating algorithm.

LGMay 19, 2022
What killed the Convex Booster ?

Yishay Mansour, Richard Nock, Robert C. Williamson

A landmark negative result of Long and Servedio established a worst-case spectacular failure of a supervised learning trio (loss, algorithm, model) otherwise praised for its high precision machinery. Hundreds of papers followed up on the two suspected culprits: the loss (for being convex) and/or the algorithm (for fitting a classical boosting blueprint). Here, we call to the half-century+ founding theory of losses for class probability estimation (properness), an extension of Long and Servedio's results and a new general boosting algorithm to demonstrate that the real culprit in their specific context was in fact the (linear) model class. We advocate for a more general stanpoint on the problem as we argue that the source of the negative result lies in the dark side of a pervasive -- and otherwise prized -- aspect of ML: \textit{parameterisation}.

LGAug 5, 2022
Tailoring to the Tails: Risk Measures for Fine-Grained Tail Sensitivity

Christian Fröhlich, Robert C. Williamson

Expected risk minimization (ERM) is at the core of many machine learning systems. This means that the risk inherent in a loss distribution is summarized using a single number - its average. In this paper, we propose a general approach to construct risk measures which exhibit a desired tail sensitivity and may replace the expectation operator in ERM. Our method relies on the specification of a reference distribution with a desired tail behaviour, which is in a one-to-one correspondence to a coherent upper probability. Any risk measure, which is compatible with this upper probability, displays a tail sensitivity which is finely tuned to the reference distribution. As a concrete example, we focus on divergence risk measures based on f-divergence ambiguity sets, which are a widespread tool used to foster distributional robustness of machine learning systems. For instance, we show how ambiguity sets based on the Kullback-Leibler divergence are intricately tied to the class of subexponential random variables. We elaborate the connection of divergence risk measures and rearrangement invariant Banach norms.

LGJul 25, 2022
Information Processing Equalities and the Information-Risk Bridge

Robert C. Williamson, Zac Cranko

We introduce two new classes of measures of information for statistical experiments which generalise and subsume $φ$-divergences, integral probability metrics, $\mathfrak{N}$-distances (MMD), and $(f,Γ)$ divergences between two or more distributions. This enables us to derive a simple geometrical relationship between measures of information and the Bayes risk of a statistical decision problem, thus extending the variational $φ$-divergence representation to multiple distributions in an entirely symmetric manner. The new families of divergence are closed under the action of Markov operators which yields an information processing equality which is a refinement and generalisation of the classical data processing inequality. This equality gives insight into the significance of the choice of the hypothesis class in classical risk minimization.

LGFeb 23, 2023
The Geometry of Mixability

Armando J. Cabrera Pacheco, Robert C. Williamson

Mixable loss functions are of fundamental importance in the context of prediction with expert advice in the online setting since they characterize fast learning rates. By re-interpreting properness from the point of view of differential geometry, we provide a simple geometric characterization of mixability for the binary and multi-class cases: a proper loss function $\ell$ is $η$-mixable if and only if the superpredition set $\textrm{spr}(η\ell)$ of the scaled loss function $η\ell$ slides freely inside the superprediction set $\textrm{spr}(\ell_{\log})$ of the log loss $\ell_{\log}$, under fairly general assumptions on the differentiability of $\ell$. Our approach provides a way to treat some concepts concerning loss functions (like properness) in a ''coordinate-free'' manner and reconciles previous results obtained for mixable loss functions for the binary and the multi-class cases.

LGJul 27, 2022
Fairness and Randomness in Machine Learning: Statistical Independence and Relativization

Rabanus Derr, Robert C. Williamson

Fair Machine Learning endeavors to prevent unfairness arising in the context of machine learning applications embedded in society. Despite the variety of definitions of fairness and proposed "fair algorithms", there remain unresolved conceptual problems regarding fairness. In this paper, we dissect the role of statistical independence in fairness and randomness notions regularly used in machine learning. Thereby, we are led to a suprising hypothesis: randomness and fairness can be considered equivalent concepts in machine learning. In particular, we obtain a relativized notion of randomness expressed as statistical independence by appealing to Von Mises' century-old foundations for probability. This notion turns out to be "orthogonal" in an abstract sense to the commonly used i.i.d.-randomness. Using standard fairness notions in machine learning, which are defined via statistical independence, we then link the ex ante randomness assumptions about the data to the ex post requirements for fair predictions. This connection proves fruitful: we use it to argue that randomness and fairness are essentially relative and that both concepts should reflect their nature as modeling assumptions in machine learning.

LGJul 17, 2023
Corruptions of Supervised Learning Problems: Typology and Mitigations

Laura Iacovissi, Nan Lu, Robert C. Williamson

Corruption is notoriously widespread in data collection. Despite extensive research, the existing literature predominantly focuses on specific settings and learning scenarios, lacking a unified view of corruption modelization and mitigation. In this work, we develop a general theory of corruption, which incorporates all modifications to a supervised learning problem, including changes in model class and loss. Focusing on changes to the underlying probability distributions via Markov kernels, our approach leads to three novel opportunities. First, it enables the construction of a novel, provably exhaustive corruption framework, distinguishing among different corruption types. This serves to unify existing models and establish a consistent nomenclature. Second, it facilitates a systematic analysis of corruption's consequences on learning tasks, by comparing Bayes risks in the clean and corrupted scenarios. Notably, while label corruptions affect only the loss function, attribute corruptions additionally influence the hypothesis class. Third, building upon these results, we investigate mitigations for various corruption types. We expand existing loss-correction methods for label corruption to handle dependent corruption types. Our findings highlight the necessity to generalize the classical corruption-corrected learning framework to a new paradigm with weaker requirements to encompass more corruption types. We provide such a paradigm as well as loss correction formulas in the attribute and joint corruption cases.

CLJul 8, 2024
Limits to Predicting Online Speech Using Large Language Models

Mina Remeli, Moritz Hardt, Robert C. Williamson

Our paper studies the predictability of online speech -- that is, how well language models learn to model the distribution of user generated content on X (previously Twitter). We define predictability as a measure of the model's uncertainty, i.e. its negative log-likelihood. As the basis of our study, we collect 10M tweets for ``tweet-tuning'' base models and a further 6.25M posts from more than five thousand X (previously Twitter) users and their peers. In our study involving more than 5000 subjects, we find that predicting posts of individual users remains surprisingly hard. Moreover, it matters greatly what context is used: models using the users' own history significantly outperform models using posts from their social circle. We validate these results across four large language models ranging in size from 1.5 billion to 70 billion parameters. Moreover, our results replicate if instead of prompting the model with additional context, we finetune on it. We follow up with a detailed investigation on what is learned in-context and a demographic analysis. Up to 20\% of what is learned in-context is the use of @-mentions and hashtags. Our main results hold across the demographic groups we studied.

LGJul 24, 2024
We should avoid the assumption of data-generating probability distributions in social settings

Benedikt Höltgen, Robert C. Williamson

Machine Learning research, including work promoting fair or equitable algorithms, heavily relies on the concept of a data-generating probability distribution. The standard presumption is that since data points are 'sampled from' such a distribution, one can learn from observed data about this distribution and, thus, predict future data points which are also drawn from it. We argue, however, that such true probability distributions do not exist and should not be dealt with uncritically. We show that alternative frameworks focusing directly on relevant populations rather than abstract distributions are available and leave classical learning theory almost unchanged. Furthermore, we argue that the assumption of true probabilities or data-generating distributions can be misleading and obscure both the choices made and the goals pursued in machine learning practice. Based on these considerations, this position paper argues that, at least in social settings, machine learning work should avoid assuming data-generating probability distributions.

MEJul 24, 2024
Formalising causal inference as prediction on a target population

Benedikt Höltgen, Robert C. Williamson

The standard approach to causal modelling especially in social and health sciences is the potential outcomes framework due to Neyman and Rubin. In this framework, observations are thought to be drawn from a distribution over variables of interest, and the goal is to identify parameters of this distribution. Even though the stated goal is often to inform decision making on some target population, there is no straightforward way to include these target populations in the framework. Instead of modelling the relationship between the observed sample and the target population, the inductive assumptions in this framework take the form of abstract sampling and independence assumptions. In this paper, we develop a version of this framework that construes causal inference as treatment-wise predictions for finite populations where all assumptions are testable in retrospect; this means that one can not only test predictions themselves (without any fundamental problem) but also investigate sources of error when they fail. Due to close connections to the original framework, established methods can still be be analysed under the new framework.

LGApr 8
The Rhetoric of Machine Learning

Robert C. Williamson

I examine the technology of machine learning from the perspective of rhetoric, which is simply the art of persuasion. Rather than being a neutral and "objective" way to build "world models" from data, machine learning is (I argue) inherently rhetorical. I explore some of its rhetorical features, and examine one pervasive business model where machine learning is widely used, "manipulation as a service."

LGOct 30, 2024
Scoring Rules and Calibration for Imprecise Probabilities

Christian Fröhlich, Robert C. Williamson

What does it mean to say that, for example, the probability for rain tomorrow is between 20% and 30%? The theory for the evaluation of precise probabilistic forecasts is well-developed and is grounded in the key concepts of proper scoring rules and calibration. For the case of imprecise probabilistic forecasts (sets of probabilities), such theory is still lacking. In this work, we therefore generalize proper scoring rules and calibration to the imprecise case. We develop these concepts as relative to data models and decision problems. As a consequence, the imprecision is embedded in a clear context. We establish a close link to the paradigm of (group) distributional robustness and in doing so provide new insights for it. We argue that proper scoring rules and calibration serve two distinct goals, which are aligned in the precise case, but intriguingly are not necessarily aligned in the imprecise case. The concept of decision-theoretic entropy plays a key role for both goals. Finally, we demonstrate the theoretical insights in machine learning practice, in particular we illustrate subtle pitfalls relating to the choice of loss function in distributional robustness.

LGApr 25, 2025
Three Types of Calibration with Properties and their Semantic and Formal Relationships

Rabanus Derr, Jessie Finocchiaro, Robert C. Williamson

Fueled by discussions around "trustworthiness" and algorithmic fairness, calibration of predictive systems has regained scholars attention. The vanilla definition and understanding of calibration is, simply put, on all days on which the rain probability has been predicted to be p, the actual frequency of rain days was p. However, the increased attention has led to an immense variety of new notions of "calibration." Some of the notions are incomparable, serve different purposes, or imply each other. In this work, we provide two accounts which motivate calibration: self-realization of forecasted properties and precise estimation of incurred losses of the decision makers relying on forecasts. We substantiate the former via the reflection principle and the latter by actuarial fairness. For both accounts we formulate prototypical definitions via properties $Γ$ of outcome distributions, e.g., the mean or median. The prototypical definition for self-realization, which we call $Γ$-calibration, is equivalent to a certain type of swap regret under certain conditions. These implications are strongly connected to the omniprediction learning paradigm. The prototypical definition for precise loss estimation is a modification of decision calibration adopted from Zhao et al. [73]. For binary outcome sets both prototypical definitions coincide under appropriate choices of reference properties. For higher-dimensional outcome sets, both prototypical definitions can be subsumed by a natural extension of the binary definition, called distribution calibration with respect to a property. We conclude by commenting on the role of groupings in both accounts of calibration often used to obtain multicalibration. In sum, this work provides a semantic map of calibration in order to navigate a fragmented terrain of notions and definitions.

LGJun 4, 2024
An Axiomatic Approach to Loss Aggregation and an Adapted Aggregating Algorithm

Armando J. Cabrera Pacheco, Rabanus Derr, Robert C. Williamson

Supervised learning has gone beyond the expected risk minimization framework. Central to most of these developments is the introduction of more general aggregation functions for losses incurred by the learner. In this paper, we turn towards online learning under expert advice. Via easily justified assumptions we characterize a set of reasonable loss aggregation functions as quasi-sums. Based upon this insight, we suggest a variant of the Aggregating Algorithm tailored to these more general aggregation functions. This variant inherits most of the nice theoretical properties of the AA, such as recovery of Bayes' updating and a time-independent bound on quasi-sum regret. Finally, we argue that generalized aggregations express the attitude of the learner towards losses.

LGMar 4, 2024
Geometry and Stability of Supervised Learning Problems

Facundo Mémoli, Brantley Vose, Robert C. Williamson

We introduce a notion of distance between supervised learning problems, which we call the Risk distance. This distance, inspired by optimal transport, facilitates stability results; one can quantify how seriously issues like sampling bias, noise, limited data, and approximations might change a given problem by bounding how much these modifications can move the problem under the Risk distance. With the distance established, we explore the geometry of the resulting space of supervised learning problems, providing explicit geodesics and proving that the set of classification problems is dense in a larger class of problems. We also provide two variants of the Risk distance: one that incorporates specified weights on a problem's predictors, and one that is more sensitive to the contours of a problem's risk landscape.

LGJan 25, 2024
Forecast Evaluation and the Relationship of Regret and Calibration

Rabanus Derr, Robert C. Williamson

Machine learning is about forecasting. When the forecasts come with an evaluation metric the forecasts become useful. What are reasonable evaluation metrics? How do existing evaluation metrics relate? In this work, we provide a general structure which subsumes many currently used evaluation metrics in a two-dimensional hierarchy, e.g., external and swap regret, loss scores, and calibration scores. The framework embeds those evaluation metrics in a large set of single-instance-based comparisons of forecasts and observations which respect a meta-criterion for reasonable forecast evaluations which we term ``fairness''. In particular, this framework sheds light on the relationship on regret-type and calibration-type evaluation metrics showing a theoretical equivalence in their ability to evaluate, but practical incomparability of the obtained scores.

LGJun 26, 2020
PAC-Bayesian Bound for the Conditional Value at Risk

Zakaria Mhammedi, Benjamin Guedj, Robert C. Williamson

Conditional Value at Risk (CVaR) is a family of "coherent risk measures" which generalize the traditional mathematical expectation. Widely used in mathematical finance, it is garnering increasing interest in machine learning, e.g., as an alternate approach to regularization, and as a means for ensuring fairness. This paper presents a generalization bound for learning algorithms that minimize the CVaR of the empirical loss. The bound is of PAC-Bayesian type and is guaranteed to be small when the empirical CVaR is small. We achieve this by reducing the problem of estimating CVaR to that of merely estimating an expectation. This then enables us, as a by-product, to obtain concentration inequalities for CVaR even when the random variable in question is unbounded.

LGFeb 19, 2019
Proper-Composite Loss Functions in Arbitrary Dimensions

Zac Cranko, Robert C. Williamson, Richard Nock

The study of a machine learning problem is in many ways is difficult to separate from the study of the loss function being used. One avenue of inquiry has been to look at these loss functions in terms of their properties as scoring rules via the proper-composite representation, in which predictions are mapped to probability distributions which are then scored via a scoring rule. However, recent research so far has primarily been concerned with analysing the (typically) finite-dimensional conditional risk problem on the output space, leaving aside the larger total risk minimisation. We generalise a number of these results to an infinite dimensional setting and in doing so we are able to exploit the familial resemblance of density and conditional density estimation to provide a simple characterisation of the canonical link.

MLFeb 3, 2019
Adversarial Networks and Autoencoders: The Primal-Dual Relationship and Generalization Bounds

Hisham Husain, Richard Nock, Robert C. Williamson

Since the introduction of Generative Adversarial Networks (GANs) and Variational Autoencoders (VAE), the literature on generative modelling has witnessed an overwhelming resurgence. The impressive, yet elusive empirical performance of GANs has lead to the rise of many GAN-VAE hybrids, with the hopes of GAN level performance and additional benefits of VAE, such as an encoder for feature reduction, which is not offered by GANs. Recently, the Wasserstein Autoencoder (WAE) was proposed, achieving performance similar to that of GANs, yet it is still unclear whether the two are fundamentally different or can be further improved into a unified model. In this work, we study the $f$-GAN and WAE models and make two main discoveries. First, we find that the $f$-GAN and WAE objectives partake in a primal-dual relationship and are equivalent under some assumptions, which then allows us to explicate the success of WAE. Second, the equivalence result allows us to, for the first time, prove generalization bounds for Autoencoder models, which is a pertinent problem when it comes to theoretical analyses of generative models. Furthermore, we show that the WAE objective is related to other statistical quantities such as the $f$-divergence and in particular, upper bounded by the Wasserstein distance, which then allows us to tap into existing efficient (regularized) optimal transport solvers. Our findings thus present the first primal-dual relationship between GANs and Autoencoder models, comment on generalization abilities and make a step towards unifying these models.

LGJan 24, 2019
Fairness risk measures

Robert C. Williamson, Aditya Krishna Menon

Ensuring that classifiers are non-discriminatory or fair with respect to a sensitive feature (e.g., race or gender) is a topical problem. Progress in this task requires fixing a definition of fairness, and there have been several proposals in this regard over the past few years. Several of these, however, assume either binary sensitive features (thus precluding categorical or real-valued sensitive groups), or result in non-convex objectives (thus adversely affecting the optimisation landscape). In this paper, we propose a new definition of fairness that generalises some existing proposals, while allowing for generic sensitive features and resulting in a convex objective. The key idea is to enforce that the expected losses (or risks) across each subgroup induced by the sensitive feature are commensurate. We show how this relates to the rich literature on risk measures from mathematical finance. As a special case, this leads to a new convex fairness-aware objective based on minimising the conditional value at risk (CVaR).

LGMay 20, 2018
Exp-Concavity of Proper Composite Losses

Parameswaran Kamalaruban, Robert C. Williamson, Xinhua Zhang

The goal of online prediction with expert advice is to find a decision strategy which will perform almost as well as the best expert in a given pool of experts, on any sequence of outcomes. This problem has been widely studied and $O(\sqrt{T})$ and $O(\log{T})$ regret bounds can be achieved for convex losses (\cite{zinkevich2003online}) and strictly convex losses with bounded first and second derivatives (\cite{hazan2007logarithmic}) respectively. In special cases like the Aggregating Algorithm (\cite{vovk1995game}) with mixable losses and the Weighted Average Algorithm (\cite{kivinen1999averaging}) with exp-concave losses, it is possible to achieve $O(1)$ regret bounds. \cite{van2012exp} has argued that mixability and exp-concavity are roughly equivalent under certain conditions. Thus by understanding the underlying relationship between these two notions we can gain the best of both algorithms (strong theoretical performance guarantees of the Aggregating Algorithm and the computational efficiency of the Weighted Average Algorithm). In this paper we provide a complete characterization of the exp-concavity of any proper composite loss. Using this characterization and the mixability condition of proper losses (\cite{van2012mixability}), we show that it is possible to transform (re-parameterize) any $β$-mixable binary proper loss into a $β$-exp-concave composite loss with the same $β$. In the multi-class case, we propose an approximation approach for this transformation.

LGMay 20, 2018
Minimax Lower Bounds for Cost Sensitive Classification

Parameswaran Kamalaruban, Robert C. Williamson

The cost-sensitive classification problem plays a crucial role in mission-critical machine learning applications, and differs with traditional classification by taking the misclassification costs into consideration. Although being studied extensively in the literature, the fundamental limits of this problem are still not well understood. We investigate the hardness of this problem by extending the standard minimax lower bound of balanced binary classification problem (due to \cite{massart2006risk}), and emphasize the impact of cost terms on the hardness.

LGFeb 20, 2018
Constant Regret, Generalized Mixability, and Mirror Descent

Zakaria Mhammedi, Robert C. Williamson

We consider the setting of prediction with expert advice; a learner makes predictions by aggregating those of a group of experts. Under this setting, and for the right choice of loss function and "mixing" algorithm, it is possible for the learner to achieve a constant regret regardless of the number of prediction rounds. For example, a constant regret can be achieved for \emph{mixable} losses using the \emph{aggregating algorithm}. The \emph{Generalized Aggregating Algorithm} (GAA) is a name for a family of algorithms parameterized by convex functions on simplices (entropies), which reduce to the aggregating algorithm when using the \emph{Shannon entropy} $\operatorname{S}$. For a given entropy $Φ$, losses for which a constant regret is possible using the \textsc{GAA} are called $Φ$-mixable. Which losses are $Φ$-mixable was previously left as an open question. We fully characterize $Φ$-mixability and answer other open questions posed by \cite{Reid2015}. We show that the Shannon entropy $\operatorname{S}$ is fundamental in nature when it comes to mixability; any $Φ$-mixable loss is necessarily $\operatorname{S}$-mixable, and the lowest worst-case regret of the \textsc{GAA} is achieved using the Shannon entropy. Finally, by leveraging the connection between the \emph{mirror descent algorithm} and the update step of the GAA, we suggest a new \emph{adaptive} generalized aggregating algorithm and analyze its performance in terms of the regret bound.

LGOct 12, 2017
Provably Fair Representations

Daniel McNamara, Cheng Soon Ong, Robert C. Williamson

Machine learning systems are increasingly used to make decisions about people's lives, such as whether to give someone a loan or whether to interview someone for a job. This has led to considerable interest in making such machine learning systems fair. One approach is to transform the input data used by the algorithm. This can be achieved by passing each input data point through a representation function prior to its use in training or testing. Techniques for learning such representation functions from data have been successful empirically, but typically lack theoretical fairness guarantees. We show that it is possible to prove that a representation function is fair according to common measures of both group and individual fairness, as well as useful with respect to a target task. These provable properties can be used in a governance model involving a data producer, a data user and a data regulator, where there is a separation of concerns between fairness and target task utility to ensure transparency and prevent perverse incentives. We formally define the 'cost of mistrust' of using this model compared to the setting where there is a single trusted party, and provide bounds on this cost in particular cases. We present a practical approach to learning fair representation functions and apply it to financial and criminal justice datasets. We evaluate the fairness and utility of these representation functions using measures motivated by our theoretical results.

LGJul 14, 2017
f-GANs in an Information Geometric Nutshell

Richard Nock, Zac Cranko, Aditya Krishna Menon et al.

Nowozin \textit{et al} showed last year how to extend the GAN \textit{principle} to all $f$-divergences. The approach is elegant but falls short of a full description of the supervised game, and says little about the key player, the generator: for example, what does the generator actually converge to if solving the GAN game means convergence in some space of parameters? How does that provide hints on the generator's design and compare to the flourishing but almost exclusively experimental literature on the subject? In this paper, we unveil a broad class of distributions for which such convergence happens --- namely, deformed exponential families, a wide superset of exponential families --- and show tight connections with the three other key GAN parameters: loss, game and architecture. In particular, we show that current deep architectures are able to factorize a very large number of such densities using an especially compact design, hence displaying the power of deep architectures and their concinnity in the $f$-GAN game. This result holds given a sufficient condition on \textit{activation functions} --- which turns out to be satisfied by popular choices. The key to our results is a variational generalization of an old theorem that relates the KL divergence between regular exponential families and divergences between their natural parameters. We complete this picture with additional results and experimental insights on how these results may be used to ground further improvements of GAN architectures, via (i) a principled design of the activation functions in the generator and (ii) an explicit integration of proper composite losses' link function in the discriminator.

LGMay 25, 2017
The cost of fairness in classification

Aditya Krishna Menon, Robert C. Williamson

We study the problem of learning classifiers with a fairness constraint, with three main contributions towards the goal of quantifying the problem's inherent tradeoffs. First, we relate two existing fairness measures to cost-sensitive risks. Second, we show that for cost-sensitive classification and fairness measures, the optimal classifier is an instance-dependent thresholding of the class-probability function. Third, we show how the tradeoff between accuracy and fairness is determined by the alignment between the class-probabilities for the target and sensitive features. Underpinning our analysis is a general framework that casts the problem of learning with a fairness requirement as one of minimising the difference of two statistical risks.

LGNov 9, 2016
A Modular Theory of Feature Learning

Daniel McNamara, Cheng Soon Ong, Robert C. Williamson

Learning representations of data, and in particular learning features for a subsequent prediction task, has been a fruitful area of research delivering impressive empirical results in recent years. However, relatively little is understood about what makes a representation `good'. We propose the idea of a risk gap induced by representation learning for a given prediction context, which measures the difference in the risk of some learner using the learned features as compared to the original inputs. We describe a set of sufficient conditions for unsupervised representation learning to provide a benefit, as measured by this risk gap. These conditions decompose the problem of when representation learning works into its constituent parts, which can be separately evaluated using an unlabeled sample, suitable domain-specific assumptions about the joint distribution, and analysis of the feature learner and subsequent supervised learner. We provide two examples of such conditions in the context of specific properties of the unlabeled distribution, namely when the data lies close to a low-dimensional manifold and when it forms clusters. We compare our approach to a recently proposed analysis of semi-supervised learning.

LGJul 9, 2015
Fast rates in statistical and online learning

Tim van Erven, Peter D. Grünwald, Nishant A. Mehta et al.

The speed with which a learning algorithm converges as it is presented with more data is a central problem in machine learning --- a fast rate of convergence means less data is needed for the same level of performance. The pursuit of fast rates in online and statistical learning has led to the discovery of many conditions in learning theory under which fast learning is possible. We show that most of these conditions are special cases of a single, unifying condition, that comes in two forms: the central condition for 'proper' learning algorithms that always output a hypothesis in the given model, and stochastic mixability for online algorithms that may make predictions outside of the model. We show that under surprisingly weak assumptions both conditions are, in a certain sense, equivalent. The central condition has a re-interpretation in terms of convexity of a set of pseudoprobabilities, linking it to density estimation under misspecification. For bounded losses, we show how the central condition enables a direct proof of fast rates and we prove its equivalence to the Bernstein condition, itself a generalization of the Tsybakov margin condition, both of which have played a central role in obtaining fast rates in statistical learning. Yet, while the Bernstein condition is two-sided, the central condition is one-sided, making it more suitable to deal with unbounded losses. In its stochastic mixability form, our condition generalizes both a stochastic exp-concavity condition identified by Juditsky, Rigollet and Tsybakov and Vovk's notion of mixability. Our unifying conditions thus provide a substantial step towards a characterization of fast rates in statistical learning, similar to how classical mixability characterizes constant regret in the sequential prediction with expert advice setting.

MLJun 4, 2015
Sparse Robust Classification via the Kernel Mean

Brendan van Rooyen, Aditya Krishna Menon, Robert C. Williamson

Many leading classification algorithms output a classifier that is a weighted average of kernel evaluations. Optimizing these weights is a nontrivial problem that still attracts much research effort. Furthermore, explaining these methods to the uninitiated is a difficult task. Letting all the weights be equal leads to a conceptually simpler classification rule, one that requires little effort to motivate or explain, the mean. Here we explore the consistency, robustness and sparsification of this simple classification rule.

LGMay 28, 2015
Learning with Symmetric Label Noise: The Importance of Being Unhinged

Brendan van Rooyen, Aditya Krishna Menon, Robert C. Williamson

Convex potential minimisation is the de facto approach to binary classification. However, Long and Servedio [2010] proved that under symmetric label noise (SLN), minimisation of any convex potential over a linear function class can result in classification performance equivalent to random guessing. This ostensibly shows that convex losses are not SLN-robust. In this paper, we propose a convex, classification-calibrated loss and prove that it is SLN-robust. The loss avoids the Long and Servedio [2010] result by virtue of being negatively unbounded. The loss is a modification of the hinge loss, where one does not clamp at zero; hence, we call it the unhinged loss. We show that the optimal unhinged solution is equivalent to that of a strongly regularised SVM, and is the limiting solution for any convex potential; this implies that strong l2 regularisation makes most standard learners SLN-robust. Experiments confirm the SLN-robustness of the unhinged loss.

MLApr 1, 2015
Learning in the Presence of Corruption

Brendan van Rooyen, Robert C. Williamson

In supervised learning one wishes to identify a pattern present in a joint distribution $P$, of instances, label pairs, by providing a function $f$ from instances to labels that has low risk $\mathbb{E}_{P}\ell(y,f(x))$. To do so, the learner is given access to $n$ iid samples drawn from $P$. In many real world problems clean samples are not available. Rather, the learner is given access to samples from a corrupted distribution $\tilde{P}$ from which to learn, while the goal of predicting the clean pattern remains. There are many different types of corruption one can consider, and as of yet there is no general means to compare the relative ease of learning under these different corruption processes. In this paper we develop a general framework for tackling such problems as well as introducing upper and lower bounds on the risk for learning in the presence of corruption. Our ultimate goal is to be able to make informed economic decisions in regards to the acquisition of data sets. For a certain subclass of corruption processes (those that are \emph{reconstructible}) we achieve this goal in a particular sense. Our lower bounds are in terms of the coefficient of ergodicity, a simple to calculate property of stochastic matrices. Our upper bounds proceed via a generalization of the method of unbiased estimators appearing in recent work of Natarajan et al and implicit in the earlier work of Kearns.

MLApr 1, 2015
A Theory of Feature Learning

Brendan van Rooyen, Robert C. Williamson

Feature Learning aims to extract relevant information contained in data sets in an automated fashion. It is driving force behind the current deep learning trend, a set of methods that have had widespread empirical success. What is lacking is a theoretical understanding of different feature learning schemes. This work provides a theoretical framework for feature learning and then characterizes when features can be learnt in an unsupervised fashion. We also provide means to judge the quality of features via rate-distortion theory and its generalizations.

LGJun 24, 2014
Generalized Mixability via Entropic Duality

Mark D. Reid, Rafael M. Frongillo, Robert C. Williamson et al.

Mixability is a property of a loss which characterizes when fast convergence is possible in the game of prediction with expert advice. We show that a key property of mixability generalizes, and the exp and log operations present in the usual theory are not as special as one might have thought. In doing this we introduce a more general notion of $Φ$-mixability where $Φ$ is a general entropy (\ie, any convex function on probabilities). We show how a property shared by the convex dual of any such entropy yields a natural algorithm (the minimizer of a regret bound) which, analogous to the classical aggregating algorithm, is guaranteed a constant regret when used with $Φ$-mixable losses. We characterize precisely which $Φ$ have $Φ$-mixable losses and put forward a number of conjectures about the optimality and relationships between different choices of entropy.

LGJun 14, 2014
From Stochastic Mixability to Fast Rates

Nishant A. Mehta, Robert C. Williamson

Empirical risk minimization (ERM) is a fundamental learning rule for statistical learning problems where the data is generated according to some unknown distribution $\mathsf{P}$ and returns a hypothesis $f$ chosen from a fixed class $\mathcal{F}$ with small loss $\ell$. In the parametric setting, depending upon $(\ell, \mathcal{F},\mathsf{P})$ ERM can have slow $(1/\sqrt{n})$ or fast $(1/n)$ rates of convergence of the excess risk as a function of the sample size $n$. There exist several results that give sufficient conditions for fast rates in terms of joint properties of $\ell$, $\mathcal{F}$, and $\mathsf{P}$, such as the margin condition and the Bernstein condition. In the non-statistical prediction with expert advice setting, there is an analogous slow and fast rate phenomenon, and it is entirely characterized in terms of the mixability of the loss $\ell$ (there being no role there for $\mathcal{F}$ or $\mathsf{P}$). The notion of stochastic mixability builds a bridge between these two models of learning, reducing to classical mixability in a special case. The present paper presents a direct proof of fast rates for ERM in terms of stochastic mixability of $(\ell,\mathcal{F}, \mathsf{P})$, and in so doing provides new insight into the fast-rates phenomenon. The proof exploits an old result of Kemperman on the solution to the general moment problem. We also show a partial converse that suggests a characterization of fast rates for ERM in terms of stochastic mixability is possible.

LGMar 10, 2014
Generalised Mixability, Constant Regret, and Bayesian Updating

Mark D. Reid, Rafael M. Frongillo, Robert C. Williamson

Mixability of a loss is known to characterise when constant regret bounds are achievable in games of prediction with expert advice through the use of Vovk's aggregating algorithm. We provide a new interpretation of mixability via convex analysis that highlights the role of the Kullback-Leibler divergence in its definition. This naturally generalises to what we call $Φ$-mixability where the Bregman divergence $D_Φ$ replaces the KL divergence. We prove that losses that are $Φ$-mixable also enjoy constant regret bounds via a generalised aggregating algorithm that is similar to mirror descent.

MLFeb 20, 2014
Le Cam meets LeCun: Deficiency and Generic Feature Learning

Brendan van Rooyen, Robert C. Williamson

"Deep Learning" methods attempt to learn generic features in an unsupervised fashion from a large unlabelled data set. These generic features should perform as well as the best hand crafted features for any learning problem that makes use of this data. We provide a definition of generic features, characterize when it is possible to learn them and provide methods closely related to the autoencoder and deep belief network of deep learning. In order to do so we use the notion of deficiency and illustrate its value in studying certain general learning problems.