ML Mar 28, 2023
Sparse Gaussian Processes with Spherical Harmonic Features Revisited
Stefanos Eleftheriadis, Dominic Richards, James Hensman
We revisit the Gaussian process model with spherical harmonic features and study connections between the associated RKHS, its eigenstructure and deep models. Based on this, we introduce a new class of kernels which correspond to deep models of continuous depth. In our formulation, depth can be estimated as a kernel hyper-parameter by optimizing the evidence lower bound. Further, we introduce sparseness in the eigenbasis by variational learning of the spherical harmonic phases. This enables scaling to larger input dimensions than was previously possible, while also allowing for learning of high-frequency variations. We validate our approach on machine learning benchmark datasets.
LG May 27, 2022
Double Deep Q Networks for Sensor Management in Space Situational Awareness
Benedict Oakes, Dominic Richards, Jordi Barr et al.
We present a novel Double Deep Q Network (DDQN) application to a sensor management problem in space situational awareness (SSA). Frequent launches of satellites into Earth orbit pose a significant sensor management challenge, whereby a limited number of sensors are required to detect and track an increasing number of objects. In this paper, we demonstrate the use of reinforcement learning to develop a sensor management policy for SSA. We simulate a controllable Earth-based telescope, which is trained to maximise the number of satellites tracked using an extended Kalman filter. The estimated state covariance matrices for satellites observed under the DDQN policy are greatly reduced compared to those generated by an alternate (random) policy. This work provides the basis for further advancements and motivates the use of reinforcement learning for SSA.
ST Aug 26, 2021
Comparing Classes of Estimators: When does Gradient Descent Beat Ridge Regression in Linear Models?
Dominic Richards, Edgar Dobriban, Patrick Rebeschini
Methods for learning from data depend on various types of tuning parameters, such as penalization strength or step size. Since performance can depend strongly on these parameters, it is important to compare classes of estimators (by considering prescribed finite sets of tuning parameters), not just particularly tuned methods. In this work, we investigate classes of methods via the relative performance of the best method in the class. We consider the central problem of linear regression (with a random isotropic ground truth) and investigate the estimation performance of two fundamental methods, gradient descent and ridge regression. We unveil the following phenomena. (1) For general designs, constant step size gradient descent outperforms ridge regression when the eigenvalues of the empirical data covariance matrix decay slowly, as a power law with exponent less than unity. If instead the eigenvalues decay quickly, as a power law with exponent greater than unity or exponentially, we show that ridge regression outperforms gradient descent. (2) For orthogonal designs, we compute the exact minimax optimal class of estimators (achieving min-max-min optimality), showing it is equivalent to gradient descent with decaying learning rate. We find the sub-optimality of ridge regression and gradient descent with constant step size. Our results highlight that statistical performance can depend strongly on tuning parameters. In particular, while optimally tuned ridge regression is the best estimator in our setting, it can be outperformed by gradient descent by an arbitrary/unbounded amount when both methods are only tuned over finitely many regularization parameters.
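As a rough illustration of the two estimator families compared in this abstract (not the paper's analysis), the sketch below works in the eigenbasis of the empirical covariance, where both methods shrink each coordinate of the least-squares solution: ridge by the factor s / (s + lam) and constant step size gradient descent from zero by 1 - (1 - eta * s)^t. The eigenvalues, the vector `z`, and the values of `eta`, `t`, and `lam` are made-up assumptions chosen only for the demonstration.

```python
def ridge_filter(s, lam):
    """Shrinkage that ridge applies along an eigen-direction with eigenvalue s."""
    return s / (s + lam)


def gd_filter(s, eta, t):
    """Shrinkage after t constant-step gradient descent iterations started at zero."""
    return 1.0 - (1.0 - eta * s) ** t


def gd_iterate(eigenvalues, z, eta, t):
    """Run t gradient steps on the per-coordinate loss 0.5*s*w^2 - z*w."""
    w = [0.0] * len(eigenvalues)
    for _ in range(t):
        w = [wi - eta * (si * wi - zi) for wi, si, zi in zip(w, eigenvalues, z)]
    return w


# Hypothetical spectrum with slow power-law decay (exponent 0.5, below unity).
eigenvalues = [1.0 / (k + 1) ** 0.5 for k in range(5)]
# Hypothetical data-dependent vector (X^T y / n expressed in the eigenbasis).
z = [1.0, -0.5, 0.25, 2.0, -1.0]
ols = [zi / si for zi, si in zip(z, eigenvalues)]

eta, t, lam = 0.5, 20, 0.1
w_gd = gd_iterate(eigenvalues, z, eta, t)
w_gd_closed_form = [gd_filter(si, eta, t) * oi for si, oi in zip(eigenvalues, ols)]
w_ridge = [ridge_filter(si, lam) * oi for si, oi in zip(eigenvalues, ols)]

for si, g, r in zip(eigenvalues, w_gd, w_ridge):
    print(f"eigenvalue={si:.3f}  gd={g:+.4f}  ridge={r:+.4f}")
```

Tuning a class of estimators then amounts to choosing `t` (for a given `eta`) or `lam` from a finite grid, which is the kind of comparison the abstract describes.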
ML Jul 27, 2021
Stability & Generalisation of Gradient Descent for Shallow Neural Networks without the Neural Tangent Kernel
Dominic Richards, Ilja Kuzborskij
We revisit on-average algorithmic stability of GD for training overparameterised shallow neural networks and prove new generalisation and excess risk bounds without the NTK or PL assumptions. In particular, we show oracle-type bounds which reveal that the generalisation and excess risk of GD is controlled by an interpolating network with the shortest GD path from initialisation (in a sense, an interpolating network with the smallest relative norm). While this was known for kernelised interpolants, our proof applies directly to networks trained by GD without intermediate kernelisation. At the same time, by relaxing the oracle inequalities developed here, we recover existing NTK-based risk bounds in a straightforward way, which demonstrates that our analysis is tighter. Finally, unlike most NTK-based analyses, we focus on regression with label noise and show that GD with early stopping is consistent.
ML Jan 13, 2021
Learning with Gradient Descent and Weakly Convex Losses
Dominic Richards, Mike Rabbat
We study the learning performance of gradient descent when the empirical risk is weakly convex, namely, the smallest negative eigenvalue of the empirical risk's Hessian is bounded in magnitude. By showing that this eigenvalue can control the stability of gradient descent, generalisation error bounds are proven that hold under a wider range of step sizes compared to previous work. Out-of-sample guarantees are then achieved by decomposing the test error into generalisation, optimisation and approximation errors, each of which can be bounded and traded off with respect to algorithmic parameters, sample size and magnitude of this eigenvalue. In the case of a two-layer neural network, we demonstrate that the empirical risk can satisfy a notion of local weak convexity; specifically, the Hessian's smallest eigenvalue during training can be controlled by the normalisation of the layers, i.e., network scaling. This allows test error guarantees to then be achieved when the population risk minimiser satisfies a complexity assumption. By trading off the network complexity and scaling, insights are gained into the implicit bias of neural network scaling, which are further supported by experimental findings.
ML Jul 1, 2020
Decentralised Learning with Random Features and Distributed Gradient Descent
Dominic Richards, Patrick Rebeschini, Lorenzo Rosasco
We investigate the generalisation performance of Distributed Gradient Descent with Implicit Regularisation and Random Features in the homogeneous setting where a network of agents is given data sampled independently from the same unknown distribution. Along with reducing the memory footprint, Random Features are particularly convenient in this setting as they provide a common parameterisation across agents that overcomes previous difficulties in implementing Decentralised Kernel Regression. Under standard source and capacity assumptions, we establish high probability bounds on the predictive performance for each agent as a function of the step size, number of iterations, inverse spectral gap of the communication matrix and number of Random Features. By tuning these parameters, we obtain statistical rates that are minimax optimal with respect to the total number of samples in the network. The algorithm provides a linear improvement over single machine Gradient Descent in memory cost and, when agents hold enough data with respect to the network size and inverse spectral gap, a linear speed-up in computational runtime for any network topology. We present simulations that show how the number of Random Features, iterations and samples impact predictive performance.
ST Jun 11, 2020
Asymptotics of Ridge (less) Regression under General Source Condition
Dominic Richards, Jaouad Mourtada, Lorenzo Rosasco
We analyze the prediction error of ridge regression in an asymptotic regime where the sample size and dimension go to infinity at a proportional rate. In particular, we consider the role played by the structure of the true regression parameter. We observe that the case of a general deterministic parameter can be reduced to the case of a random parameter from a structured prior. The latter assumption is a natural adaptation of classic smoothness assumptions in nonparametric regression, which are known as source conditions in the context of regularization theory for inverse problems. Roughly speaking, we assume the large coefficients of the parameter are in correspondence to the principal components. In this setting a precise characterisation of the test error is obtained, depending on the input covariance and regression parameter structure. We illustrate this characterisation in a simplified setting to investigate the influence of the true parameter on optimal regularisation for overparameterized models. We show that interpolation (no regularisation) can be optimal even with bounded signal-to-noise ratio (SNR), provided that the parameter coefficients are larger on high-variance directions of the data, corresponding to a more regular function than posited by the regularization term. This contrasts with previous work considering ridge regression with isotropic prior, in which case interpolation is only optimal in the limit of infinite SNR.
ST Dec 3, 2019
Distributed Machine Learning with Sparse Heterogeneous Data
Dominic Richards, Sahand N. Negahban, Patrick Rebeschini
Motivated by distributed machine learning settings such as Federated Learning, we consider the problem of fitting a statistical model across a distributed collection of heterogeneous data sets whose similarity structure is encoded by a graph topology. Precisely, we analyse the case where each node is associated with fitting a sparse linear model, and edges join two nodes if the difference of their solutions is also sparse. We propose a method based on Basis Pursuit Denoising with a total variation penalty, and provide finite sample guarantees for sub-Gaussian design matrices. Taking the root of the tree as a reference node, we show that if the sparsity of the differences across nodes is smaller than the sparsity at the root, then recovery is successful with fewer samples than by solving the problems independently, or by using methods that rely on a large overlap in the signal supports, such as the group Lasso. We consider both the noiseless and noisy setting, and numerically investigate the performance of distributed methods based on Distributed Alternating Direction Methods of Multipliers (ADMM) and hyperspectral unmixing.
ML May 8, 2019
Optimal Statistical Rates for Decentralised Non-Parametric Regression with Linear Speed-Up
Dominic Richards, Patrick Rebeschini
We analyse the learning performance of Distributed Gradient Descent in the context of multi-agent decentralised non-parametric regression with the square loss function when i.i.d. samples are assigned to agents. We show that if agents hold sufficiently many samples with respect to the network size, then Distributed Gradient Descent achieves optimal statistical rates with a number of iterations that scales, up to a threshold, with the inverse of the spectral gap of the gossip matrix divided by the number of samples owned by each agent raised to a problem-dependent power. The presence of the threshold comes from statistics. It encodes the existence of a "big data" regime where the number of required iterations does not depend on the network topology. In this regime, Distributed Gradient Descent achieves optimal statistical rates with the same order of iterations as gradient descent run with all the samples in the network. Provided the communication delay is sufficiently small, the distributed protocol yields a linear speed-up in runtime compared to the single-machine protocol. This is in contrast to decentralised optimisation algorithms that do not exploit statistics and only yield a linear speed-up in graphs where the spectral gap is bounded away from zero. Our results exploit the statistical concentration of quantities held by agents and shed new light on the interplay between statistics and communication in decentralised methods. Bounds are given in the standard non-parametric setting with source/capacity assumptions.
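To make the setting concrete, here is a minimal sketch of one common form of distributed gradient descent: each agent takes a gradient step on its local squared loss, then averages with its neighbours through a doubly stochastic gossip matrix. The ring topology, the scalar linear model, the synthetic data, and all function names are assumptions made for illustration; the paper itself treats non-parametric regression and its exact update may differ.

```python
import random


def ring_gossip_matrix(n):
    """Symmetric, doubly stochastic gossip matrix for a ring of n >= 3 agents."""
    P = [[0.0] * n for _ in range(n)]
    for v in range(n):
        P[v][v] += 0.5
        P[v][(v - 1) % n] += 0.25
        P[v][(v + 1) % n] += 0.25
    return P


def local_gradient(w, data):
    """Gradient of the local loss (1 / 2m) * sum_i (x_i * w - y_i)^2."""
    return sum((x * w - y) * x for x, y in data) / len(data)


def local_descent(w, agent_data, eta):
    """Each agent takes one gradient step using only its own samples."""
    return [w[v] - eta * local_gradient(w[v], agent_data[v]) for v in range(len(w))]


def gossip_average(P, z):
    """Each agent replaces its value by a weighted average over its neighbours."""
    n = len(z)
    return [sum(P[v][u] * z[u] for u in range(n)) for v in range(n)]


def distributed_gd(P, agent_data, eta, iterations):
    """Alternate local gradient steps and gossip averaging, starting from zero."""
    w = [0.0] * len(agent_data)
    for _ in range(iterations):
        w = gossip_average(P, local_descent(w, agent_data, eta))
    return w


# Synthetic i.i.d. data: every agent samples from the same distribution.
rng = random.Random(0)
n_agents, samples_per_agent, true_w = 5, 20, 2.0
agent_data = []
for _ in range(n_agents):
    xs = [rng.gauss(0.0, 1.0) for _ in range(samples_per_agent)]
    agent_data.append([(x, true_w * x + 0.1 * rng.gauss(0.0, 1.0)) for x in xs])

P = ring_gossip_matrix(n_agents)
w_agents = distributed_gd(P, agent_data, eta=0.1, iterations=200)
print(w_agents)
```

Because the gossip matrix is doubly stochastic, the averaging step leaves the network-wide mean of the iterates unchanged; only the disagreement between agents is reduced, at a rate governed by the spectral gap of `P`.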
LG Sep 18, 2018
Graph-Dependent Implicit Regularisation for Distributed Stochastic Subgradient Descent
Dominic Richards, Patrick Rebeschini
We propose graph-dependent implicit regularisation strategies for distributed stochastic subgradient descent (Distributed SGD) for convex problems in multi-agent learning. Under the standard assumptions of convexity, Lipschitz continuity, and smoothness, we establish statistical learning rates that retain, up to logarithmic terms, centralised statistical guarantees through implicit regularisation (step size tuning and early stopping) with appropriate dependence on the graph topology. Our approach avoids the need for explicit regularisation in decentralised learning problems, such as adding constraints to the empirical risk minimisation rule. Particularly for distributed methods, the use of implicit regularisation allows the algorithm to remain simple, without projections or dual methods. To prove our results, we establish graph-independent generalisation bounds for Distributed SGD that match the centralised setting (using algorithmic stability), and we establish graph-dependent optimisation bounds that are of independent interest. We present numerical experiments to show that the qualitative nature of the upper bounds we derive can be representative of real behaviours.