Nathaël Da Costa

LG
h-index20
8papers
31citations
Novelty51%
AI Score49

8 Papers

LGFeb 21, 2023
The Gaussian kernel on the circle and spaces that admit isometric embeddings of the circle

Nathaël Da Costa, Cyrus Mostajeran, Juan-Pablo Ortega

On Euclidean spaces, the Gaussian kernel is one of the most widely used kernels in applications. It has also been used on non-Euclidean spaces, where it is known that there may be (and often are) scale parameters for which it is not positive definite. Hope remains that this kernel is positive definite for many choices of parameter. However, we show that the Gaussian kernel is not positive definite on the circle for any choice of parameter. This implies that on metric spaces in which the circle can be isometrically embedded, such as spheres, projective spaces and Grassmannians, the Gaussian kernel is not positive definite for any parameter.

NAJul 2, 2024
Geometric statistics with subspace structure preservation for SPD matrices

Cyrus Mostajeran, Nathaël Da Costa, Graham Van Goffrier et al.

We present a geometric framework for the processing of SPD-valued data that preserves subspace structures and is based on the efficient computation of extreme generalized eigenvalues. This is achieved through the use of the Thompson geometry of the semidefinite cone. We explore a particular geodesic space structure in detail and establish several properties associated with it. Finally, we review a novel inductive mean of SPD matrices based on this geometry.

LGMay 11
Muon is Not That Special: Random or Inverted Spectra Work Just as Well

Zakhar Shumaylov, Nathaël Da Costa, Peter Zaika et al.

The recent empirical success of the Muon optimizer has renewed interest in non-Euclidean optimization, typically justified by similarities with second-order methods, and linear minimization oracle (LMO) theory. In this paper, we challenge this geometric narrative through three contributions, demonstrating that precise geometric structure is not the key factor affecting optimization performance. First, we introduce Freon, a family of optimizers based on Schatten (quasi-)norms, powered by a novel, provably optimal QDWH-based iterative approximation. Freon naturally interpolates between SGD and Muon, while smoothly extrapolating into the quasi-norm regime. Empirically, the best-performing Schatten parameters for GPT-2 lie strictly within the quasi-norm regime, and thus cannot be represented by any unitarily invariant LMO. Second, noting that Freon performs well across a wide range of exponents, we introduce Kaon, an absurd optimizer that replaces singular values with random noise. Despite lacking any coherent geometric structure, Kaon matches Muon's performance and retains classical convergence guarantees, proving that strict adherence to a precise geometry is practically irrelevant. Third, having shown that geometry is not the primary driver of performance, we demonstrate it is instead controlled by two local quantities: alignment and descent potential. Ultimately, each optimizer must tune its step size around these two quantities. While their dynamics are difficult to predict a-priori, evaluating them within a stochastic random feature model yields a precise insight: Muon succeeds not by tracking an ideal global geometry, but by guaranteeing step-size optimality.

LGFeb 5, 2025Code
Rethinking Approximate Gaussian Inference in Classification

Bálint Mucsányi, Nathaël Da Costa, Philipp Hennig

In classification tasks, softmax functions are ubiquitously used as output activations to produce predictive probabilities. Such outputs only capture aleatoric uncertainty. To capture epistemic uncertainty, approximate Gaussian inference methods have been proposed. We develop a common formalism to describe such methods, which we view as outputting Gaussian distributions over the logit space. Predictives are then obtained as the expectations of the Gaussian distributions pushed forward through the softmax. However, such softmax Gaussian integrals cannot be solved analytically, and Monte Carlo (MC) approximations can be costly and noisy. We propose to replace the softmax activation by element-wise normCDF or sigmoid, which allows for the accurate sampling-free approximation of predictives. This also enables the approximation of the Gaussian pushforwards by Dirichlet distributions with moment matching. This approach entirely eliminates the runtime and memory overhead associated with MC sampling. We evaluate it combined with several approximate Gaussian inference methods (Laplace, HET, SNGP) on large- and small-scale datasets (ImageNet, CIFAR-100, CIFAR-10), demonstrating improved uncertainty quantification capabilities compared to softmax MC sampling. Our code is available at https://github.com/bmucsanyi/probit.

LGDec 22, 2023
Sample Path Regularity of Gaussian Processes from the Covariance Kernel

Nathaël Da Costa, Marvin Pförtner, Lancelot Da Costa et al.

Gaussian processes (GPs) are the most common formalism for defining probability distributions over spaces of functions. While applications of GPs are myriad, a comprehensive understanding of GP sample paths, i.e. the function spaces over which they define a probability measure, is lacking. In practice, GPs are not constructed through a probability measure, but instead through a mean function and a covariance kernel. In this paper we provide necessary and sufficient conditions on the covariance kernel for the sample paths of the corresponding GP to attain a given regularity. We use the framework of Hölder regularity as it grants particularly straightforward conditions, which simplify further in the cases of stationary and isotropic GPs. We then demonstrate that our results allow for novel and unusually tight characterisations of the sample path regularities of the GPs commonly used in machine learning applications, such as the Matérn GPs.

DGJul 1, 2025
Geometric Gaussian Approximations of Probability Distributions

Nathaël Da Costa, Bálint Mucsányi, Philipp Hennig

Approximating complex probability distributions, such as Bayesian posterior distributions, is of central interest in many applications. We study the expressivity of geometric Gaussian approximations. These consist of approximations by Gaussian pushforwards through diffeomorphisms or Riemannian exponential maps. We first review these two different kinds of geometric Gaussian approximations. Then we explore their relationship to one another. We further provide a constructive proof that such geometric Gaussian approximations are universal, in that they can capture any probability distribution. Finally, we discuss whether, given a family of probability distributions, a common diffeomorphism can be found to obtain uniformly high-quality geometric Gaussian approximations for that family.

LGOct 6, 2025
Closed-Form Last Layer Optimization

Alexandre Galashov, Nathaël Da Costa, Liyuan Xu et al.

Neural networks are typically optimized with variants of stochastic gradient descent. Under a squared loss, however, the optimal solution to the linear last layer weights is known in closed-form. We propose to leverage this during optimization, treating the last layer as a function of the backbone parameters, and optimizing solely for these parameters. We show this is equivalent to alternating between gradient descent steps on the backbone and closed-form updates on the last layer. We adapt the method for the setting of stochastic gradient descent, by trading off the loss on the current batch against the accumulated information from previous batches. Further, we prove that, in the Neural Tangent Kernel regime, convergence of this method to an optimal solution is guaranteed. Finally, we demonstrate the effectiveness of our approach compared with standard SGD on a squared loss in several supervised tasks -- both regression and classification -- including Fourier Neural Operators and Instrumental Variable Regression.

STAug 1, 2025
Constructive Disintegration and Conditional Modes

Nathaël Da Costa, Marvin Pförtner, Jon Cockayne

Conditioning, the central operation in Bayesian statistics, is formalised by the notion of disintegration of measures. However, due to the implicit nature of their definition, constructing disintegrations is often difficult. A folklore result in machine learning conflates the construction of a disintegration with the restriction of probability density functions onto the subset of events that are consistent with a given observation. We provide a comprehensive set of mathematical tools which can be used to construct disintegrations and apply these to find densities of disintegrations on differentiable manifolds. Using our results, we provide a disturbingly simple example in which the restricted density and the disintegration density drastically disagree. Motivated by applications in approximate Bayesian inference and Bayesian inverse problems, we further study the modes of disintegrations. We show that the recently introduced notion of a "conditional mode" does not coincide in general with the modes of the conditional measure obtained through disintegration, but rather the modes of the restricted measure. We also discuss the implications of the discrepancy between the two measures in practice, advocating for the utility of both approaches depending on the modelling context.