François-Xavier Vialard

OC
h-index13
19papers
623citations
Novelty53%
AI Score50

19 Papers

CVJun 13, 2022
$\texttt{GradICON}$: Approximate Diffeomorphisms via Gradient Inverse Consistency

Lin Tian, Hastings Greer, François-Xavier Vialard et al. · harvard

We present an approach to learning regular spatial transformations between image pairs in the context of medical image registration. Contrary to optimization-based registration techniques and many modern learning-based methods, we do not directly penalize transformation irregularities but instead promote transformation regularity via an inverse consistency penalty. We use a neural network to predict a map between a source and a target image as well as the map when swapping the source and target images. Different from existing approaches, we compose these two resulting maps and regularize deviations of the $\bf{Jacobian}$ of this composition from the identity matrix. This regularizer -- $\texttt{GradICON}$ -- results in much better convergence when training registration models compared to promoting inverse consistency of the composition of maps directly while retaining the desirable implicit regularization effects of the latter. We achieve state-of-the-art registration performance on a variety of real-world medical image datasets using a single set of hyperparameters and a single non-dataset-specific training protocol.

NAApr 8, 2015
Mixture of Kernels and Iterated Semidirect Product of Diffeomorphisms Groups

Martins Bruveris, Laurent Risser, François-Xavier Vialard

In the framework of large deformation diffeomorphic metric mapping (LDDMM), we develop a multi-scale theory for the diffeomorphism group based on previous works. The purpose of the paper is (1) to develop in details a variational approach for multi-scale analysis of diffeomorphisms, (2) to generalise to several scales the semidirect product representation and (3) to illustrate the resulting diffeomorphic decomposition on synthetic and real images. We also show that the approaches presented in other papers and the mixture of kernels are equivalent.

NADec 7, 2015
Piecewise rigid curve deformation via a Finsler steepest descent

Guillaume Charpiat, Giacomo Nardi, Gabriel Peyré et al.

This paper introduces a novel steepest descent flow in Banach spaces. This extends previous works on generalized gradient descent, notably the work of Charpiat et al., to the setting of Finsler metrics. Such a generalized gradient allows one to take into account a prior on deformations (e.g., piecewise rigid) in order to favor some specific evolutions. We define a Finsler gradient descent method to minimize a functional defined on a Banach space and we prove a convergence theorem for such a method. In particular, we show that the use of non-Hilbertian norms on Banach spaces is useful to study non-convex optimization problems where the geometry of the space might play a crucial role to avoid poor local minima. We show some applications to the curve matching problem. In particular, we characterize piecewise rigid deformations on the space of curves and we study several models to perform piecewise rigid evolution of curves.

MLNov 16, 2022
Unbalanced Optimal Transport, from Theory to Numerics

Thibault Séjourné, Gabriel Peyré, François-Xavier Vialard

Optimal Transport (OT) has recently emerged as a central tool in data sciences to compare in a geometrically faithful way point clouds and more generally probability distributions. The wide adoption of OT into existing data analysis and machine learning pipelines is however plagued by several shortcomings. This includes its lack of robustness to outliers, its high computational costs, the need for a large number of samples in high dimension and the difficulty to handle data in distinct spaces. In this review, we detail several recently proposed approaches to mitigate these issues. We insist in particular on unbalanced OT, which compares arbitrary positive measures, not restricted to probability distributions (i.e. their total mass can vary). This generalization of OT makes it robust to outliers and missing data. The second workhorse of modern computational OT is entropic regularization, which leads to scalable algorithms while lowering the sample complexity in high dimension. The last point presented in this review is the Gromov-Wasserstein (GW) distance, which extends OT to cope with distributions belonging to different metric spaces. The main motivation for this review is to explain how unbalanced OT, entropic regularization and GW can work hand-in-hand to turn OT into efficient geometric loss functions for data sciences.

CGApr 16
Gromov-Wasserstein at Scale, Beyond Squared Norms

Guillaume Houry, Jean Feydy, François-Xavier Vialard

A fundamental challenge in data science is to match disparate point sets with each other. While optimal transport efficiently minimizes point displacements under a bijectivity constraint, it is inherently sensitive to rotations. Conversely, minimizing distortions via the Gromov-Wasserstein (GW) framework addresses this limitation but introduces a non-convex, computationally demanding optimization problem. In this work, we identify a broad class of distortion penalties that reduce to a simple alignment problem within a lifted feature space. Leveraging this insight, we introduce an iterative GW solver with a linear memory footprint and quadratic (rather than cubic) time complexity. Our method is differentiable, comes with strong theoretical guarantees, and scales to hundreds of thousands of points in minutes. This efficiency unlocks a wide range of geometric applications and enables the exploration of the GW energy landscape, whose local minima encode the symmetries of the matching problem.

LGDec 12, 2025
Sliced ReLU attention: Quasi-linear contextual expressivity via sorting

François-Xavier Vialard, Siwan Boufadène

We introduce sliced ReLU attention, a new attention mechanism that departs structurally from both softmax and its approximation alternatives. Instead of applying a nonlinearity to pairwise dot products, we operate on one-dimensional projections of key--query differences and leverage sorting to obtain quasi-linear complexity. This construction yields a differentiable, non-symmetric kernel that can be computed in O(n log(n)) through a sorting procedure, making it suitable for very long contexts. Beyond computational benefits, the model retains strong theoretical expressive power: we establish two in-context expressivity results, previously known for softmax attention, showing that sliced ReLU attention preserves the ability to perform nontrivial sequence-to-sequence disentangling tasks and satisfies a contextual universal approximation property. Finally, we illustrate the potential practical interest of this kernel in small to medium-scale experiments.

LGDec 11, 2025
Token Sample Complexity of Attention

Léa Bohbot, Cyril Letrouit, Gabriel Peyré et al.

As context windows in large language models continue to expand, it is essential to characterize how attention behaves at extreme sequence lengths. We introduce token-sample complexity: the rate at which attention computed on $n$ tokens converges to its infinite-token limit. We estimate finite-$n$ convergence bounds at two levels: pointwise uniform convergence of the attention map, and convergence of moments for the transformed token distribution. For compactly supported (and more generally sub-Gaussian) distributions, our first result shows that the attention map converges uniformly on a ball of radius $R$ at rate $C(R)/\sqrt{n}$, where $C(R)$ grows exponentially with $R$. For large $R$, this estimate loses practical value, and our second result addresses this issue by establishing convergence rates for the moments of the transformed distribution (the token output of the attention layer). In this case, the rate is $C'(R)/n^β$ with $β<\tfrac{1}{2}$, and $C'(R)$ depends polynomially on the size of the support of the distribution. The exponent $β$ depends on the attention geometry and the spectral properties of the tokens distribution. We also examine the regime in which the attention parameter tends to infinity and the softmax approaches a hardmax, and in this setting, we establish a logarithmic rate of convergence. Experiments on synthetic Gaussian data and real BERT models on Wikipedia text confirm our predictions.

CVJun 1, 2019Code
Region-specific Diffeomorphic Metric Mapping

Zhengyang Shen, François-Xavier Vialard, Marc Niethammer

We introduce a region-specific diffeomorphic metric mapping (RDMM) registration approach. RDMM is non-parametric, estimating spatio-temporal velocity fields which parameterize the sought-for spatial transformation. Regularization of these velocity fields is necessary. However, while existing non-parametric registration approaches, e.g., the large displacement diffeomorphic metric mapping (LDDMM) model, use a fixed spatially-invariant regularization our model advects a spatially-varying regularizer with the estimated velocity field, thereby naturally attaching a spatio-temporal regularizer to deforming objects. We explore a family of RDMM registration approaches: 1) a registration model where regions with separate regularizations are pre-defined (e.g., in an atlas space), 2) a registration model where a general spatially-varying regularizer is estimated, and 3) a registration model where the spatially-varying regularizer is obtained via an end-to-end trained deep learning (DL) model. We provide a variational derivation of RDMM, show that the model can assure diffeomorphic transformations in the continuum, and that LDDMM is a particular instance of RDMM. To evaluate RDMM performance we experiment 1) on synthetic 2D data and 2) on two 3D datasets: knee magnetic resonance images (MRIs) of the Osteoarthritis Initiative (OAI) and computed tomography images (CT) of the lung. Results show that our framework achieves state-of-the-art image registration performance, while providing additional information via a learned spatio-temoporal regularizer. Further, our deep learning approach allows for very fast RDMM and LDDMM estimations. Our code will be open-sourced. Code is available at https://github.com/uncbiag/registration.

LGApr 25, 2025
Ultra-fast feature learning for the training of two-layer neural networks in the two-timescale regime

Raphaël Barboni, Gabriel Peyré, François-Xavier Vialard

We study the convergence of gradient methods for the training of mean-field single-hidden-layer neural networks with square loss. For this high-dimensional and non-convex optimization problem, most known convergence results are either qualitative or rely on a neural tangent kernel analysis where nonlinear representations of the data are fixed. Using that this problem belongs to the class of separable nonlinear least squares problems, we consider here a Variable Projection (VarPro) or two-timescale learning algorithm, thereby eliminating the linear variables and reducing the learning problem to the training of nonlinear features. In a teacher-student scenario, we show such a strategy enables provable convergence rates for the sampling of a teacher feature distribution. Precisely, in the limit where the regularization strength vanishes, we show that the dynamic of the feature distribution corresponds to a weighted ultra-fast diffusion equation. Recent results on the asymptotic behavior of such PDEs then give quantitative guarantees for the convergence of the learned feature distribution.

LGMar 19, 2024
Understanding the training of infinitely deep and wide ResNets with Conditional Optimal Transport

Raphaël Barboni, Gabriel Peyré, François-Xavier Vialard

We study the convergence of gradient flow for the training of deep neural networks. If Residual Neural Networks are a popular example of very deep architectures, their training constitutes a challenging optimization problem due notably to the non-convexity and the non-coercivity of the objective. Yet, in applications, those tasks are successfully solved by simple optimization algorithms such as gradient descent. To better understand this phenomenon, we focus here on a ``mean-field'' model of infinitely deep and arbitrarily wide ResNet, parameterized by probability measures over the product set of layers and parameters and with constant marginal on the set of layers. Indeed, in the case of shallow neural networks, mean field models have proven to benefit from simplified loss-landscapes and good theoretical guarantees when trained with gradient flow for the Wasserstein metric on the set of probability measures. Motivated by this approach, we propose to train our model with gradient flow w.r.t. the conditional Optimal Transport distance: a restriction of the classical Wasserstein distance which enforces our marginal condition. Relying on the theory of gradient flows in metric spaces we first show the well-posedness of the gradient flow equation and its consistency with the training of ResNets at finite width. Performing a local Polyak-Łojasiewicz analysis, we then show convergence of the gradient flow for well-chosen initializations: if the number of features is finite but sufficiently large and the risk is sufficiently small at initialization, the gradient flow converges towards a global minimizer. This is the first result of this type for infinitely deep and arbitrarily wide ResNets.

OCFeb 3, 2022
Toric Geometry of Entropic Regularization

Bernd Sturmfels, Simon Telen, François-Xavier Vialard et al.

Entropic regularization is a method for large-scale linear programming. Geometrically, one traces intersections of the feasible polytope with scaled toric varieties, starting at the Birch point. We compare this to log-barrier methods, with reciprocal linear spaces, starting at the analytic center. We revisit entropic regularization for unbalanced optimal transport, and we develop the use of optimal conic couplings. We compute the degree of the associated toric variety, and we explore algorithms like iterative scaling.

OCJan 3, 2022
Faster Unbalanced Optimal Transport: Translation invariant Sinkhorn and 1-D Frank-Wolfe

Thibault Séjourné, François-Xavier Vialard, Gabriel Peyré

Unbalanced optimal transport (UOT) extends optimal transport (OT) to take into account mass variations to compare distributions. This is crucial to make OT successful in ML applications, making it robust to data normalization and outliers. The baseline algorithm is Sinkhorn, but its convergence speed might be significantly slower for UOT than for OT. In this work, we identify the cause for this deficiency, namely the lack of a global normalization of the iterates, which equivalently corresponds to a translation of the dual OT potentials. Our first contribution leverages this idea to develop a provably accelerated Sinkhorn algorithm (coined 'translation invariant Sinkhorn') for UOT, bridging the computational gap with OT. Our second contribution focusses on 1-D UOT and proposes a Frank-Wolfe solver applied to this translation invariant formulation. The linear oracle of each steps amounts to solving a 1-D OT problems, resulting in a linear time complexity per iteration. Our last contribution extends this method to the computation of UOT barycenter of 1-D measures. Numerical simulations showcase the convergence speed improvement brought by these three approaches.

NEDec 10, 2021
Global convergence of ResNets: From finite to infinite width using linear parameterization

Raphaël Barboni, Gabriel Peyré, François-Xavier Vialard

Overparametrization is a key factor in the absence of convexity to explain global convergence of gradient descent (GD) for neural networks. Beside the well studied lazy regime, infinite width (mean field) analysis has been developed for shallow networks, using on convex optimization technics. To bridge the gap between the lazy and mean field regimes, we study Residual Networks (ResNets) in which the residual block has linear parametrization while still being nonlinear. Such ResNets admit both infinite depth and width limits, encoding residual blocks in a Reproducing Kernel Hilbert Space (RKHS). In this limit, we prove a local Polyak-Lojasiewicz inequality. Thus, every critical point is a global minimizer and a local convergence result of GD holds, retrieving the lazy regime. In contrast with other mean-field studies, it applies to both parametric and non-parametric cases under an expressivity condition on the residuals. Our analysis leads to a practical and quantified recipe: starting from a universal RKHS, Random Fourier Features are applied to obtain a finite dimensional parameterization satisfying with high-probability our expressivity condition.

MLDec 3, 2021
Near-optimal estimation of smooth transport maps with kernel sums-of-squares

Boris Muzellec, Adrien Vacher, Francis Bach et al.

It was recently shown that under smoothness conditions, the squared Wasserstein distance between two distributions could be efficiently computed with appealing statistical error upper bounds. However, rather than the distance itself, the object of interest for applications such as generative modeling is the underlying optimal transport map. Hence, computational and statistical guarantees need to be obtained for the estimated maps themselves. In this paper, we propose the first tractable algorithm for which the statistical $L^2$ error on the maps nearly matches the existing minimax lower-bounds for smooth map estimation. Our method is based on solving the semi-dual formulation of optimal transport with an infinite-dimensional sum-of-squares reformulation, and leads to an algorithm which has dimension-free polynomial rates in the number of samples, with potentially exponentially dimension-dependent constants.

OCSep 9, 2020
The Unbalanced Gromov Wasserstein Distance: Conic Formulation and Relaxation

Thibault Séjourné, François-Xavier Vialard, Gabriel Peyré

Comparing metric measure spaces (i.e. a metric space endowed with aprobability distribution) is at the heart of many machine learning problems. The most popular distance between such metric measure spaces is theGromov-Wasserstein (GW) distance, which is the solution of a quadratic assignment problem. The GW distance is however limited to the comparison of metric measure spaces endowed with a probability distribution. To alleviate this issue, we introduce two Unbalanced Gromov-Wasserstein formulations: a distance and a more tractable upper-bounding relaxation.They both allow the comparison of metric spaces equipped with arbitrary positive measures up to isometries. The first formulation is a positive and definite divergence based on a relaxation of the mass conservation constraint using a novel type of quadratically-homogeneous divergence. This divergence works hand in hand with the entropic regularization approach which is popular to solve large scale optimal transport problems. We show that the underlying non-convex optimization problem can be efficiently tackled using a highly parallelizable and GPU-friendly iterative scheme. The second formulation is a distance between mm-spaces up to isometries based on a conic lifting. Lastly, we provide numerical experiments onsynthetic examples and domain adaptation data with a Positive-Unlabeled learning task to highlight the salient features of the unbalanced divergence and its potential applications in ML.

NEJun 18, 2020
A Shooting Formulation of Deep Learning

François-Xavier Vialard, Roland Kwitt, Susan Wei et al.

Continuous-depth neural networks can be viewed as deep limits of discrete neural networks whose dynamics resemble a discretization of an ordinary differential equation (ODE). Although important steps have been taken to realize the advantages of such continuous formulations, most current techniques are not truly continuous-depth as they assume \textit{identical} layers. Indeed, existing works throw into relief the myriad difficulties presented by an infinite-dimensional parameter space in learning a continuous-depth neural ODE. To this end, we introduce a shooting formulation which shifts the perspective from parameterizing a network layer-by-layer to parameterizing over optimal networks described only by a set of initial conditions. For scalability, we propose a novel particle-ensemble parametrization which fully specifies the optimal weight trajectory of the continuous-depth neural network. Our experiments show that our particle-ensemble shooting formulation can achieve competitive performance, especially on long-range forecasting tasks. Finally, though the current work is inspired by continuous-depth neural networks, the particle-ensemble shooting formulation also applies to discrete-time networks and may lead to a new fertile area of research in deep learning parametrization.

OCJun 15, 2020
Faster Wasserstein Distance Estimation with the Sinkhorn Divergence

Lenaic Chizat, Pierre Roussillon, Flavien Léger et al.

The squared Wasserstein distance is a natural quantity to compare probability distributions in a non-parametric setting. This quantity is usually estimated with the plug-in estimator, defined via a discrete optimal transport problem which can be solved to $ε$-accuracy by adding an entropic regularization of order $ε$ and using for instance Sinkhorn's algorithm. In this work, we propose instead to estimate it with the Sinkhorn divergence, which is also built on entropic regularization but includes debiasing terms. We show that, for smooth densities, this estimator has a comparable sample complexity but allows higher regularization levels, of order $ε^{1/2}$, which leads to improved computational complexity bounds and a strong speedup in practice. Our theoretical analysis covers the case of both randomly sampled densities and deterministic discretizations on uniform grids. We also propose and analyze an estimator based on Richardson extrapolation of the Sinkhorn divergence which enjoys improved statistical and computational efficiency guarantees, under a condition on the regularity of the approximation error, which is in particular satisfied for Gaussian densities. We finally demonstrate the efficiency of the proposed estimators with numerical experiments.

OCOct 28, 2019
Sinkhorn Divergences for Unbalanced Optimal Transport

Thibault Séjourné, Jean Feydy, François-Xavier Vialard et al.

Optimal transport induces the Earth Mover's (Wasserstein) distance between probability distributions, a geometric divergence that is relevant to a wide range of problems. Over the last decade, two relaxations of optimal transport have been studied in depth: unbalanced transport, which is robust to the presence of outliers and can be used when distributions don't have the same total mass; entropy-regularized transport, which is robust to sampling noise and lends itself to fast computations using the Sinkhorn algorithm. This paper combines both lines of work to put robust optimal transport on solid ground. Our main contribution is a generalization of the Sinkhorn algorithm to unbalanced transport: our method alternates between the standard Sinkhorn updates and the pointwise application of a contractive function. This implies that entropic transport solvers on grid images, point clouds and sampled distributions can all be modified easily to support unbalanced transport, with a proof of linear convergence that holds in all settings. We then show how to use this method to define pseudo-distances on the full space of positive measures that satisfy key geometric axioms: (unbalanced) Sinkhorn divergences are differentiable, positive, definite, convex, statistically robust and avoid any "entropic bias" towards a shrinkage of the measures' supports.

NAJun 16, 2017
Optimal Transport for Diffeomorphic Registration

Jean Feydy, Benjamin Charlier, François-Xavier Vialard et al.

This paper introduces the use of unbalanced optimal transport methods as a similarity measure for diffeomorphic matching of imaging data. The similarity measure is a key object in diffeomorphic registration methods that, together with the regularization on the deformation, defines the optimal deformation. Most often, these similarity measures are local or non local but simple enough to be computationally fast. We build on recent theoretical and numerical advances in optimal transport to propose fast and global similarity measures that can be used on surfaces or volumetric imaging data. This new similarity measure is computed using a fast generalized Sinkhorn algorithm. We apply this new metric in the LDDMM framework on synthetic and real data, fibres bundles and surfaces and show that better matching results are obtained.