NAJun 24, 2011
The Cosparse Analysis Model and AlgorithmsSangnam Nam, Mike E. Davies, Michael Elad et al.
After a decade of extensive study of the sparse representation synthesis model, we can safely say that this is a mature and stable field, with clear theoretical foundations, and appealing applications. Alongside this approach, there is an analysis counterpart model, which, despite its similarity to the synthesis alternative, is markedly different. Surprisingly, the analysis model did not get a similar attention, and its understanding today is shallow and partial. In this paper we take a closer look at the analysis approach, better define it as a generative model for signals, and contrast it with the synthesis one. This work proposes effective pursuit methods that aim to solve inverse problems regularized with the analysis-model prior, accompanied by a preliminary theoretical study of their performance. We demonstrate the effectiveness of the analysis model in several experiments.
LGJun 30, 2023
Abide by the Law and Follow the Flow: Conservation Laws for Gradient FlowsSibylle Marcotte, Rémi Gribonval, Gabriel Peyré
Understanding the geometric properties of gradient descent dynamics is a key ingredient in deciphering the recent success of very large machine learning models. A striking observation is that trained over-parameterized models retain some properties of the optimization initialization. This "implicit bias" is believed to be responsible for some favorable properties of the trained models and could explain their good generalization properties. The purpose of this article is threefold. First, we rigorously expose the definition and basic properties of "conservation laws", that define quantities conserved during gradient flows of a given model (e.g. of a ReLU network with a given architecture) with any training data and any loss. Then we explain how to find the maximal number of independent conservation laws by performing finite-dimensional algebraic manipulations on the Lie algebra generated by the Jacobian of the model. Finally, we provide algorithms to: a) compute a family of polynomial laws; b) compute the maximal number of (not necessarily polynomial) independent conservation laws. We provide showcase examples that we fully work out theoretically. Besides, applying the two algorithms confirms for a number of ReLU network architectures that all known laws are recovered by the algorithm, and that there are no other independent laws. Such computational tools pave the way to understanding desirable properties of optimization initialization in large machine learning models.
LGOct 5, 2022
On the Statistical Complexity of Estimation and Testing under Privacy ConstraintsClément Lalanne, Aurélien Garivier, Rémi Gribonval
The challenge of producing accurate statistics while respecting the privacy of the individuals in a sample is an important area of research. We study minimax lower bounds for classes of differentially private estimators. In particular, we show how to characterize the power of a statistical test under differential privacy in a plug-and-play fashion by solving an appropriate transport problem. With specific coupling constructions, this observation allows us to derive Le Cam-type and Fano-type inequalities not only for regular definitions of differential privacy but also for those based on Renyi divergence. We then proceed to illustrate our results on three simple, fully worked out examples. In particular, we show that the problem class has a huge importance on the provable degradation of utility due to privacy. In certain scenarios, we show that maintaining privacy results in a noticeable reduction in performance only when the level of privacy protection is very high. Conversely, for other problems, even a modest level of privacy protection can lead to a significant decrease in performance. Finally, we demonstrate that the DP-SGLD algorithm, a private convex solver, can be employed for maximum likelihood estimation with a high degree of confidence, as it provides near-optimal results with respect to both the size of the sample and the level of privacy protection. This algorithm is applicable to a broad range of parametric estimation procedures, including exponential families.
MLFeb 14, 2023
Private Statistical Estimation of Many QuantilesClément Lalanne, Aurélien Garivier, Rémi Gribonval
This work studies the estimation of many statistical quantiles under differential privacy. More precisely, given a distribution and access to i.i.d. samples from it, we study the estimation of the inverse of its cumulative distribution function (the quantile function) at specific points. For instance, this task is of key importance in private data generation. We present two different approaches. The first one consists in privately estimating the empirical quantiles of the samples and using this result as an estimator of the quantiles of the distribution. In particular, we study the statistical properties of the recently published algorithm introduced by Kaplan et al. 2022 that privately estimates the quantiles recursively. The second approach is to use techniques of density estimation in order to uniformly estimate the quantile function on an interval. In particular, we show that there is a tradeoff between the two methods. When we want to estimate many quantiles, it is better to estimate the density rather than estimating the quantile function at specific points.
NANov 6, 2017
Analyzing the Approximation Error of the Fast Graph Fourier TransformLuc LeMagoarou, Nicolas Tremblay, Rémi Gribonval
The graph Fourier transform (GFT) is in general dense and requires O(n^2) time to compute and O(n^2) memory space to store. In this paper, we pursue our previous work on the approximate fast graph Fourier transform (FGFT). The FGFT is computed via a truncated Jacobi algorithm, and is defined as the product of J Givens rotations (very sparse orthogonal matrices). The truncation parameter, J, represents a trade-off between precision of the transform and time of computation (and storage space). We explore further this trade-off and study, on different types of graphs, how is the approximation error distributed along the spectrum.
CVJul 28, 2022
Self-supervised learning with rotation-invariant kernelsLéon Zheng, Gilles Puy, Elisa Riccietti et al.
We introduce a regularization loss based on kernel mean embeddings with rotation-invariant kernels on the hypersphere (also known as dot-product kernels) for self-supervised learning of image representations. Besides being fully competitive with the state of the art, our method significantly reduces time and memory complexity for self-supervised training, making it implementable for very large embedding dimensions on existing devices and more easily adjustable than previous methods to settings with limited resources. Our work follows the major paradigm where the model learns to be invariant to some predefined image transformations (cropping, blurring, color jittering, etc.), while avoiding a degenerate solution by regularizing the embedding distribution. Our particular contribution is to propose a loss family promoting the embedding distribution to be close to the uniform distribution on the hypersphere, with respect to the maximum mean discrepancy pseudometric. We demonstrate that this family encompasses several regularizers of former methods, including uniformity-based and information-maximization methods, which are variants of our flexible regularization loss with different kernels. Beyond its practical consequences for state-of-the-art self-supervised learning with limited resources, the proposed generic regularization approach opens perspectives to leverage more widely the literature on kernel methods in order to improve self-supervised learning methods.
LGJun 13, 2022
Compressive Clustering with an Optical Processing UnitLuc Giffon, Rémi Gribonval
We explore the use of Optical Processing Units (OPU) to compute random Fourier features for sketching, and adapt the overall compressive clustering pipeline to this setting. We also propose some tools to help tuning a critical hyper-parameter of compressive clustering.
AIJun 26, 2023
About the Cost of Central Privacy in Density EstimationClément Lalanne, Aurélien Garivier, Rémi Gribonval
We study non-parametric density estimation for densities in Lipschitz and Sobolev spaces, and under central privacy. In particular, we investigate regimes where the privacy budget is not supposed to be constant. We consider the classical definition of central differential privacy, but also the more recent notion of central concentrated differential privacy. We recover the result of Barber and Duchi (2014) stating that histogram estimators are optimal against Lipschitz distributions for the L2 risk, and under regular differential privacy, and we extend it to other norms and notions of privacy. Then, we investigate higher degrees of smoothness, drawing two conclusions: First, and contrary to what happens with constant privacy budget (Wasserman and Zhou, 2010), there are regimes where imposing privacy degrades the regular minimax risk of estimation on Sobolev densities. Second, so-called projection estimators are near-optimal against the same classes of densities in this new setup with pure differential privacy, but contrary to the constant privacy budget case, it comes at the cost of relaxation. With zero concentrated differential privacy, there is no need for relaxation, and we prove that the estimation is optimal.
MLOct 2, 2023
A path-norm toolkit for modern networks: consequences, promises and challengesAntoine Gonon, Nicolas Brisebarre, Elisa Riccietti et al.
This work introduces the first toolkit around path-norms that fully encompasses general DAG ReLU networks with biases, skip connections and any operation based on the extraction of order statistics: max pooling, GroupSort etc. This toolkit notably allows us to establish generalization bounds for modern neural networks that are not only the most widely applicable path-norm based ones, but also recover or beat the sharpest known bounds of this type. These extended path-norms further enjoy the usual benefits of path-norms: ease of computation, invariance under the symmetries of the network, and improved sharpness on layered fully-connected networks compared to the product of operator norms, another complexity measure most commonly used. The versatility of the toolkit and its ease of implementation allow us to challenge the concrete promises of path-norm-based generalization bounds, by numerically evaluating the sharpest known bounds for ResNets on ImageNet.
MLFeb 23
Path-conditioned training: a principled way to rescale ReLU neural networksArthur Lebeurrier, Titouan Vayer, Rémi Gribonval
Despite recent algorithmic advances, we still lack principled ways to leverage the well-documented rescaling symmetries in ReLU neural network parameters. While two properly rescaled weights implement the same function, the training dynamics can be dramatically different. To offer a fresh perspective on exploiting this phenomenon, we build on the recent path-lifting framework, which provides a compact factorization of ReLU networks. We introduce a geometrically motivated criterion to rescale neural network parameters which minimization leads to a conditioning strategy that aligns a kernel in the path-lifting space with a chosen reference. We derive an efficient algorithm to perform this alignment. In the context of random network initialization, we analyze how the architecture and the initialization scale jointly impact the output of the proposed method. Numerical experiments illustrate its potential to speed up training.
MLNov 8, 2023
Compressive Recovery of Sparse Precision MatricesTitouan Vayer, Etienne Lasalle, Rémi Gribonval et al.
We consider the problem of learning a graph modeling the statistical relations of the $d$ variables from a dataset with $n$ samples $X \in \mathbb{R}^{n \times d}$. Standard approaches amount to searching for a precision matrix $Θ$ representative of a Gaussian graphical model that adequately explains the data. However, most maximum likelihood-based estimators usually require storing the $d^{2}$ values of the empirical covariance matrix, which can become prohibitive in a high-dimensional setting. In this work, we adopt a compressive viewpoint and aim to estimate a sparse $Θ$ from a \emph{sketch} of the data, i.e. a low-dimensional vector of size $m \ll d^{2}$ carefully designed from $X$ using non-linear random features. Under certain assumptions on the spectrum of $Θ$ (or its condition number), we show that it is possible to estimate it from a sketch of size $m=Ω\left((d+2k)\log(d)\right)$ where $k$ is the maximal number of edges of the underlying graph. These information-theoretic guarantees are inspired by compressed sensing theory and involve restricted isometry properties and instance optimal decoders. We investigate the possibility of achieving practical recovery with an iterative algorithm based on the graphical lasso, viewed as a specific denoiser. We compare our approach and graphical lasso on synthetic datasets, demonstrating its favorable performance even when the dataset is compressed.
LGMay 21, 2024
Keep the Momentum: Conservation Laws beyond Euclidean Gradient FlowsSibylle Marcotte, Rémi Gribonval, Gabriel Peyré
Conservation laws are well-established in the context of Euclidean gradient flow dynamics, notably for linear or ReLU neural network training. Yet, their existence and principles for non-Euclidean geometries and momentum-based dynamics remain largely unknown. In this paper, we characterize "all" conservation laws in this general setting. In stark contrast to the case of gradient flows, we prove that the conservation laws for momentum-based dynamics exhibit temporal dependence. Additionally, we often observe a "conservation loss" when transitioning from gradient flow to momentum dynamics. Specifically, for linear networks, our framework allows us to identify all momentum conservation laws, which are less numerous than in the gradient flow case except in sufficiently over-parameterized regimes. With ReLU networks, no conservation law remains. This phenomenon also manifests in non-Euclidean metrics, used e.g. for Nonnegative Matrix Factorization (NMF): all conservation laws can be determined in the gradient flow context, yet none persists in the momentum case.
MLDec 9, 2023
Revisiting RIP guarantees for sketching operators on mixture modelsAyoub Belhadji, Rémi Gribonval
In the context of sketching for compressive mixture modeling, we revisit existing proofs of the Restricted Isometry Property of sketching operators with respect to certain mixtures models. After examining the shortcomings of existing guarantees, we propose an alternative analysis that circumvents the need to assume importance sampling when drawing random Fourier features to build random sketching operators. Our analysis is based on new deterministic bounds on the restricted isometry constant that depend solely on the set of frequencies used to define the sketching operator; then we leverage these bounds to establish concentration inequalities for random sketching operators that lead to the desired RIP guarantees. Our analysis also opens the door to theoretical guarantees for structured sketching with frequencies associated to fast random linear operators.
CVMar 6
Training Flow Matching: The Role of Weighting and ParameterizationAnne Gagneux, Ségolène Martin, Rémi Gribonval et al.
We study the training objectives of denoising-based generative models, with a particular focus on loss weighting and output parameterization, including noise-, clean image-, and velocity-based formulations. Through a systematic numerical study, we analyze how these training choices interact with the intrinsic dimensionality of the data manifold, model architecture, and dataset size. Our experiments span synthetic datasets with controlled geometry as well as image data, and compare training objectives using quantitative metrics for denoising accuracy (PSNR across noise levels) and generative quality (FID). Rather than proposing a new method, our goal is to disentangle the various factors that matter when training a flow matching model, in order to provide practical insights on design choices.
LGJan 6, 2025
Convexity in ReLU Neural Networks: beyond ICNNs?Anne Gagneux, Mathurin Massias, Emmanuel Soubies et al.
Convex functions and their gradients play a critical role in mathematical imaging, from proximal optimization to Optimal Transport. The successes of deep learning has led many to use learning-based methods, where fixed functions or operators are replaced by learned neural networks. Regardless of their empirical superiority, establishing rigorous guarantees for these methods often requires to impose structural constraints on neural architectures, in particular convexity. The most popular way to do so is to use so-called Input Convex Neural Networks (ICNNs). In order to explore the expressivity of ICNNs, we provide necessary and sufficient conditions for a ReLU neural network to be convex. Such characterizations are based on product of weights and activations, and write nicely for any architecture in the path-lifting framework. As particular applications, we study our characterizations in depth for 1 and 2-hidden-layer neural networks: we show that every convex function implemented by a 1-hidden-layer ReLU network can be also expressed by an ICNN with the same architecture; however this property no longer holds with more layers. Finally, we provide a numerical procedure that allows an exact check of convexity for ReLU neural networks with a large number of affine regions.
LGDec 15, 2023
Sketch and shift: a robust decoder for compressive clusteringAyoub Belhadji, Rémi Gribonval
Compressive learning is an emerging approach to drastically reduce the memory footprint of large-scale learning, by first summarizing a large dataset into a low-dimensional sketch vector, and then decoding from this sketch the latent information needed for learning. In light of recent progress on information preservation guarantees for sketches based on random features, a major objective is to design easy-to-tune algorithms (called decoders) to robustly and efficiently extract this information. To address the underlying non-convex optimization problems, various heuristics have been proposed. In the case of compressive clustering, the standard heuristic is CL-OMPR, a variant of sliding Frank-Wolfe. Yet, CL-OMPR is hard to tune, and the examination of its robustness was overlooked. In this work, we undertake a scrutinized examination of CL-OMPR to circumvent its limitations. In particular, we show how this algorithm can fail to recover the clusters even in advantageous scenarios. To gain insight, we show how the deficiencies of this algorithm can be attributed to optimization difficulties related to the structure of a correlation function appearing at core steps of the algorithm. To address these limitations, we propose an alternative decoder offering substantial improvements over CL-OMPR. Its design is notably inspired from the mean shift algorithm, a classic approach to detect the local maxima of kernel density estimators. The proposed algorithm can extract clustering information from a sketch of the MNIST dataset that is 10 times smaller than previously.
LGMay 23, 2024
A Rescaling-Invariant Lipschitz Bound Based on Path-Metrics for Modern ReLU Network ParameterizationsAntoine Gonon, Nicolas Brisebarre, Elisa Riccietti et al.
Robustness with respect to weight perturbations underpins guarantees for generalization, pruning and quantization. Existing guarantees rely on Lipschitz bounds in parameter space, cover only plain feed-forward MLPs, and break under the ubiquitous neuron-wise rescaling symmetry of ReLU networks. We prove a new Lipschitz inequality expressed through the $\ell^1$-path-metric of the weights. The bound is (i) rescaling-invariant by construction and (ii) applies to any ReLU-DAG architecture with any combination of convolutions, skip connections, pooling, and frozen (inference-time) batch-normalization -- thus encompassing ResNets, U-Nets, VGG-style CNNs, and more. By respecting the network's natural symmetries, the new bound strictly sharpens prior parameter-space bounds and can be computed in two forward passes. To illustrate its utility, we derive from it a symmetry-aware pruning criterion and show -- through a proof-of-concept experiment on a ResNet-18 trained on ImageNet -- that its pruning performance matches that of classical magnitude pruning, while becoming totally immune to arbitrary neuron-wise rescalings.
LGAug 10, 2025
Intrinsic training dynamics of deep neural networksSibylle Marcotte, Gabriel Peyré, Rémi Gribonval
A fundamental challenge in the theory of deep learning is to understand whether gradient-based training in high-dimensional parameter spaces can be captured by simpler, lower-dimensional structures, leading to so-called implicit bias. As a stepping stone, we study when a gradient flow on a high-dimensional variable $θ$ implies an intrinsic gradient flow on a lower-dimensional variable $z = φ(θ)$, for an architecture-related function $φ$. We express a so-called intrinsic dynamic property and show how it is related to the study of conservation laws associated with the factorization $φ$. This leads to a simple criterion based on the inclusion of kernels of linear maps which yields a necessary condition for this property to hold. We then apply our theory to general ReLU networks of arbitrary depth and show that, for any initialization, it is possible to rewrite the flow as an intrinsic dynamic in a lower dimension that depends only on $z$ and the initialization, when $φ$ is the so-called path-lifting. In the case of linear networks with $φ$ the product of weight matrices, so-called balanced initializations are also known to enable such a dimensionality reduction; we generalize this result to a broader class of {\em relaxed balanced} initializations, showing that, in certain configurations, these are the \emph{only} initializations that ensure the intrinsic dynamic property. Finally, for the linear neural ODE associated with the limit of infinitely deep linear networks, with relaxed balanced initialization, we explicitly express the corresponding intrinsic dynamics.
LGDec 18, 2024
PASCO (PArallel Structured COarsening): an overlay to speed up graph clustering algorithmsEtienne Lasalle, Rémi Vaudaine, Titouan Vayer et al.
Clustering the nodes of a graph is a cornerstone of graph analysis and has been extensively studied. However, some popular methods are not suitable for very large graphs: e.g., spectral clustering requires the computation of the spectral decomposition of the Laplacian matrix, which is not applicable for large graphs with a large number of communities. This work introduces PASCO, an overlay that accelerates clustering algorithms. Our method consists of three steps: 1-We compute several independent small graphs representing the input graph by applying an efficient and structure-preserving coarsening algorithm. 2-A clustering algorithm is run in parallel onto each small graph and provides several partitions of the initial graph. 3-These partitions are aligned and combined with an optimal transport method to output the final partition. The PASCO framework is based on two key contributions: a novel global algorithm structure designed to enable parallelization and a fast, empirically validated graph coarsening algorithm that preserves structural properties. We demonstrate the strong performance of 1 PASCO in terms of computational efficiency, structural preservation, and output partition quality, evaluated on both synthetic and real-world graph datasets.
CVOct 28, 2025
The Generation Phases of Flow Matching: a Denoising PerspectiveAnne Gagneux, Ségolène Martin, Rémi Gribonval et al.
Flow matching has achieved remarkable success, yet the factors influencing the quality of its generation process remain poorly understood. In this work, we adopt a denoising perspective and design a framework to empirically probe the generation process. Laying down the formal connections between flow matching models and denoisers, we provide a common ground to compare their performances on generation and denoising. This enables the design of principled and controlled perturbations to influence sample generation: noise and drift. This leads to new insights on the distinct dynamical phases of the generative process, enabling us to precisely characterize at which stage of the generative process denoisers succeed or fail and why this matters.
MLSep 30, 2025
Non-Vacuous Generalization Bounds: Can Rescaling Invariances Help?Damien Rouchouse, Antoine Gonon, Rémi Gribonval et al.
A central challenge in understanding generalization is to obtain non-vacuous guarantees that go beyond worst-case complexity over data or weight space. Among existing approaches, PAC-Bayes bounds stand out as they can provide tight, data-dependent guarantees even for large networks. However, in ReLU networks, rescaling invariances mean that different weight distributions can represent the same function while leading to arbitrarily different PAC-Bayes complexities. We propose to study PAC-Bayes bounds in an invariant, lifted representation that resolves this discrepancy. This paper explores both the guarantees provided by this approach (invariance, tighter bounds via data processing) and the algorithmic aspects of KL-based rescaling-invariant PAC-Bayes bounds.
MLFeb 15, 2022
Private Quantiles Estimation in the Presence of AtomsClément Sébastien Lalanne, Clément Gastaud, Nicolas Grislain et al.
We consider the differentially private estimation of multiple quantiles (MQ) of a distribution from a dataset, a key building block in modern data analysis. We apply the recent non-smoothed Inverse Sensitivity (IS) mechanism to this specific problem. We establish that the resulting method is closely related to the recently published ad hoc algorithm JointExp. In particular, they share the same computational complexity and a similar efficiency. We prove the statistical consistency of these two algorithms for continuous distributions. Furthermore, we demonstrate both theoretically and empirically that this method suffers from an important lack of performance in the case of peaked distributions, which can degrade up to a potentially catastrophic impact in the presence of atoms. Its smoothed version (i.e. by applying a max kernel to its output density) would solve this problem, but remains an open challenge to implement. As a proxy, we propose a simple and numerically efficient method called Heuristically Smoothed JointExp (HSJointExp), which is endowed with performance guarantees for a broad class of distributions and achieves results that are orders of magnitude better on problematic datasets.
MLDec 1, 2021
Controlling Wasserstein Distances by Kernel Norms with Application to Compressive Statistical LearningTitouan Vayer, Rémi Gribonval
Comparing probability distributions is at the crux of many machine learning algorithms. Maximum Mean Discrepancies (MMD) and Wasserstein distances are two classes of distances between probability distributions that have attracted abundant attention in past years. This paper establishes some conditions under which the Wasserstein distance can be controlled by MMD norms. Our work is motivated by the compressive statistical learning (CSL) theory, a general framework for resource-efficient large scale learning in which the training data is summarized in a single vector (called sketch) that captures the information relevant to the considered learning task. Inspired by existing results in CSL, we introduce the Hölder Lower Restricted Isometric Property and show that this property comes with interesting guarantees for compressive statistical learning. Based on the relations between the MMD and the Wasserstein distances, we provide guarantees for compressive statistical learning by introducing and studying the concept of Wasserstein regularity of the learning task, that is when some task-specific metric between probability distributions can be bounded by a Wasserstein distance.
LGOct 4, 2021
Identifiability in Two-Layer Sparse Matrix FactorizationLéon Zheng, Elisa Riccietti, Rémi Gribonval
Sparse matrix factorization is the problem of approximating a matrix $\mathbf{Z}$ by a product of $J$ sparse factors $\mathbf{X}^{(J)} \mathbf{X}^{(J-1)} \ldots \mathbf{X}^{(1)}$. This paper focuses on identifiability issues that appear in this problem, in view of better understanding under which sparsity constraints the problem is well-posed. We give conditions under which the problem of factorizing a matrix into \emph{two} sparse factors admits a unique solution, up to unavoidable permutation and scaling equivalences. Our general framework considers an arbitrary family of prescribed sparsity patterns, allowing us to capture more structured notions of sparsity than simply the count of nonzero entries. These conditions are shown to be related to essential uniqueness of exact matrix decomposition into a sum of rank-one matrices, with structured sparsity constraints. In particular, in the case of fixed-support sparse matrix factorization, we give a general sufficient condition for identifiability based on rank-one matrix completability, and we derive from it a completion algorithm that can verify if this sufficient condition is satisfied, and recover the entries in the two sparse factors if this is the case. A companion paper further exploits these conditions to derive identifiability properties and theoretically sound factorization methods for multi-layer sparse matrix factorization with support constraints associated to some well-known fast transforms such as the Hadamard or the Discrete Fourier Transforms.
LGOct 4, 2021
Efficient Identification of Butterfly Sparse Matrix FactorizationsLéon Zheng, Elisa Riccietti, Rémi Gribonval
Fast transforms correspond to factorizations of the form $\mathbf{Z} = \mathbf{X}^{(1)} \ldots \mathbf{X}^{(J)}$, where each factor $ \mathbf{X}^{(\ell)}$ is sparse and possibly structured. This paper investigates essential uniqueness of such factorizations, i.e., uniqueness up to unavoidable scaling ambiguities. Our main contribution is to prove that any $N \times N$ matrix having the so-called butterfly structure admits an essentially unique factorization into $J$ butterfly factors (where $N = 2^{J}$), and that the factors can be recovered by a hierarchical factorization method, which consists in recursively factorizing the considered matrix into two factors. This hierarchical identifiability property relies on a simple identifiability condition in the two-layer and fixed-support setting. This approach contrasts with existing ones that fit the product of butterfly factors to a given matrix via gradient descent. The proposed method can be applied in particular to retrieve the factorization of the Hadamard or the discrete Fourier transform matrices of size $N=2^J$. Computing such factorizations costs $\mathcal{O}(N^{2})$, which is of the order of dense matrix-vector multiplication, while the obtained factorizations enable fast $\mathcal{O}(N \log N)$ matrix-vector multiplications and have the potential to be applied to compress deep neural networks.
LGJul 20, 2021
An Embedding of ReLU Networks and an Analysis of their IdentifiabilityPierre Stock, Rémi Gribonval
Neural networks with the Rectified Linear Unit (ReLU) nonlinearity are described by a vector of parameters $θ$, and realized as a piecewise linear continuous function $R_θ: x \in \mathbb R^{d} \mapsto R_θ(x) \in \mathbb R^{k}$. Natural scalings and permutations operations on the parameters $θ$ leave the realization unchanged, leading to equivalence classes of parameters that yield the same realization. These considerations in turn lead to the notion of identifiability -- the ability to recover (the equivalence class of) $θ$ from the sole knowledge of its realization $R_θ$. The overall objective of this paper is to introduce an embedding for ReLU neural networks of any depth, $Φ(θ)$, that is invariant to scalings and that provides a locally linear parameterization of the realization of the network. Leveraging these two key properties, we derive some conditions under which a deep ReLU network is indeed locally identifiable from the knowledge of the realization on a finite set of samples $x_{i} \in \mathbb R^{d}$. We study the shallow case in more depth, establishing necessary and sufficient conditions for the network to be identifiable from a bounded subset $\mathcal X \subseteq \mathbb R^{d}$.
MLJan 5, 2021
Minibatch optimal transport distances; analysis and applicationsKilian Fatras, Younes Zine, Szymon Majewski et al.
Optimal transport distances have become a classic tool to compare probability distributions and have found many applications in machine learning. Yet, despite recent algorithmic developments, their complexity prevents their direct use on large scale datasets. To overcome this challenge, a common workaround is to compute these distances on minibatches i.e. to average the outcome of several smaller optimal transport problems. We propose in this paper an extended analysis of this practice, which effects were previously studied in restricted cases. We first consider a large variety of Optimal Transport kernels. We notably argue that the minibatch strategy comes with appealing properties such as unbiased estimators, gradients and a concentration bound around the expectation, but also with limits: the minibatch OT is not a distance. To recover some of the lost distance axioms, we introduce a debiased minibatch OT function and study its statistical and optimisation properties. Along with this theoretical analysis, we also conduct empirical experiments on gradient flows, generative adversarial networks (GANs) or color transfer that highlight the practical interest of this strategy.
MLAug 4, 2020
Sketching Datasets for Large-Scale Learning (long version)Rémi Gribonval, Antoine Chatalic, Nicolas Keriven et al.
This article considers "compressive learning," an approach to large-scale machine learning where datasets are massively compressed before learning (e.g., clustering, classification, or regression) is performed. In particular, a "sketch" is first constructed by computing carefully chosen nonlinear random features (e.g., random Fourier features) and averaging them over the whole dataset. Parameters are then learned from the sketch, without access to the original dataset. This article surveys the current state-of-the-art in compressive learning, including the main concepts and algorithms, their connections with established signal-processing methods, existing theoretical guarantees -- on both information preservation and privacy preservation, and important open problems.
SDMay 19, 2020
Sparsity-based audio declipping methods: selected overview, new algorithms, and large-scale evaluationClément Gaultier, Srđan Kitić, Rémi Gribonval et al.
Recent advances in audio declipping have substantially improved the state of the art.% in certain saturation regimes. Yet, practitioners need guidelines to choose a method, and while existing benchmarks have been instrumental in advancing the field, larger-scale experiments are needed to guide such choices. First, we show that the clipping levels in existing small-scale benchmarks are moderate and call for benchmarks with more perceptually significant clipping levels. We then propose a general algorithmic framework for declipping that covers existing and new combinations of variants of state-of-the-art techniques exploiting time-frequency sparsity: synthesis vs. analysis sparsity, with plain or structured sparsity. Finally, we systematically compare these combinations and a selection of state-of-the-art methods. Using a large-scale numerical benchmark and a smaller scale formal listening test, we provide guidelines for various clipping levels, both for speech and various musical genres. The code is made publicly available for the purpose of reproducible research and benchmarking.
LGApr 17, 2020
Statistical Learning Guarantees for Compressive Clustering and Compressive Mixture ModelingRémi Gribonval, Gilles Blanchard, Nicolas Keriven et al.
We provide statistical learning guarantees for two unsupervised learning tasks in the context of compressive statistical learning, a general framework for resource-efficient large-scale learning that we introduced in a companion paper.The principle of compressive statistical learning is to compress a training collection, in one pass, into a low-dimensional sketch (a vector of random empirical generalized moments) that captures the information relevant to the considered learning task. We explicitly describe and analyze random feature functions which empirical averages preserve the needed information for compressive clustering and compressive Gaussian mixture modeling with fixed known variance, and establish sufficient sketch sizes given the problem dimensions.
MLOct 9, 2019
Learning with minibatch Wasserstein : asymptotic and gradient propertiesKilian Fatras, Younes Zine, Rémi Flamary et al.
Optimal transport distances are powerful tools to compare probability distributions and have found many applications in machine learning. Yet their algorithmic complexity prevents their direct use on large scale datasets. To overcome this challenge, practitioners compute these distances on minibatches {\em i.e.} they average the outcome of several smaller optimal transport problems. We propose in this paper an analysis of this practice, which effects are not well understood so far. We notably argue that it is equivalent to an implicit regularization of the original problem, with appealing properties such as unbiased estimators, gradients and a concentration bound around the expectation, but also with defects such as loss of distance property. Along with this theoretical analysis, we also conduct empirical experiments on gradient flows, GANs or color transfer that highlight the practical interest of this strategy.
CVJul 12, 2019
And the Bit Goes Down: Revisiting the Quantization of Neural NetworksPierre Stock, Armand Joulin, Rémi Gribonval et al.
In this paper, we address the problem of reducing the memory footprint of convolutional network architectures. We introduce a vector quantization method that aims at preserving the quality of the reconstruction of the network outputs rather than its weights. The principle of our approach is that it minimizes the loss reconstruction error for in-domain inputs. Our method only requires a set of unlabelled data at quantization time and allows for efficient inference on CPU by using byte-aligned codebooks to store the compressed weights. We validate our approach by quantizing a high performing ResNet-50 model to a memory size of 5MB (20x compression factor) while preserving a top-1 accuracy of 76.1% on ImageNet object classification and by compressing a Mask R-CNN with a 26x factor.
LGJul 3, 2019
Don't take it lightly: Phasing optical random projections with unknown operatorsSidharth Gupta, Rémi Gribonval, Laurent Daudet et al.
In this paper we tackle the problem of recovering the phase of complex linear measurements when only magnitude information is available and we control the input. We are motivated by the recent development of dedicated optics-based hardware for rapid random projections which leverages the propagation of light in random media. A signal of interest $\mathbfξ \in \mathbb{R}^N$ is mixed by a random scattering medium to compute the projection $\mathbf{y} = \mathbf{A} \mathbfξ$, with $\mathbf{A} \in \mathbb{C}^{M \times N}$ being a realization of a standard complex Gaussian iid random matrix. Such optics-based matrix multiplications can be much faster and energy-efficient than their CPU or GPU counterparts, yet two difficulties must be resolved: only the intensity ${|\mathbf{y}|}^2$ can be recorded by the camera, and the transmission matrix $\mathbf{A}$ is unknown. We show that even without knowing $\mathbf{A}$, we can recover the unknown phase of $\mathbf{y}$ for some equivalent transmission matrix with the same distribution as $\mathbf{A}$. Our method is based on two observations: first, conjugating or changing the phase of any row of $\mathbf{A}$ does not change its distribution; and second, since we control the input we can interfere $\mathbfξ$ with arbitrary reference signals. We show how to leverage these observations to cast the measurement phase retrieval problem as a Euclidean distance geometry problem. We demonstrate appealing properties of the proposed algorithm in both numerical simulations and real hardware experiments. Not only does our algorithm accurately recover the missing phase, but it mitigates the effects of quantization and the sensitivity threshold, thus improving the measured magnitudes.
FAMay 3, 2019
Approximation spaces of deep neural networksRémi Gribonval, Gitta Kutyniok, Morten Nielsen et al.
We study the expressivity of deep neural networks. Measuring a network's complexity by its number of connections or by its number of neurons, we consider the class of functions for which the error of best approximation with networks of a given complexity decays at a certain rate when increasing the complexity budget. Using results from classical approximation theory, we show that this class can be endowed with a (quasi)-norm that makes it a linear function space, called approximation space. We establish that allowing the networks to have certain types of "skip connections" does not change the resulting approximation spaces. We also discuss the role of the network's nonlinearity (also known as activation function) on the resulting spaces, as well as the role of depth. For the popular ReLU nonlinearity and its powers, we relate the newly constructed spaces to classical Besov spaces. The established embeddings highlight that some functions of very low Besov smoothness can nevertheless be well approximated by neural networks, if these networks are sufficiently deep.
CVFeb 27, 2019
Equi-normalization of Neural NetworksPierre Stock, Benjamin Graham, Rémi Gribonval et al.
Modern neural networks are over-parametrized. In particular, each rectified linear hidden unit can be modified by a multiplicative factor by adjusting input and output weights, without changing the rest of the network. Inspired by the Sinkhorn-Knopp algorithm, we introduce a fast iterative method for minimizing the L2 norm of the weights, equivalently the weight decay regularizer. It provably converges to a unique solution. Interleaving our algorithm with SGD during training improves the test accuracy. For small batches, our approach offers an alternative to batch-and group-normalization on CIFAR-10 and ImageNet with a ResNet-18.
LGDec 17, 2018
Stable safe screening and structured dictionaries for faster L1 regularizationCassio Fraga Dantas, Rémi Gribonval
In this paper, we propose a way to combine two acceleration techniques for the $\ell\_{1}$-regularized least squares problem: safe screening tests, which allow to eliminate useless dictionary atoms; and the use of fast structured approximations of the dictionary matrix. To do so, we introduce a new family of screening tests, termed stable screening, which can cope with approximation errors on the dictionary atoms while keeping the safety of the test (i.e. zero risk of rejecting atoms belonging to the solution support). Some of the main existing screening tests are extended to this new framework. The proposed algorithm consists in using a coarser (but faster) approximation of the dictionary at the initial iterations and then switching to better approximations until eventually adopting the original dictionary. A systematic switching criterion based on the duality gap saturation and the screening ratio is derived.Simulation results show significant reductions in both computational complexity and execution times for a wide range of tested scenarios.
SDOct 31, 2018
MULAN: A Blind and Off-Grid Method for Multichannel Echo RetrievalHelena Peic Tukuljac, Antoine Deleforge, Rémi Gribonval
This paper addresses the general problem of blind echo retrieval, i.e., given M sensors measuring in the discrete-time domain M mixtures of K delayed and attenuated copies of an unknown source signal, can the echo locations and weights be recovered? This problem has broad applications in fields such as sonars, seismol-ogy, ultrasounds or room acoustics. It belongs to the broader class of blind channel identification problems, which have been intensively studied in signal processing. Existing methods in the literature proceed in two steps: (i) blind estimation of sparse discrete-time filters and (ii) echo information retrieval by peak-picking on filters. The precision of these methods is fundamentally limited by the rate at which the signals are sampled: estimated echo locations are necessary on-grid, and since true locations never match the sampling grid, the weight estimation precision is impacted. This is the so-called basis-mismatch problem in compressed sensing. We propose a radically different approach to the problem, building on the framework of finite-rate-of-innovation sampling. The approach operates directly in the parameter-space of echo locations and weights, and enables near-exact blind and off-grid echo retrieval from discrete-time measurements. It is shown to outperform conventional methods by several orders of magnitude in precision.
ITFeb 27, 2018
Instance Optimal Decoding and the Restricted Isometry PropertyNicolas Keriven, Rémi Gribonval
In this paper, we address the question of information preservation in ill-posed, non-linear inverse problems, assuming that the measured data is close to a low-dimensional model set. We provide necessary and sufficient conditions for the existence of a so-called instance optimal decoder, i.e., that is robust to noise and modelling error. Inspired by existing results in compressive sensing, our analysis is based on a (Lower) Restricted Isometry Property (LRIP), formulated in a non-linear fashion. We also provide sufficient conditions for non-uniform recovery with random measurement operators, with a new formulation of the LRIP. We finish by describing typical strategies to prove the LRIP in both linear and non-linear cases, and illustrate our results by studying the invertibility of a one-layer neural net with random weights.
CVDec 12, 2017
Learning a Complete Image Indexing PipelineHimalaya Jain, Joaquin Zepeda, Patrick Pérez et al.
To work at scale, a complete image indexing system comprises two components: An inverted file index to restrict the actual search to only a subset that should contain most of the items relevant to the query; An approximate distance computation mechanism to rapidly scan these lists. While supervised deep learning has recently enabled improvements to the latter, the former continues to be based on unsupervised clustering in the literature. In this work, we propose a first system that learns both components within a unifying neural framework of structured binary encoding.
SDNov 30, 2017
A modeling and algorithmic framework for (non)social (co)sparse audio restorationClément Gaultier, Nancy Bertin, Srđan Kitić et al.
We propose a unified modeling and algorithmic framework for audio restoration problem. It encompasses analysis sparse priors as well as more classical synthesis sparse priors, and regular sparsity as well as various forms of structured sparsity embodied by shrinkage operators (such as social shrinkage). The versatility of the framework is illustrated on two restoration scenarios: denoising, and declipping. Extensive experimental results on these scenarios highlight both the speedups of 20% or even more offered by the analysis sparse prior, and the substantial declipping quality that is achievable with both the social and the plain flavor. While both flavors overall exhibit similar performance, their detailed comparison displays distinct trends depending whether declipping or denoising is considered.
CVAug 9, 2017
SUBIC: A supervised, structured binary code for image searchHimalaya Jain, Joaquin Zepeda, Patrick Pérez et al.
For large-scale visual search, highly compressed yet meaningful representations of images are essential. Structured vector quantizers based on product quantization and its variants are usually employed to achieve such compression while minimizing the loss of accuracy. Yet, unlike binary hashing schemes, these unsupervised methods have not yet benefited from the supervision, end-to-end learning and novel architectures ushered in by the deep learning revolution. We hence propose herein a novel method to make deep convolutional neural networks produce supervised, compact, structured binary codes for visual search. Our method makes use of a novel block-softmax non-linearity and of batch-based entropy losses that together induce structure in the learned encodings. We show that our method outperforms state-of-the-art compact representations based on deep hashing or structured quantization in single and cross-domain category retrieval, instance retrieval and classification. We make our code and models publicly available online.
MLJun 22, 2017
Compressive Statistical Learning with Random Feature MomentsRémi Gribonval, Gilles Blanchard, Nicolas Keriven et al.
We describe a general framework -- compressive statistical learning -- for resource-efficient large-scale learning: the training collection is compressed in one pass into a low-dimensional sketch (a vector of random empirical generalized moments) that captures the information relevant to the considered learning task. A near-minimizer of the risk is computed from the sketch through the solution of a nonlinear least squares problem. We investigate sufficient sketch sizes to control the generalization error of this procedure. The framework is illustrated on compressive PCA, compressive clustering, and compressive Gaussian mixture Modeling with fixed known variance. The latter two are further developed in a companion paper.
NAJun 16, 2017
Approximate fast graph Fourier transforms via multi-layer sparse approximationsLuc Le Magoarou, Rémi Gribonval, Nicolas Tremblay
The Fast Fourier Transform (FFT) is an algorithm of paramount importance in signal processing as it allows to apply the Fourier transform in O(n log n) instead of O(n 2) arithmetic operations. Graph Signal Processing (GSP) is a recent research domain that generalizes classical signal processing tools, such as the Fourier transform, to situations where the signal domain is given by any arbitrary graph instead of a regular grid. Today, there is no method to rapidly apply graph Fourier transforms. We propose in this paper a method to obtain approximate graph Fourier transforms that can be applied rapidly and stored efficiently. It is based on a greedy approximate diagonalization of the graph Laplacian matrix, carried out using a modified version of the famous Jacobi eigenvalues algorithm. The method is described and analyzed in detail, and then applied to both synthetic and real graphs, showing its potential.
LGOct 27, 2016
Compressive K-meansNicolas Keriven, Nicolas Tremblay, Yann Traonmilin et al.
The Lloyd-Max algorithm is a classical approach to perform K-means clustering. Unfortunately, its cost becomes prohibitive as the training dataset grows large. We propose a compressive version of K-means (CKM), that estimates cluster centers from a sketch, i.e. from a drastically compressed representation of the training dataset. We demonstrate empirically that CKM performs similarly to Lloyd-Max, for a sketch size proportional to the number of cen-troids times the ambient dimension, and independent of the size of the original dataset. Given the sketch, the computational complexity of CKM is also independent of the size of the dataset. Unlike Lloyd-Max which requires several replicates, we further demonstrate that CKM is almost insensitive to initialization. For a large dataset of 10^7 data points, we show that CKM can run two orders of magnitude faster than five replicates of Lloyd-Max, with similar clustering performance on artificial data. Finally, CKM achieves lower classification errors on handwritten digits classification.
CVAug 10, 2016
Approximate search with quantized sparse representationsHimalaya Jain, Patrick Pérez, Rémi Gribonval et al.
This paper tackles the task of storing a large collection of vectors, such as visual descriptors, and of searching in it. To this end, we propose to approximate database vectors by constrained sparse coding, where possible atom weights are restricted to belong to a finite subset. This formulation encompasses, as particular cases, previous state-of-the-art methods such as product or residual quantization. As opposed to traditional sparse coding methods, quantized sparse coding includes memory usage as a design constraint, thereby allowing us to index a large collection such as the BIGANN billion-sized benchmark. Our experiments, carried out on standard benchmarks, show that our formulation leads to competitive solutions when considering different trade-offs between learning/coding time, index size and search quality.
LGJun 9, 2016
Sketching for Large-Scale Learning of Mixture ModelsNicolas Keriven, Anthony Bourrier, Rémi Gribonval et al.
Learning parameters from voluminous data can be prohibitive in terms of memory and computational requirements. We propose a "compressive learning" framework where we estimate model parameters from a sketch of the training data. This sketch is a collection of generalized moments of the underlying probability distribution of the data. It can be computed in a single pass on the training set, and is easily computable on streams or distributed datasets. The proposed framework shares similarities with compressive sensing, which aims at drastically reducing the dimension of high-dimensional signals while preserving the ability to reconstruct them. To perform the estimation task, we derive an iterative algorithm analogous to sparse reconstruction algorithms in the context of linear inverse problems. We exemplify our framework with the compressive estimation of a Gaussian Mixture Model (GMM), providing heuristics on the choice of the sketching procedure and theoretical guarantees of reconstruction. We experimentally show on synthetic data that the proposed algorithm yields results comparable to the classical Expectation-Maximization (EM) technique while requiring significantly less memory and fewer computations when the number of database elements is large. We further demonstrate the potential of the approach on real large-scale data (over 10 8 training samples) for the task of model-based speaker verification. Finally, we draw some connections between the proposed framework and approximate Hilbert space embedding of probability distributions using random features. We show that the proposed sketching operator can be seen as an innovative method to design translation-invariant kernels adapted to the analysis of GMMs. We also use this theoretical framework to derive information preservation guarantees, in the spirit of infinite-dimensional compressive sensing.
SINov 16, 2015
Random sampling of bandlimited signals on graphsGilles Puy, Nicolas Tremblay, Rémi Gribonval et al.
We study the problem of sampling k-bandlimited signals on graphs. We propose two sampling strategies that consist in selecting a small subset of nodes at random. The first strategy is non-adaptive, i.e., independent of the graph structure, and its performance depends on a parameter called the graph coherence. On the contrary, the second strategy is adaptive but yields optimal results. Indeed, no more than O(k log(k)) measurements are sufficient to ensure an accurate and stable recovery of all k-bandlimited signals. This second strategy is based on a careful choice of the sampling distribution, which can be estimated quickly. Then, we propose a computationally efficient decoder to reconstruct k-bandlimited signals from their samples. We prove that it yields accurate reconstructions and that it is also stable to noise. Finally, we conduct several experiments to test these techniques.
LGJun 24, 2015
Flexible Multi-layer Sparse Approximations of Matrices and ApplicationsLuc Le Magoarou, Rémi Gribonval
The computational cost of many signal processing and machine learning techniques is often dominated by the cost of applying certain linear operators to high-dimensional vectors. This paper introduces an algorithm aimed at reducing the complexity of applying linear operators in high dimension by approximately factorizing the corresponding matrix into few sparse factors. The approach relies on recent advances in non-convex optimization. It is first explained and analyzed in details and then demonstrated experimentally on various problems including dictionary learning for image denoising, and the approximation of large matrices arising in inverse problems.
SDJun 5, 2015
Sparsity and cosparsity for audio declipping: a flexible non-convex approachSrđan Kitić, Nancy Bertin, Rémi Gribonval
This work investigates the empirical performance of the sparse synthesis versus sparse analysis regularization for the ill-posed inverse problem of audio declipping. We develop a versatile non-convex heuristics which can be readily used with both data models. Based on this algorithm, we report that, in most cases, the two models perform almost similarly in terms of signal enhancement. However, the analysis version is shown to be amenable for real time audio processing, when certain analysis operators are considered. Both versions outperform state-of-the-art methods in the field, especially for the severely saturated signals.
LGMar 9, 2015
Learning Co-Sparse Analysis Operators with Separable StructuresMatthias Seibert, Julian Wörmann, Rémi Gribonval et al.
In the co-sparse analysis model a set of filters is applied to a signal out of the signal class of interest yielding sparse filter responses. As such, it may serve as a prior in inverse problems, or for structural analysis of signals that are known to belong to the signal class. The more the model is adapted to the class, the more reliable it is for these purposes. The task of learning such operators for a given class is therefore a crucial problem. In many applications, it is also required that the filter responses are obtained in a timely manner, which can be achieved by filters with a separable structure. Not only can operators of this sort be efficiently used for computing the filter responses, but they also have the advantage that less training samples are required to obtain a reliable estimate of the operator. The first contribution of this work is to give theoretical evidence for this claim by providing an upper bound for the sample complexity of the learning process. The second is a stochastic gradient descent (SGD) method designed to learn an analysis operator with separable structures, which includes a novel and efficient step size selection rule. Numerical experiments are provided that link the sample complexity to the convergence speed of the SGD algorithm.