LGJun 9, 2022
A general approximation lower bound in $L^p$ norm, with applications to feed-forward neural networksEl Mehdi Achour, Armand Foucault, Sébastien Gerchinovitz et al.
We study the fundamental limits to the expressive power of neural networks. Given two sets $F$, $G$ of real-valued functions, we first prove a general lower bound on how well functions in $F$ can be approximated in $L^p(μ)$ norm by functions in $G$, for any $p \geq 1$ and any probability measure $μ$. The lower bound depends on the packing number of $F$, the range of $F$, and the fat-shattering dimension of $G$. We then instantiate this bound to the case where $G$ corresponds to a piecewise-polynomial feed-forward neural network, and describe in details the application to two sets $F$: H{ö}lder balls and multivariate monotonic functions. Beside matching (known or new) upper bounds up to log factors, our lower bounds shed some light on the similarities or differences between approximation in $L^p$ norm or in sup norm, solving an open question by DeVore et al. (2021). Our proof strategy differs from the sup norm case and uses a key probability result of Mendelson (2002).
LGJan 31, 2023
Straight-Through meets Sparse Recovery: the Support Exploration AlgorithmMimoun Mohamed, François Malgouyres, Valentin Emiya et al.
The {\it straight-through estimator} (STE) is commonly used to optimize quantized neural networks, yet its contexts of effective performance are still unclear despite empirical successes.To make a step forward in this comprehension, we apply STE to a well-understood problem: {\it sparse support recovery}. We introduce the {\it Support Exploration Algorithm} (SEA), a novel algorithm promoting sparsity, and we analyze its performance in support recovery (a.k.a. model selection) problems. SEA explores more supports than the state-of-the-art, leading to superior performance in experiments, especially when the columns of $A$ are strongly coherent.The theoretical analysis considers recovery guarantees when the linear measurements matrix $A$ satisfies the {\it Restricted Isometry Property} (RIP).The sufficient conditions of recovery are comparable but more stringent than those of the state-of-the-art in sparse support recovery. Their significance lies mainly in their applicability to an instance of the STE.
LGJan 28, 2025
HadamRNN: Binary and Sparse Ternary Orthogonal RNNsArmand Foucault, Franck Mamalet, François Malgouyres
Binary and sparse ternary weights in neural networks enable faster computations and lighter representations, facilitating their use on edge devices with limited computational power. Meanwhile, vanilla RNNs are highly sensitive to changes in their recurrent weights, making the binarization and ternarization of these weights inherently challenging. To date, no method has successfully achieved binarization or ternarization of vanilla RNN weights. We present a new approach leveraging the properties of Hadamard matrices to parameterize a subset of binary and sparse ternary orthogonal matrices. This method enables the training of orthogonal RNNs (ORNNs) with binary and sparse ternary recurrent weights, effectively creating a specific class of binary and sparse ternary vanilla RNNs. The resulting ORNNs, called HadamRNN and Block-HadamRNN, are evaluated on benchmarks such as the copy task, permuted and sequential MNIST tasks, the IMDB dataset, two GLUE benchmarks, and two IoT benchmarks. Despite binarization or sparse ternarization, these RNNs maintain performance levels comparable to state-of-the-art full-precision models, highlighting the effectiveness of our approach. Notably, our approach is the first solution with binary recurrent weights capable of tackling the copy task over 1000 timesteps.
NEFeb 5, 2024
Quantized Approximately Orthogonal Recurrent Neural NetworksArmand Foucault, Franck Mamalet, François Malgouyres
In recent years, Orthogonal Recurrent Neural Networks (ORNNs) have gained popularity due to their ability to manage tasks involving long-term dependencies, such as the copy-task, and their linear complexity. However, existing ORNNs utilize full precision weights and activations, which prevents their deployment on compact devices.In this paper, we explore the quantization of the weight matrices in ORNNs, leading to Quantized approximately Orthogonal RNNs (QORNNs). The construction of such networks remained an open problem, acknowledged for its inherent instability. We propose and investigate two strategies to learn QORNN by combining quantization-aware training (QAT) and orthogonal projections. We also study post-training quantization of the activations for pure integer computation of the recurrent loop. The most efficient models achieve results similar to state-of-the-art full-precision ORNN, LSTM and FastRNN on a variety of standard benchmarks, even with 4-bits quantization.
STDec 24, 2021
Parameter identifiability of a deep feedforward ReLU neural networkJoachim Bona-Pellissier, François Bachoc, François Malgouyres
The possibility for one to recover the parameters-weights and biases-of a neural network thanks to the knowledge of its function on a subset of the input space can be, depending on the situation, a curse or a blessing. On one hand, recovering the parameters allows for better adversarial attacks and could also disclose sensitive information from the dataset used to construct the network. On the other hand, if the parameters of a network can be recovered, it guarantees the user that the features in the latent spaces can be interpreted. It also provides foundations to obtain formal guarantees on the performances of the network. It is therefore important to characterize the networks whose parameters can be identified and those whose parameters cannot. In this article, we provide a set of conditions on a deep fully-connected feedforward ReLU neural network under which the parameters of the network are uniquely identified-modulo permutation and positive rescaling-from the function it implements on a subset of the input space.
STAug 12, 2021
Existence, Stability and Scalability of Orthogonal Convolutional Neural NetworksEl Mehdi Achour, François Malgouyres, Franck Mamalet
Imposing orthogonality on the layers of neural networks is known to facilitate the learning by limiting the exploding/vanishing of the gradient; decorrelate the features; improve the robustness. This paper studies the theoretical properties of orthogonal convolutional layers.We establish necessary and sufficient conditions on the layer architecture guaranteeing the existence of an orthogonal convolutional transform. The conditions prove that orthogonal convolutional transforms exist for almost all architectures used in practice for 'circular' padding.We also exhibit limitations with 'valid' boundary conditions and 'same' boundary conditions with zero-padding.Recently, a regularization term imposing the orthogonality of convolutional layers has been proposed, and impressive empirical results have been obtained in different applications (Wang et al. 2020).The second motivation of the present paper is to specify the theory behind this.We make the link between this regularization term and orthogonality measures. In doing so, we show that this regularization strategy is stable with respect to numerical and optimization errors and that, in the presence of small errors and when the size of the signal/image is large, the convolutional layers remain close to isometric.The theoretical results are confirmed with experiments and the landscape of the regularization term is studied. Experiments on real data sets show that when orthogonality is used to enforce robustness, the parameter multiplying the regularization termcan be used to tune a tradeoff between accuracy and orthogonality, for the benefit of both accuracy and robustness.Altogether, the study guarantees that the regularization proposed in Wang et al. (2020) is an efficient, flexible and stable numerical strategy to learn orthogonal convolutional layers.
STJul 28, 2021
The loss landscape of deep linear neural networks: a second-order analysisEl Mehdi Achour, François Malgouyres, Sébastien Gerchinovitz
We study the optimization landscape of deep linear neural networks with the square loss. It is known that, under weak assumptions, there are no spurious local minima and no local maxima. However, the existence and diversity of non-strict saddle points, which can play a role in first-order algorithms' dynamics, have only been lightly studied. We go a step further with a full analysis of the optimization landscape at order 2. We characterize, among all critical points, which are global minimizers, strict saddle points, and non-strict saddle points. We enumerate all the associated critical values. The characterization is simple, involves conditions on the ranks of partial matrix products, and sheds some light on global convergence or implicit regularization that have been proved or observed when optimizing linear neural networks. In passing, we provide an explicit parameterization of the set of all global minimizers and exhibit large sets of strict and non-strict saddle points.
LGJan 26, 2021
Overestimation learning with guaranteesAdrien Gauffriau, François Malgouyres, Mélanie Ducoffe
We describe a complete method that learns a neural network which is guaranteed to overestimate a reference function on a given domain. The neural network can then be used as a surrogate for the reference function. The method involves two steps. In the first step, we construct an adaptive set of Majoring Points. In the second step, we optimize a well-chosen neural network to overestimate the Majoring Points. In order to extend the guarantee on the Majoring Points to the whole domain, we necessarily have to make an assumption on the reference function. In this study, we assume that the reference function is monotonic. We provide experiments on synthetic and real problems. The experiments show that the density of the Majoring Points concentrate where the reference function varies. The learned over-estimations are both guaranteed to overestimate the reference function and are proven empirically to provide good approximations of it. Experiments on real data show that the method makes it possible to use the surrogate function in embedded systems for which an underestimation is critical; when computing the reference function requires too many resources.
OCMar 23, 2017
Multilinear compressive sensing and an application to convolutional linear networksFrançois Malgouyres, Joseph Landsberg
We study a deep linear network endowed with a structure. It takes the form of a matrix $X$ obtained by multiplying $K$ matrices (called factors and corresponding to the action of the layers). The action of each layer (i.e. a factor) is obtained by applying a fixed linear operator to a vector of parameters satisfying a constraint. The number of layers is not limited. Assuming that $X$ is given and factors have been estimated, the error between the product of the estimated factors and $X$ (i.e. the reconstruction error) is either the statistical or the empirical risk. In this paper, we provide necessary and sufficient conditions on the network topology under which a stability property holds. The stability property requires that the error on the parameters defining the factors (i.e. the stability of the recovered parameters) scales linearly with the reconstruction error (i.e. the risk). Therefore, under these conditions on the network topology, any successful learning task leads to stably defined features and therefore interpretable layers/network.In order to do so, we first evaluate how the Segre embedding and its inverse distort distances. Then, we show that any deep structured linear network can be cast as a generic multilinear problem (that uses the Segre embedding). This is the {\em tensorial lifting}. Using the tensorial lifting, we provide necessary and sufficient conditions for the identifiability of the factors (up to a scale rearrangement). We finally provide the necessary and sufficient condition called \NSPlong~(because of the analogy with the usual Null Space Property in the compressed sensing framework) which guarantees that the stability property holds. We illustrate the theory with a practical example where the deep structured linear network is a convolutional linear network. As expected, the conditions are rather strong but not empty. A simple test on the network topology can be implemented to test if the condition holds.