Greg Ongie

CV
h-index74
15papers
519citations
Novelty56%
AI Score39

15 Papers

LGJun 30, 2023
The Implicit Bias of Minima Stability in Multivariate Shallow ReLU Networks

Mor Shpigel Nacson, Rotem Mulayoff, Greg Ongie et al.

We study the type of solutions to which stochastic gradient descent converges when used to train a single hidden-layer multivariate ReLU network with the quadratic loss. Our results are based on a dynamical stability analysis. In the univariate case, it was shown that linearly stable minima correspond to network functions (predictors), whose second derivative has a bounded weighted $L^1$ norm. Notably, the bound gets smaller as the step size increases, implying that training with a large step size leads to `smoother' predictors. Here we generalize this result to the multivariate case, showing that a similar result applies to the Laplacian of the predictor. We demonstrate the tightness of our bound on the MNIST dataset, and show that it accurately captures the behavior of the solutions as a function of the step size. Additionally, we prove a depth separation result on the approximation power of ReLU networks corresponding to stable minima of the loss. Specifically, although shallow ReLU networks are universal approximators, we prove that stable shallow networks are not. Namely, there is a function that cannot be well-approximated by stable single hidden-layer ReLU networks trained with a non-vanishing step size. This is while the same function can be realized as a stable two hidden-layer ReLU network. Finally, we prove that if a function is sufficiently smooth (in a Sobolev sense) then it can be approximated arbitrarily well using shallow ReLU networks that correspond to stable solutions of gradient descent.

MLNov 12, 2023
How do Minimum-Norm Shallow Denoisers Look in Function Space?

Chen Zeno, Greg Ongie, Yaniv Blumenfeld et al.

Neural network (NN) denoisers are an essential building block in many common tasks, ranging from image reconstruction to image generation. However, the success of these models is not well understood from a theoretical perspective. In this paper, we aim to characterize the functions realized by shallow ReLU NN denoisers -- in the common theoretical setting of interpolation (i.e., zero training loss) with a minimal representation cost (i.e., minimal $\ell^2$ norm weights). First, for univariate data, we derive a closed form for the NN denoiser function, find it is contractive toward the clean data points, and prove it generalizes better than the empirical MMSE estimator at a low noise level. Next, for multivariate data, we find the NN denoiser functions in a closed form under various geometric assumptions on the training data: data contained in a low-dimensional subspace, data contained in a union of one-sided rays, or several types of simplexes. These functions decompose into a sum of simple rank-one piecewise linear interpolations aligned with edges and/or faces connecting training samples. We empirically verify this alignment phenomenon on synthetic data and real images.

MLJun 23, 2025
When Diffusion Models Memorize: Inductive Biases in Probability Flow of Minimum-Norm Shallow Neural Nets

Chen Zeno, Hila Manor, Greg Ongie et al.

While diffusion models generate high-quality images via probability flow, the theoretical understanding of this process remains incomplete. A key question is when probability flow converges to training samples or more general points on the data manifold. We analyze this by studying the probability flow of shallow ReLU neural network denoisers trained with minimal $\ell^2$ norm. For intuition, we introduce a simpler score flow and show that for orthogonal datasets, both flows follow similar trajectories, converging to a training point or a sum of training points. However, early stopping by the diffusion time scheduler allows probability flow to reach more general manifold points. This reflects the tendency of diffusion models to both memorize training samples and generate novel points that combine aspects of multiple samples, motivating our study of such behavior in simplified settings. We extend these results to obtuse simplex data and, through simulations in the orthogonal case, confirm that probability flow converges to a training point, a sum of training points, or a manifold point. Moreover, memorization decreases when the number of training samples grows, as fewer samples accumulate near training points.

LGFeb 13, 2024
Depth Separation in Norm-Bounded Infinite-Width Neural Networks

Suzanna Parkinson, Greg Ongie, Rebecca Willett et al.

We study depth separation in infinite-width neural networks, where complexity is controlled by the overall squared $\ell_2$-norm of the weights (sum of squares of all weights in the network). Whereas previous depth separation results focused on separation in terms of width, such results do not give insight into whether depth determines if it is possible to learn a network that generalizes well even when the network width is unbounded. Here, we study separation in terms of the sample complexity required for learnability. Specifically, we show that there are functions that are learnable with sample complexity polynomial in the input dimension by norm-controlled depth-3 ReLU networks, yet are not learnable with sub-exponential sample complexity by norm-controlled depth-2 ReLU networks (with any value for the norm). We also show that a similar statement in the reverse direction is not possible: any function learnable with polynomial sample complexity by a norm-controlled depth-2 ReLU network with infinite width is also learnable with polynomial sample complexity by a norm-controlled depth-3 ReLU network.

LGMay 24, 2023
ReLU Neural Networks with Linear Layers are Biased Towards Single- and Multi-Index Models

Suzanna Parkinson, Greg Ongie, Rebecca Willett

Neural networks often operate in the overparameterized regime, in which there are far more parameters than training samples, allowing the training data to be fit perfectly. That is, training the network effectively learns an interpolating function, and properties of the interpolant affect predictions the network will make on new samples. This manuscript explores how properties of such functions learned by neural networks of depth greater than two layers. Our framework considers a family of networks of varying depths that all have the same capacity but different representation costs. The representation cost of a function induced by a neural network architecture is the minimum sum of squared weights needed for the network to represent the function; it reflects the function space bias associated with the architecture. Our results show that adding additional linear layers to the input side of a shallow ReLU network yields a representation cost favoring functions with low mixed variation -- that is, it has limited variation in directions orthogonal to a low-dimensional subspace and can be well approximated by a single- or multi-index model. This bias occurs because minimizing the sum of squared weights of the linear layers is equivalent to minimizing a low-rank promoting Schatten quasi-norm of a single "virtual" weight matrix. Our experiments confirm this behavior in standard network training regimes. They additionally show that linear layers can improve generalization and the learned network is well-aligned with the true latent low-dimensional linear subspace when data is generated using a multi-index model.

LGFeb 2, 2022
The Role of Linear Layers in Nonlinear Interpolating Networks

Greg Ongie, Rebecca Willett

This paper explores the implicit bias of overparameterized neural networks of depth greater than two layers. Our framework considers a family of networks of varying depth that all have the same capacity but different implicitly defined representation costs. The representation cost of a function induced by a neural network architecture is the minimum sum of squared weights needed for the network to represent the function; it reflects the function space bias associated with the architecture. Our results show that adding linear layers to a ReLU network yields a representation cost that reflects a complex interplay between the alignment and sparsity of ReLU units. Specifically, using a neural network to fit training data with minimum representation cost yields an interpolating function that is constant in directions perpendicular to a low-dimensional subspace on which a parsimonious interpolant exists.

LGOct 3, 2019
A Function Space View of Bounded Norm Infinite Width ReLU Nets: The Multivariate Case

Greg Ongie, Rebecca Willett, Daniel Soudry et al.

A key element of understanding the efficacy of overparameterized neural networks is characterizing how they represent functions as the number of weights in the network approaches infinity. In this paper, we characterize the norm required to realize a function $f:\mathbb{R}^d\rightarrow\mathbb{R}$ as a single hidden-layer ReLU network with an unbounded number of units (infinite width), but where the Euclidean norm of the weights is bounded, including precisely characterizing which functions can be realized with finite norm. This was settled for univariate univariate functions in Savarese et al. (2019), where it was shown that the required norm is determined by the L1-norm of the second derivative of the function. We extend the characterization to multivariate functions (i.e., networks with d input units), relating the required norm to the L1-norm of the Radon transform of a (d+1)/2-power Laplacian of the function. This characterization allows us to show that all functions in Sobolev spaces $W^{s,1}(\mathbb{R})$, $s\geq d+1$, can be represented with bounded norm, to calculate the required norm for several specific functions, and to obtain a depth separation result. These results have important implications for understanding generalization performance and the distinction between neural networks and more traditional kernel learning.

CVJan 13, 2019
Neumann Networks for Inverse Problems in Imaging

Davis Gilton, Greg Ongie, Rebecca Willett

Many challenging image processing tasks can be described by an ill-posed linear inverse problem: deblurring, deconvolution, inpainting, compressed sensing, and superresolution all lie in this framework. Traditional inverse problem solvers minimize a cost function consisting of a data-fit term, which measures how well an image matches the observations, and a regularizer, which reflects prior knowledge and promotes images with desirable properties like smoothness. Recent advances in machine learning and image processing have illustrated that it is often possible to learn a regularizer from training data that can outperform more traditional regularizers. We present an end-to-end, data-driven method of solving inverse problems inspired by the Neumann series, which we call a Neumann network. Rather than unroll an iterative optimization algorithm, we truncate a Neumann series which directly solves the linear inverse problem with a data-driven nonlinear regularizer. The Neumann network architecture outperforms traditional inverse problem solution methods, model-free deep learning approaches, and state-of-the-art unrolled iterative methods on standard datasets. Finally, when the images belong to a union of subspaces and under appropriate assumptions on the forward model, we prove there exists a Neumann network configuration that well-approximates the optimal oracle estimator for the inverse problem and demonstrate empirically that the trained Neumann network has the form predicted by theory.

MLApr 26, 2018
Tensor Methods for Nonlinear Matrix Completion

Greg Ongie, Daniel Pimentel-Alarcón, Laura Balzano et al.

In the low-rank matrix completion (LRMC) problem, the low-rank assumption means that the columns (or rows) of the matrix to be completed are points on a low-dimensional linear algebraic variety. This paper extends this thinking to cases where the columns are points on a low-dimensional nonlinear algebraic variety, a problem we call Low Algebraic Dimension Matrix Completion (LADMC). Matrices whose columns belong to a union of subspaces are an important special case. We propose a LADMC algorithm that leverages existing LRMC methods on a tensorized representation of the data. For example, a second-order tensorized representation is formed by taking the Kronecker product of each column with itself, and we consider higher order tensorizations as well. This approach will succeed in many cases where traditional LRMC is guaranteed to fail because the data are low-rank in the tensorized representation but not in the original representation. We provide a formal mathematical justification for the success of our method. In particular, we give bounds of the rank of these data in the tensorized representation, and we prove sampling requirements to guarantee uniqueness of the solution. We also provide experimental results showing that the new approach outperforms existing state-of-the-art methods for matrix completion under a union of subspaces model.

NAJun 25, 2017
A Fast Algorithm for Convolutional Structured Low-Rank Matrix Recovery

Greg Ongie, Mathews Jacob

Fourier domain structured low-rank matrix priors are emerging as powerful alternatives to traditional image recovery methods such as total variation and wavelet regularization. These priors specify that a convolutional structured matrix, i.e., Toeplitz, Hankel, or their multi-level generalizations, built from Fourier data of the image should be low-rank. The main challenge in applying these schemes to large-scale problems is the computational complexity and memory demand resulting from lifting the image data to a large scale matrix. We introduce a fast and memory efficient approach called the Generic Iterative Reweighted Annihilation Filter (GIRAF) algorithm that exploits the convolutional structure of the lifted matrix to work in the original un-lifted domain, thus considerably reducing the complexity. Our experiments on the recovery of images from undersampled Fourier measurements show that the resulting algorithm is considerably faster than previously proposed algorithms, and can accommodate much larger problem sizes than previously studied.

MLMar 28, 2017
Algebraic Variety Models for High-Rank Matrix Completion

Greg Ongie, Rebecca Willett, Robert D. Nowak et al.

We consider a generalization of low-rank matrix completion to the case where the data belongs to an algebraic variety, i.e. each data point is a solution to a system of polynomial equations. In this case the original matrix is possibly high-rank, but it becomes low-rank after mapping each column to a higher dimensional space of monomial features. Many well-studied extensions of linear models, including affine subspaces and their union, can be described by a variety model. In addition, varieties can be used to model a richer class of nonlinear quadratic and higher degree curves and surfaces. We study the sampling requirements for matrix completion under a variety model with a focus on a union of affine subspaces. We also propose an efficient matrix completion algorithm that minimizes a convex or non-convex surrogate of the rank of the matrix of monomial features. Our algorithm uses the well-known "kernel trick" to avoid working directly with the high-dimensional monomial matrix. We show the proposed algorithm is able to recover synthetically generated data up to the predicted sampling complexity bounds. The proposed algorithm also outperforms standard low rank matrix completion and subspace clustering techniques in experiments with real data.

CVOct 1, 2015
Off-the-Grid Recovery of Piecewise Constant Images from Few Fourier Samples

Greg Ongie, Mathews Jacob

We introduce a method to recover a continuous domain representation of a piecewise constant two-dimensional image from few low-pass Fourier samples. Assuming the edge set of the image is localized to the zero set of a trigonometric polynomial, we show the Fourier coefficients of the partial derivatives of the image satisfy a linear annihilation relation. We present necessary and sufficient conditions for unique recovery of the image from finite low-pass Fourier samples using the annihilation relation. We also propose a practical two-stage recovery algorithm which is robust to model-mismatch and noise. In the first stage we estimate a continuous domain representation of the edge set of the image. In the second stage we perform an extrapolation in Fourier domain by a least squares two-dimensional linear prediction, which recovers the exact Fourier coefficients of the underlying image. We demonstrate our algorithm on the super-resolution recovery of MRI phantoms and real MRI data from low-pass Fourier samples, which shows benefits over standard approaches for single-image super-resolution MRI.

CVFeb 3, 2015
Recovery of Piecewise Smooth Images from Few Fourier Samples

Greg Ongie, Mathews Jacob

We introduce a Prony-like method to recover a continuous domain 2-D piecewise smooth image from few of its Fourier samples. Assuming the discontinuity set of the image is localized to the zero level-set of a trigonometric polynomial, we show the Fourier transform coefficients of partial derivatives of the signal satisfy an annihilation relation. We present necessary and sufficient conditions for unique recovery of piecewise constant images using the above annihilation relation. We pose the recovery of the Fourier coefficients of the signal from the measurements as a convex matrix completion algorithm, which relies on the lifting of the Fourier data to a structured low-rank matrix; this approach jointly estimates the signal and the annihilating filter. Finally, we demonstrate our algorithm on the recovery of MRI phantoms from few low-resolution Fourier samples.

CVJan 8, 2015
Super-resolution MRI Using Finite Rate of Innovation Curves

Greg Ongie, Mathews Jacob

We propose a two-stage algorithm for the super-resolution of MR images from their low-frequency k-space samples. In the first stage we estimate a resolution-independent mask whose zeros represent the edges of the image. This builds off recent work extending the theory of sampling signals of finite rate of innovation (FRI) to two-dimensional curves. We enable its application to MRI by proposing extensions of the signal models allowed by FRI theory, and by developing a more robust and efficient means to determine the edge mask. In the second stage of the scheme, we recover the super-resolved MR image using the discretized edge mask as an image prior. We evaluate our scheme on simulated single-coil MR data obtained from analytical phantoms, and compare against total variation reconstructions. Our experiments show improved performance in both noiseless and noisy settings.

CVMay 15, 2014
Iterative Non-Local Shrinkage Algorithm for MR Image Reconstruction

Yasir Q. Moshin, Greg Ongie, Mathews Jacob

We introduce a fast iterative non-local shrinkage algorithm to recover MRI data from undersampled Fourier measurements. This approach is enabled by the reformulation of current non-local schemes as an alternating algorithm to minimize a global criterion. The proposed algorithm alternates between a non-local shrinkage step and a quadratic subproblem. We derive analytical shrinkage rules for several penalties that are relevant in non-local regularization. The redundancy in the searches used to evaluate the shrinkage steps are exploited using filtering operations. The resulting algorithm is observed to be considerably faster than current alternating non-local algorithms. The comparisons of the proposed scheme with state-of-the-art regularization schemes show a considerable reduction in alias artifacts and preservation of edges.