Yehuda Dar

LG
h-index8
19papers
373citations
Novelty43%
AI Score42

19 Papers

LGMar 15, 2022Code
Can Neural Nets Learn the Same Model Twice? Investigating Reproducibility and Double Descent from the Decision Boundary Perspective

Gowthami Somepalli, Liam Fowl, Arpit Bansal et al.

We discuss methods for visualizing neural network decision boundaries and decision regions. We use these visualizations to investigate issues related to reproducibility and generalization in neural network training. We observe that changes in model architecture (and its associate inductive bias) cause visible changes in decision boundaries, while multiple runs with the same architecture yield results with strong similarities, especially in the case of wide architectures. We also use decision boundary methods to visualize double descent phenomena. We see that decision boundary reproducibility depends strongly on model width. Near the threshold of interpolation, neural network decision boundaries become fragmented into many small decision regions, and these regions are non-reproducible. Meanwhile, very narrows and very wide networks have high levels of reproducibility in their decision boundaries with relatively few decision regions. We discuss how our observations relate to the theory of double descent phenomena in convex models. Code is available at https://github.com/somepago/dbViz

LGNov 20, 2022
Frozen Overparameterization: A Double Descent Perspective on Transfer Learning of Deep Neural Networks

Yehuda Dar, Lorenzo Luzi, Richard G. Baraniuk

We study the generalization behavior of transfer learning of deep neural networks (DNNs). We adopt the overparameterization perspective -- featuring interpolation of the training data (i.e., approximately zero train error) and the double descent phenomenon -- to explain the delicate effect of the transfer learning setting on generalization performance. We study how the generalization behavior of transfer learning is affected by the dataset size in the source and target tasks, the number of transferred layers that are kept frozen in the target DNN training, and the similarity between the source and target tasks. We show that the test error evolution during the target DNN training has a more significant double descent effect when the target training dataset is sufficiently large. In addition, a larger source training dataset can yield a slower target DNN training. Moreover, we demonstrate that the number of frozen layers can determine whether the transfer learning is effectively underparameterized or overparameterized and, in turn, this may induce a freezing-wise double descent phenomenon that determines the relative success or failure of learning. Also, we show that the double descent phenomenon may make a transfer from a less related source task better than a transfer from a more related source task. We establish our results using image classification experiments with the ResNet, DenseNet and the vision transformer (ViT) architectures.

LGOct 4, 2023
How Much Training Data is Memorized in Overparameterized Autoencoders? An Inverse Problem Perspective on Memorization Evaluation

Koren Abitbul, Yehuda Dar

Overparameterized autoencoder models often memorize their training data. For image data, memorization is often examined by using the trained autoencoder to recover missing regions in its training images (that were used only in their complete forms in the training). In this paper, we propose an inverse problem perspective for the study of memorization. Given a degraded training image, we define the recovery of the original training image as an inverse problem and formulate it as an optimization task. In our inverse problem, we use the trained autoencoder to implicitly define a regularizer for the particular training dataset that we aim to retrieve from. We develop the intricate optimization task into a practical method that iteratively applies the trained autoencoder and relatively simple computations that estimate and address the unknown degradation operator. We evaluate our method for blind inpainting where the goal is to recover training images from degradation of many missing pixels in an unknown pattern. We examine various deep autoencoder architectures, such as fully connected and U-Net (with various nonlinearities and at diverse train loss values), and show that our method significantly outperforms previous memorization-evaluation methods that recover training data from autoencoders. Importantly, our method greatly improves the recovery performance also in settings that were previously considered highly challenging, and even impractical, for such recovery and memorization evaluation.

LGFeb 18
Transfer Learning of Linear Regression with Multiple Pretrained Models: Benefiting from More Pretrained Models via Overparameterization Debiasing

Daniel Boharon, Yehuda Dar

We study transfer learning for a linear regression task using several least-squares pretrained models that can be overparameterized. We formulate the target learning task as optimization that minimizes squared errors on the target dataset with penalty on the distance of the learned model from the pretrained models. We analytically formulate the test error of the learned target model and provide the corresponding empirical evaluations. Our results elucidate when using more pretrained models can improve transfer learning. Specifically, if the pretrained models are overparameterized, using sufficiently many of them is important for beneficial transfer learning. However, the learning may be compromised by overparameterization bias of pretrained models, i.e., the minimum $\ell_2$-norm solution's restriction to a small subspace spanned by the training examples in the high-dimensional parameter space. We propose a simple debiasing via multiplicative correction factor that can reduce the overparameterization bias and leverage more pretrained models to learn a target predictor.

LGMar 11, 2025
How Does Overparameterization Affect Machine Unlearning of Deep Neural Networks?

Gal Alon, Yehuda Dar

Machine unlearning is the task of updating a trained model to forget specific training data without retraining from scratch. In this paper, we investigate how unlearning of deep neural networks (DNNs) is affected by the model parameterization level, which corresponds here to the DNN width. We define validation-based tuning for several unlearning methods from the recent literature, and show how these methods perform differently depending on (i) the DNN parameterization level, (ii) the unlearning goal (unlearned data privacy or bias removal), (iii) whether the unlearning method explicitly uses the unlearned examples. Our results show that unlearning excels on overparameterized models, in terms of balancing between generalization and achieving the unlearning goal; although for bias removal this requires the unlearning method to use the unlearned examples. We further elucidate our error-based analysis by measuring how much the unlearning changes the classification decision regions in the proximity of the unlearned examples, and avoids changing them elsewhere. By this we show that the unlearning success for overparameterized models stems from the ability to delicately change the model functionality in small regions in the input space while keeping much of the model functionality unchanged.

LGOct 14, 2024
TL-PCA: Transfer Learning of Principal Component Analysis

Sharon Hendy, Yehuda Dar

Principal component analysis (PCA) can be significantly limited when there is too few examples of the target data of interest. We propose a transfer learning approach to PCA (TL-PCA) where knowledge from a related source task is used in addition to the scarce data of a target task. Our TL-PCA has two versions, one that uses a pretrained PCA solution of the source task, and another that uses the source data. Our proposed approach extends the PCA optimization objective with a penalty on the proximity of the target subspace and the source subspace as given by the pretrained source model or the source data. This optimization is solved by eigendecomposition for which the number of data-dependent eigenvectors (i.e., principal directions of TL-PCA) is not limited to the number of target data examples, which is a root cause that limits the standard PCA performance. Accordingly, our results for image datasets show that the representation of test data is improved by TL-PCA for dimensionality reduction where the learned subspace dimension is lower or higher than the number of target data examples.

LGOct 3, 2025
Mixture of Many Zero-Compute Experts: A High-Rate Quantization Theory Perspective

Yehuda Dar

This paper uses classical high-rate quantization theory to provide new insights into mixture-of-experts (MoE) models for regression tasks. Our MoE is defined by a segmentation of the input space to regions, each with a single-parameter expert that acts as a constant predictor with zero-compute at inference. Motivated by high-rate quantization theory assumptions, we assume that the number of experts is sufficiently large to make their input-space regions very small. This lets us to study the approximation error of our MoE model class: (i) for one-dimensional inputs, we formulate the test error and its minimizing segmentation and experts; (ii) for multidimensional inputs, we formulate an upper bound for the test error and study its minimization. Moreover, we consider the learning of the expert parameters from a training dataset, given an input-space segmentation, and formulate their statistical learning properties. This leads us to theoretically and empirically show how the tradeoff between approximation and estimation errors in MoE learning depends on the number of experts.

MLSep 6, 2021
A Farewell to the Bias-Variance Tradeoff? An Overview of the Theory of Overparameterized Machine Learning

Yehuda Dar, Vidya Muthukumar, Richard G. Baraniuk

The rapid recent progress in machine learning (ML) has raised a number of scientific questions that challenge the longstanding dogma of the field. One of the most important riddles is the good empirical generalization of overparameterized models. Overparameterized models are excessively complex with respect to the size of the training dataset, which results in them perfectly fitting (i.e., interpolating) the training data, which is usually noisy. Such interpolation of noisy data is traditionally associated with detrimental overfitting, and yet a wide range of interpolating models -- from simple linear models to deep neural networks -- have recently been observed to generalize extremely well on fresh test data. Indeed, the recently discovered double descent phenomenon has revealed that highly overparameterized models often improve over the best underparameterized model in test performance. Understanding learning in this overparameterized regime requires new theory and foundational empirical studies, even for the simplest case of the linear model. The underpinnings of this understanding have been laid in very recent analyses of overparameterized linear regression and related statistical learning tasks, which resulted in precise analytic characterizations of double descent. This paper provides a succinct overview of this emerging theory of overparameterized ML (henceforth abbreviated as TOPML) that explains these recent findings through a statistical signal processing perspective. We emphasize the unique aspects that define the TOPML research area as a subfield of modern ML theory and outline interesting open questions that remain.

LGJun 7, 2021
Double Descent and Other Interpolation Phenomena in GANs

Lorenzo Luzi, Yehuda Dar, Richard Baraniuk

We study overparameterization in generative adversarial networks (GANs) that can interpolate the training data. We show that overparameterization can improve generalization performance and accelerate the training process. We study the generalization error as a function of latent space dimension and identify two main behaviors, depending on the learning setting. First, we show that overparameterized generative models that learn distributions by minimizing a metric or $f$-divergence do not exhibit double descent in generalization errors; specifically, all the interpolating solutions achieve the same generalization error. Second, we develop a novel pseudo-supervised learning approach for GANs where the training utilizes pairs of fabricated (noise) inputs in conjunction with real output samples. Our pseudo-supervised setting exhibits double descent (and in some cases, triple descent) of generalization errors. We combine pseudo-supervision with overparameterization (i.e., overly large latent space dimension) to accelerate training while matching or even surpassing generalization performance without pseudo-supervision. While our analysis focuses mostly on linear models, we also apply important insights for improving generalization of nonlinear, multilayer GANs.

LGMar 9, 2021
The Common Intuition to Transfer Learning Can Win or Lose: Case Studies for Linear Regression

Yehuda Dar, Daniel LeJeune, Richard G. Baraniuk

We study a fundamental transfer learning process from source to target linear regression tasks, including overparameterized settings where there are more learned parameters than data samples. The target task learning is addressed by using its training data together with the parameters previously computed for the source task. We define a transfer learning approach to the target task as a linear regression optimization with a regularization on the distance between the to-be-learned target parameters and the already-learned source parameters. We analytically characterize the generalization performance of our transfer learning approach and demonstrate its ability to resolve the peak in generalization errors in double descent phenomena of the minimum L2-norm solution to linear regression. Moreover, we show that for sufficiently related tasks, the optimally tuned transfer learning approach can outperform the optimally tuned ridge regression method, even when the true parameter vector conforms to an isotropic Gaussian prior distribution. Namely, we demonstrate that transfer learning can beat the minimum mean square error (MMSE) solution of the independent target task. Our results emphasize the ability of transfer learning to extend the solution space to the target task and, by that, to have an improved MMSE solution. We formulate the linear MMSE solution to our transfer learning setting and point out its key differences from the common design philosophy to transfer learning.

IVOct 8, 2020
Regularized Compression of MRI Data: Modular Optimization of Joint Reconstruction and Coding

Veronica Corona, Yehuda Dar, Guy Williams et al.

The Magnetic Resonance Imaging (MRI) processing chain starts with a critical acquisition stage that provides raw data for reconstruction of images for medical diagnosis. This flow usually includes a near-lossless data compression stage that enables digital storage and/or transmission in binary formats. In this work we propose a framework for joint optimization of the MRI reconstruction and lossy compression, producing compressed representations of medical images that achieve improved trade-offs between quality and bit-rate. Moreover, we demonstrate that lossy compression can even improve the reconstruction quality compared to settings based on lossless compression. Our method has a modular optimization structure, implemented using the alternating direction method of multipliers (ADMM) technique and the state-of-the-art image compression technique (BPG) as a black-box module iteratively applied. This establishes a medical data compression approach compatible with a lossy compression standard of choice. A main novelty of the proposed algorithm is in the total-variation regularization added to the modular compression process, leading to decompressed images of higher quality without any additional processing at/after the decompression stage. Our experiments show that our regularization-based approach for joint MRI reconstruction and compression often achieves significant PSNR gains between 4 to 9 dB at high bit-rates compared to non-regularized solutions of the joint task. Compared to regularization-based solutions, our optimization method provides PSNR gains between 0.5 to 1 dB at high bit-rates, which is the range of interest for medical image compression.

LGJun 12, 2020
Double Double Descent: On Generalization Errors in Transfer Learning between Linear Regression Tasks

Yehuda Dar, Richard G. Baraniuk

We study the transfer learning process between two linear regression problems. An important and timely special case is when the regressors are overparameterized and perfectly interpolate their training data. We examine a parameter transfer mechanism whereby a subset of the parameters of the target task solution are constrained to the values learned for a related source task. We analytically characterize the generalization error of the target task in terms of the salient factors in the transfer learning architecture, i.e., the number of examples available, the number of (free) parameters in each of the tasks, the number of parameters transferred from the source to target task, and the relation between the two tasks. Our non-asymptotic analysis shows that the generalization error of the target task follows a two-dimensional double descent trend (with respect to the number of free parameters in each of the tasks) that is controlled by the transfer learning factors. Our analysis points to specific cases where the transfer of parameters is beneficial as a substitute for extra overparameterization (i.e., additional free parameters in the target task). Specifically, we show that the usefulness of a transfer learning setting is fragile and depends on a delicate interplay among the set of transferred parameters, the relation between the tasks, and the true solution. We also demonstrate that overparameterized transfer learning is not necessarily more beneficial when the source task is closer or identical to the target task.

LGFeb 25, 2020
Subspace Fitting Meets Regression: The Effects of Supervision and Orthonormality Constraints on Double Descent of Generalization Errors

Yehuda Dar, Paul Mayer, Lorenzo Luzi et al.

We study the linear subspace fitting problem in the overparameterized setting, where the estimated subspace can perfectly interpolate the training examples. Our scope includes the least-squares solutions to subspace fitting tasks with varying levels of supervision in the training data (i.e., the proportion of input-output examples of the desired low-dimensional mapping) and orthonormality of the vectors defining the learned operator. This flexible family of problems connects standard, unsupervised subspace fitting that enforces strict orthonormality with a corresponding regression task that is fully supervised and does not constrain the linear operator structure. This class of problems is defined over a supervision-orthonormality plane, where each coordinate induces a problem instance with a unique pair of supervision level and softness of orthonormality constraints. We explore this plane and show that the generalization errors of the corresponding subspace fitting problems follow double descent trends as the settings become more supervised and less orthonormally constrained.

MMJan 30, 2019
Benefiting from Duplicates of Compressed Data: Shift-Based Holographic Compression of Images

Yehuda Dar, Alfred M. Bruckstein

Storage systems often rely on multiple copies of the same compressed data, enabling recovery in case of binary data errors, of course, at the expense of a higher storage cost. In this paper we show that a wiser method of duplication entails great potential benefits for data types tolerating approximate representations, like images and videos. We propose a method to produce a set of distinct compressed representations for a given signal, such that any subset of them allows reconstruction of the signal at a quality depending only on the number of compressed representations utilized. Essentially, we implement the holographic representation idea, where all the representations are equally important in refining the reconstruction. Here we propose to exploit the shift sensitivity of common compression processes and generate holographic representations via compression of various shifts of the signal. Two implementations for the idea, based on standard compression methods, are presented: the first is a simple, optimization-free design. The second approach originates in a challenging rate-distortion optimization, mitigated by the alternating direction method of multipliers (ADMM), leading to a process of repeatedly applying standard compression techniques. Evaluation of the approach, in conjunction with the JPEG2000 image compression standard, shows the effectiveness of the optimization in providing compressed holographic representations that, by means of an elementary reconstruction process, enable impressive gains of several dBs in PSNR over exact duplications.

MMFeb 12, 2018
Compression for Multiple Reconstructions

Yehuda Dar, Michael Elad, Alfred M. Bruckstein

In this work we propose a method for optimizing the lossy compression for a network of diverse reconstruction systems. We focus on adapting a standard image compression method to a set of candidate displays, presenting the decompressed signals to viewers. Each display is modeled as a linear operator applied after decompression, and its probability to serve a network user. We formulate a complicated operational rate-distortion optimization trading-off the network's expected mean-squared reconstruction error and the compression bit-cost. Using the alternating direction method of multipliers (ADMM) we develop an iterative procedure where the network structure is separated from the compression method, enabling the reliance on standard compression techniques. We present experimental results showing our method to be the best approach for adjusting high bit-rate image compression (using the state-of-the-art HEVC standard) to a set of displays modeled as blur degradations.

MMNov 21, 2017
Optimized Pre-Compensating Compression

Yehuda Dar, Michael Elad, Alfred M. Bruckstein

In imaging systems, following acquisition, an image/video is transmitted or stored and eventually presented to human observers using different and often imperfect display devices. While the resulting quality of the output image may severely be affected by the display, this degradation is usually ignored in the preceding compression. In this paper we model the sub-optimality of the display device as a known degradation operator applied on the decompressed image/video. We assume the use of a standard compression path, and augment it with a suitable pre-processing procedure, providing a compressed signal intended to compensate the degradation without any post-filtering. Our approach originates from an intricate rate-distortion problem, optimizing the modifications to the input image/video for reaching best end-to-end performance. We address this seemingly computationally intractable problem using the alternating direction method of multipliers (ADMM) approach, leading to a procedure in which a standard compression technique is iteratively applied. We demonstrate the proposed method for adjusting HEVC image/video compression to compensate post-decompression visual effects due to a common type of displays. Particularly, we use our method to reduce motion-blur perceived while viewing video on LCD devices. The experiments establish our method as a leading approach for preprocessing high bit-rate compression to counterbalance a post-decompression degradation.

CVOct 30, 2015
Postprocessing of Compressed Images via Sequential Denoising

Yehuda Dar, Alfred M. Bruckstein, Michael Elad et al.

In this work we propose a novel postprocessing technique for compression-artifact reduction. Our approach is based on posing this task as an inverse problem, with a regularization that leverages on existing state-of-the-art image denoising algorithms. We rely on the recently proposed Plug-and-Play Prior framework, suggesting the solution of general inverse problems via Alternating Direction Method of Multipliers (ADMM), leading to a sequence of Gaussian denoising steps. A key feature in our scheme is a linearization of the compression-decompression process, so as to get a formulation that can be optimized. In addition, we supply a thorough analysis of this linear approximation for several basic compression procedures. The proposed method is suitable for diverse compression techniques that rely on transform coding. Specifically, we demonstrate impressive gains in image quality for several leading compression methods - JPEG, JPEG2000, and HEVC.

MMApr 15, 2014
Improving Low Bit-Rate Video Coding using Spatio-Temporal Down-Scaling

Yehuda Dar, Alfred M. Bruckstein

Good quality video coding for low bit-rate applications is important for transmission over narrow-bandwidth channels and for storage with limited memory capacity. In this work, we develop a previous analysis for image compression at low bit-rates to adapt it to video signals. Improving compression using down-scaling in the spatial and temporal dimensions is examined. We show, both theoretically and experimentally, that at low bit-rates, we benefit from applying spatio-temporal scaling. The proposed method includes down-scaling before the compression and a corresponding up-scaling afterwards, while the codec itself is left unmodified. We propose analytic models for low bit-rate compression and spatio-temporal scaling operations. Specifically, we use theoretic models of motion-compensated prediction of available and absent frames as in coding and frame-rate up-conversion (FRUC) applications, respectively. The proposed models are designed for multi-resolution analysis. In addition, we formulate a bit-allocation procedure and propose a method for estimating good down-scaling factors of a given video based on its second-order statistics and the given bit-budget. We validate our model with experimental results of H.264 compression.

MMApr 12, 2014
Motion-Compensated Coding and Frame-Rate Up-Conversion: Models and Analysis

Yehuda Dar, Alfred M. Bruckstein

Block-based motion estimation (ME) and compensation (MC) techniques are widely used in modern video processing algorithms and compression systems. The great variety of video applications and devices results in numerous compression specifications. Specifically, there is a diversity of frame-rates and bit-rates. In this paper, we study the effect of frame-rate and compression bit-rate on block-based ME and MC as commonly utilized in inter-frame coding and frame-rate up conversion (FRUC). This joint examination yields a comprehensive foundation for comparing MC procedures in coding and FRUC. First, the video signal is modeled as a noisy translational motion of an image. Then, we theoretically model the motion-compensated prediction of an available and absent frames as in coding and FRUC applications, respectively. The theoretic MC-prediction error is further analyzed and its autocorrelation function is calculated for coding and FRUC applications. We show a linear relation between the variance of the MC-prediction error and temporal-distance. While the affecting distance in MC-coding is between the predicted and reference frames, MC-FRUC is affected by the distance between the available frames used for the interpolation. Moreover, the dependency in temporal-distance implies an inverse effect of the frame-rate. FRUC performance analysis considers the prediction error variance, since it equals to the mean-squared-error of the interpolation. However, MC-coding analysis requires the entire autocorrelation function of the error; hence, analytic simplicity is beneficial. Therefore, we propose two constructions of a separable autocorrelation function for prediction error in MC-coding. We conclude by comparing our estimations with experimental results.