LGMay 23, 2025
PLUMAGE: Probabilistic Low rank Unbiased Min Variance Gradient Estimator for Efficient Large Model TrainingMatan Haroush, Daniel Soudry
Accelerator memory and networking constraints have emerged as dominant bottlenecks when training large language models LLMs with billions of parameters. Existing low rank gradient estimators such as GaLoRE and FLORA compress gradients and optimizer tensors by projecting weight gradients onto a rank r subspace, enabling LLM training on consumer hardware. Yet, these methods are either biased or subject to high estimator variance. Moreover, the optimizer state based on the first and second moments estimates expressed in the previous subspace becomes misaligned whenever the projection is updated, leading to instabilities during training. We propose PLUMAGE: Probabilistic Low rank Unbiased Minimum vAriance Gradient Estimator. PLUMAGE is a drop in replacement for existing low rank gradient estimators. It does not introduce new hyperparameters beyond the chosen rank r and the update interval. In addition, we resolve optimizer state misalignment issues to prevent spurious weight updates and enhance training stability. We empirically demonstrate that PLUMAGE shrinks the full rank optimization's gap over the pre training evaluation loss by 33% on average across models and the average training loss across the GLUE benchmark by 28% within a similar computational and memory footprint as GaloRE.
LGFeb 25, 2021
A statistical framework for efficient out of distribution detection in deep neural networksMatan Haroush, Tzviel Frostig, Ruth Heller et al.
Background. Commonly, Deep Neural Networks (DNNs) generalize well on samples drawn from a distribution similar to that of the training set. However, DNNs' predictions are brittle and unreliable when the test samples are drawn from a dissimilar distribution. This is a major concern for deployment in real-world applications, where such behavior may come at a considerable cost, such as industrial production lines, autonomous vehicles, or healthcare applications. Contributions. We frame Out Of Distribution (OOD) detection in DNNs as a statistical hypothesis testing problem. Tests generated within our proposed framework combine evidence from the entire network. Unlike previous OOD detection heuristics, this framework returns a $p$-value for each test sample. It is guaranteed to maintain the Type I Error (T1E - incorrectly predicting OOD for an actual in-distribution sample) for test data. Moreover, this allows to combine several detectors while maintaining the T1E. Building on this framework, we suggest a novel OOD procedure based on low-order statistics. Our method achieves comparable or better results than state-of-the-art methods on well-accepted OOD benchmarks, without retraining the network parameters or assuming prior knowledge on the test distribution -- and at a fraction of the computational cost.
LGDec 3, 2019
The Knowledge Within: Methods for Data-Free Model CompressionMatan Haroush, Itay Hubara, Elad Hoffer et al.
Recently, an extensive amount of research has been focused on compressing and accelerating Deep Neural Networks (DNN). So far, high compression rate algorithms require part of the training dataset for a low precision calibration, or a fine-tuning process. However, this requirement is unacceptable when the data is unavailable or contains sensitive information, as in medical and biometric use-cases. We present three methods for generating synthetic samples from trained models. Then, we demonstrate how these samples can be used to calibrate and fine-tune quantized models without using any real data in the process. Our best performing method has a negligible accuracy degradation compared to the original training set. This method, which leverages intrinsic batch normalization layers' statistics of the trained model, can be used to evaluate data similarity. Our approach opens a path towards genuine data-free model compression, alleviating the need for training data during model deployment.
LGSep 6, 2018
Learn What Not to Learn: Action Elimination with Deep Reinforcement LearningTom Zahavy, Matan Haroush, Nadav Merlis et al.
Learning how to act when there are many available actions in each state is a challenging task for Reinforcement Learning (RL) agents, especially when many of the actions are redundant or irrelevant. In such cases, it is sometimes easier to learn which actions not to take. In this work, we propose the Action-Elimination Deep Q-Network (AE-DQN) architecture that combines a Deep RL algorithm with an Action Elimination Network (AEN) that eliminates sub-optimal actions. The AEN is trained to predict invalid actions, supervised by an external elimination signal provided by the environment. Simulations demonstrate a considerable speedup and added robustness over vanilla DQN in text-based games with over a thousand discrete actions.