LGJun 6, 2023
Continual Learning in Linear Classification on Separable DataItay Evron, Edward Moroshko, Gon Buzaglo et al.
We analyze continual learning on a sequence of separable linear classification tasks with binary labels. We show theoretically that learning with weak regularization reduces to solving a sequential max-margin problem, corresponding to a special case of the Projection Onto Convex Sets (POCS) framework. We then develop upper bounds on the forgetting and other quantities of interest under various settings with recurring tasks, including cyclic and random orderings of tasks. We discuss several practical implications to popular training practices like regularization scheduling and weighting. We point out several theoretical differences between our continual classification setting and a recently studied continual regression setting.
LGMay 19, 2022
How catastrophic can catastrophic forgetting be in linear regression?Itay Evron, Edward Moroshko, Rachel Ward et al.
To better understand catastrophic forgetting, we study fitting an overparameterized linear model to a sequence of tasks with different input distributions. We analyze how much the model forgets the true labels of earlier tasks after training on subsequent tasks, obtaining exact expressions and bounds. We establish connections between continual learning in the linear setting and two other research areas: alternating projections and the Kaczmarz method. In specific settings, we highlight differences between forgetting and convergence to the offline solution as studied in those areas. In particular, when T tasks in d dimensions are presented cyclically for k iterations, we prove an upper bound of T^2 * min{1/sqrt(k), d/k} on the forgetting. This stands in contrast to the convergence to the offline solution, which can be arbitrarily slow according to existing alternating projection results. We further show that the T^2 factor can be lifted when tasks are presented in a random ordering.
65.2LGMar 13
A Causal Framework for Mitigating Data Shifts in HealthcareKurt Butler, Stephanie Riley, Damian Machlanski et al.
Developing predictive models that perform reliably across diverse patient populations and heterogeneous environments is a core aim of medical research. However, generalization is only possible if the learned model is robust to statistical differences between data used for training and data seen at the time and place of deployment. Domain generalization methods provide strategies to address data shifts, but each method comes with its own set of assumptions and trade-offs. To apply these methods in healthcare, we must understand how domain shifts arise, what assumptions we prefer to make, and what our design constraints are. This article proposes a causal framework for the design of predictive models to improve generalization. Causality provides a powerful language to characterize and understand diverse domain shifts, regardless of data modality. This allows us to pinpoint why models fail to generalize, leading to more principled strategies to prepare for and adapt to shifts. We recommend general mitigation strategies, discussing trade-offs and highlighting existing work. Our causality-based perspective offers a critical foundation for developing robust, interpretable, and clinically relevant AI solutions in healthcare, paving the way for reliable real-world deployment.
LGJan 20
Optimal L2 Regularization in High-dimensional Continual Linear RegressionGilad Karpel, Edward Moroshko, Ran Levinstein et al.
We study generalization in an overparameterized continual linear regression setting, where a model is trained with L2 (isotropic) regularization across a sequence of tasks. We derive a closed-form expression for the expected generalization loss in the high-dimensional regime that holds for arbitrary linear teachers. We demonstrate that isotropic regularization mitigates label noise under both single-teacher and multiple i.i.d. teacher settings, whereas prior work accommodating multiple teachers either did not employ regularization or used memory-demanding methods. Furthermore, we prove that the optimal fixed regularization strength scales nearly linearly with the number of tasks $T$, specifically as $T/\ln T$. To our knowledge, this is the first such result in theoretical continual learning. Finally, we validate our theoretical findings through experiments on linear regression and neural networks, illustrating how this scaling law affects generalization and offering a practical recipe for the design of continual learning systems.
CVMar 18, 2025
CRCE: Coreference-Retention Concept Erasure in Text-to-Image Diffusion ModelsYuyang Xue, Edward Moroshko, Feng Chen et al.
Text-to-Image diffusion models can produce undesirable content that necessitates concept erasure. However, existing methods struggle with under-erasure, leaving residual traces of targeted concepts, or over-erasure, mistakenly eliminating unrelated but visually similar concepts. To address these limitations, we introduce CRCE, a novel concept erasure framework that leverages Large Language Models to identify both semantically related concepts that should be erased alongside the target and distinct concepts that should be preserved. By explicitly modelling coreferential and retained concepts semantically, CRCE enables more precise concept removal, without unintended erasure. Experiments demonstrate that CRCE outperforms existing methods on diverse erasure tasks, including real-world object, person identities, and abstract intellectual property characteristics. The constructed dataset CorefConcept and the source code will be release upon acceptance.
LGAug 18, 2025
A Shift in Perspective on Causality in Domain GeneralizationDamian Machlanski, Stephanie Riley, Edward Moroshko et al.
The promise that causal modelling can lead to robust AI generalization has been challenged in recent work on domain generalization (DG) benchmarks. We revisit the claims of the causality and DG literature, reconciling apparent contradictions and advocating for a more nuanced theory of the role of causality in generalization. We also provide an interactive demo at https://chai-uk.github.io/ukairs25-causal-predictors/.
LGFeb 19, 2021
On the Implicit Bias of Initialization Shape: Beyond Infinitesimal Mirror DescentShahar Azulay, Edward Moroshko, Mor Shpigel Nacson et al.
Recent work has highlighted the role of initialization scale in determining the structure of the solutions that gradient methods converge to. In particular, it was shown that large initialization leads to the neural tangent kernel regime solution, whereas small initialization leads to so called "rich regimes". However, the initialization structure is richer than the overall scale alone and involves relative magnitudes of different weights and layers in the network. Here we show that these relative scales, which we refer to as initialization shape, play an important role in determining the learned model. We develop a novel technique for deriving the inductive bias of gradient-flow and use it to obtain closed-form implicit regularizers for multiple cases of interest.
LGJul 13, 2020
Implicit Bias in Deep Linear Classification: Initialization Scale vs Training AccuracyEdward Moroshko, Suriya Gunasekar, Blake Woodworth et al.
We provide a detailed asymptotic study of gradient flow trajectories and their implicit optimization bias when minimizing the exponential loss over "diagonal linear networks". This is the simplest model displaying a transition between "kernel" and non-kernel ("rich" or "active") regimes. We show how the transition is controlled by the relationship between the initialization scale and how accurately we minimize the training loss. Our results indicate that some limit behaviors of gradient descent only kick in at ridiculous training accuracies (well beyond $10^{-100}$). Moreover, the implicit bias at reasonable initialization scales and training accuracies is more complex and not captured by these limits.
LGFeb 20, 2020
Kernel and Rich Regimes in Overparametrized ModelsBlake Woodworth, Suriya Gunasekar, Jason D. Lee et al.
A recent line of work studies overparametrized neural networks in the "kernel regime," i.e. when the network behaves during training as a kernelized linear predictor, and thus training with gradient descent has the effect of finding the minimum RKHS norm solution. This stands in contrast to other studies which demonstrate how gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms. Building on an observation by Chizat and Bach, we show how the scale of the initialization controls the transition between the "kernel" (aka lazy) and "rich" (aka active) regimes and affects generalization properties in multilayer homogeneous models. We also highlight an interesting role for the width of a model in the case that the predictor is not identically zero at initialization. We provide a complete and detailed analysis for a family of simple depth-$D$ models that already exhibit an interesting and meaningful transition between the kernel and rich regimes, and we also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.
LGJun 13, 2019
Kernel and Rich Regimes in Overparametrized ModelsBlake Woodworth, Suriya Gunasekar, Pedro Savarese et al.
A recent line of work studies overparametrized neural networks in the "kernel regime," i.e. when the network behaves during training as a kernelized linear predictor, and thus training with gradient descent has the effect of finding the minimum RKHS norm solution. This stands in contrast to other studies which demonstrate how gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms. Building on an observation by Chizat and Bach, we show how the scale of the initialization controls the transition between the "kernel" (aka lazy) and "rich" (aka active) regimes and affects generalization properties in multilayer homogeneous models. We provide a complete and detailed analysis for a simple two-layer model that already exhibits an interesting and meaningful transition between the kernel and rich regimes, and we demonstrate the transition for more complex matrix factorization models and multilayer non-linear networks.
LGJun 13, 2019
Finite Sample Analysis Of Dynamic Regression Parameter LearningMark Kozdoba, Edward Moroshko, Shie Mannor et al.
We consider the dynamic linear regression problem, where the predictor vector may vary with time. This problem can be modeled as a linear dynamical system, with non-constant observation operator, where the parameters that need to be learned are the variance of both the process noise and the observation noise. While variance estimation for dynamic regression is a natural problem, with a variety of applications, existing approaches to this problem either lack guarantees altogether, or only have asymptotic guarantees without explicit rates. In particular, existing literature does not provide any clues to the following fundamental question: In terms of data characteristics, what does the convergence rate depend on? In this paper we study the global system operator -- the operator that maps the noise vectors to the output. We obtain estimates on its spectrum, and as a result derive the first known variance estimators with finite sample complexity guarantees. The proposed bounds depend on the shape of a certain spectrum related to the system operator, and thus provide the first known explicit geometric parameter of the data that can be used to bound estimation errors. In addition, the results hold for arbitrary sub Gaussian distributions of noise terms. We evaluate the approach on synthetic and real-world benchmarks.
CLFeb 27, 2019
An Editorial Network for Enhanced Document SummarizationEdward Moroshko, Guy Feigenblat, Haggai Roitman et al.
We suggest a new idea of Editorial Network - a mixed extractive-abstractive summarization approach, which is applied as a post-processing step over a given sequence of extracted sentences. Our network tries to imitate the decision process of a human editor during summarization. Within such a process, each extracted sentence may be either kept untouched, rephrased or completely rejected. We further suggest an effective way for training the "editor" based on a novel soft-labeling approach. Using the CNN/DailyMail dataset we demonstrate the effectiveness of our approach compared to state-of-the-art extractive-only or abstractive-only baseline methods.
LGDec 17, 2018
Multi Instance Learning For Unbalanced DataMark Kozdoba, Edward Moroshko, Lior Shani et al.
In the context of Multi Instance Learning, we analyze the Single Instance (SI) learning objective. We show that when the data is unbalanced and the family of classifiers is sufficiently rich, the SI method is a useful learning algorithm. In particular, we show that larger data imbalance, a quality that is typically perceived as negative, in fact implies a better resilience of the algorithm to the statistical dependencies of the objects in bags. In addition, our results shed new light on some known issues with the SI method in the setting of linear classifiers, and we show that these issues are significantly less likely to occur in the setting of neural networks. We demonstrate our results on a synthetic dataset, and on the COCO dataset for the problem of patch classification with weak image level labels derived from captions.
LGMar 8, 2018
Efficient Loss-Based Decoding on Graphs For Extreme ClassificationItay Evron, Edward Moroshko, Koby Crammer
In extreme classification problems, learning algorithms are required to map instances to labels from an extremely large label set. We build on a recent extreme classification framework with logarithmic time and space, and on a general approach for error correcting output coding (ECOC) with loss-based decoding, and introduce a flexible and efficient approach accompanied by theoretical bounds. Our framework employs output codes induced by graphs, for which we show how to perform efficient loss-based decoding to potentially improve accuracy. In addition, our framework offers a tradeoff between accuracy, model size and prediction time. We show how to find the sweet spot of this tradeoff using only the training data. Our experimental study demonstrates the validity of our assumptions and claims, and shows that our method is competitive with state-of-the-art algorithms.
LGFeb 17, 2014
Selective Sampling with DriftEdward Moroshko, Koby Crammer
Recently there has been much work on selective sampling, an online active learning setting, in which algorithms work in rounds. On each round an algorithm receives an input and makes a prediction. Then, it can decide whether to query a label, and if so to update its model, otherwise the input is discarded. Most of this work is focused on the stationary case, where it is assumed that there is a fixed target model, and the performance of the algorithm is compared to a fixed model. However, in many real-world applications, such as spam prediction, the best target function may drift over time, or have shifts from time to time. We develop a novel selective sampling algorithm for the drifting setting, analyze it under no assumptions on the mechanism generating the sequence of instances, and derive new mistake bounds that depend on the amount of drift in the problem. Simulations on synthetic and real-world datasets demonstrate the superiority of our algorithms as a selective sampling algorithm in the drifting setting.
LGMar 15, 2013
A Last-Step Regression Algorithm for Non-Stationary Online LearningEdward Moroshko, Koby Crammer
The goal of a learner in standard online learning is to maintain an average loss close to the loss of the best-performing single function in some class. In many real-world problems, such as rating or ranking items, there is no single best target function during the runtime of the algorithm, instead the best (local) target function is drifting over time. We develop a novel last-step minmax optimal algorithm in context of a drift. We analyze the algorithm in the worst-case regret framework and show that it maintains an average loss close to that of the best slowly changing sequence of linear functions, as long as the total of drift is sublinear. In some situations, our bound improves over existing bounds, and additionally the algorithm suffers logarithmic regret when there is no drift. We also build on the H_infinity filter and its bound, and develop and analyze a second algorithm for drifting setting. Synthetic simulations demonstrate the advantages of our algorithms in a worst-case constant drift setting.
LGMar 1, 2013
Second-Order Non-Stationary Online Learning for RegressionNina Vaits, Edward Moroshko, Koby Crammer
The goal of a learner, in standard online learning, is to have the cumulative loss not much larger compared with the best-performing function from some fixed class. Numerous algorithms were shown to have this gap arbitrarily close to zero, compared with the best function that is chosen off-line. Nevertheless, many real-world applications, such as adaptive filtering, are non-stationary in nature, and the best prediction function may drift over time. We introduce two novel algorithms for online regression, designed to work well in non-stationary environment. Our first algorithm performs adaptive resets to forget the history, while the second is last-step min-max optimal in context of a drift. We analyze both algorithms in the worst-case regret framework and show that they maintain an average loss close to that of the best slowly changing sequence of linear functions, as long as the cumulative drift is sublinear. In addition, in the stationary case, when no drift occurs, our algorithms suffer logarithmic regret, as for previous algorithms. Our bounds improve over the existing ones, and simulations demonstrate the usefulness of these algorithms compared with other state-of-the-art approaches.
LGJan 25, 2013
Weighted Last-Step Min-Max Algorithm with Improved Sub-Logarithmic RegretEdward Moroshko, Koby Crammer
In online learning the performance of an algorithm is typically compared to the performance of a fixed function from some class, with a quantity called regret. Forster proposed a last-step min-max algorithm which was somewhat simpler than the algorithm of Vovk, yet with the same regret. In fact the algorithm he analyzed assumed that the choices of the adversary are bounded, yielding artificially only the two extreme cases. We fix this problem by weighing the examples in such a way that the min-max problem will be well defined, and provide analysis with logarithmic regret that may have better multiplicative factor than both bounds of Forster and Vovk. We also derive a new bound that may be sub-logarithmic, as a recent bound of Orabona et.al, but may have better multiplicative factor. Finally, we analyze the algorithm in a weak-type of non-stationary setting, and show a bound that is sub-linear if the non-stationarity is sub-linear as well.