Timothée Lesort

LG
h-index19
23papers
1,716citations
Novelty38%
AI Score31

23 Papers

CLAug 8, 2023
Continual Pre-Training of Large Language Models: How to (re)warm your model?

Kshitij Gupta, Benjamin Thérien, Adam Ibrahim et al.

Large language models (LLMs) are routinely pre-trained on billions of tokens, only to restart the process over again once new data becomes available. A much cheaper and more efficient solution would be to enable the continual pre-training of these models, i.e. updating pre-trained models with new data instead of re-training them from scratch. However, the distribution shift induced by novel data typically results in degraded performance on past data. Taking a step towards efficient continual pre-training, in this work, we examine the effect of different warm-up strategies. Our hypothesis is that the learning rate must be re-increased to improve compute efficiency when training on a new dataset. We study the warmup phase of models pre-trained on the Pile (upstream data, 300B tokens) as we continue to pre-train on SlimPajama (downstream data, 297B tokens), following a linear warmup and cosine decay schedule. We conduct all experiments on the Pythia 410M language model architecture and evaluate performance through validation perplexity. We experiment with different pre-training checkpoints, various maximum learning rates, and various warmup lengths. Our results show that while rewarming models first increases the loss on upstream and downstream data, in the longer run it improves the downstream performance, outperforming models trained from scratch$\unicode{x2013}$even for a large downstream dataset.

LGJul 10, 2022
Challenging Common Assumptions about Catastrophic Forgetting

Timothée Lesort, Oleksiy Ostapenko, Diganta Misra et al.

Building learning agents that can progressively learn and accumulate knowledge is the core goal of the continual learning (CL) research field. Unfortunately, training a model on new data usually compromises the performance on past data. In the CL literature, this effect is referred to as catastrophic forgetting (CF). CF has been largely studied, and a plethora of methods have been proposed to address it on short sequences of non-overlapping tasks. In such setups, CF always leads to a quick and significant drop in performance in past tasks. Nevertheless, despite CF, recent work showed that SGD training on linear models accumulates knowledge in a CL regression setup. This phenomenon becomes especially visible when tasks reoccur. We might then wonder if DNNs trained with SGD or any standard gradient-based optimization accumulate knowledge in such a way. Such phenomena would have interesting consequences for applying DNNs to real continual scenarios. Indeed, standard gradient-based optimization methods are significantly less computationally expensive than existing CL algorithms. In this paper, we study the progressive knowledge accumulation (KA) in DNNs trained with gradient-based algorithms in long sequences of tasks with data re-occurrence. We propose a new framework, SCoLe (Scaling Continual Learning), to investigate KA and discover that catastrophic forgetting has a limited effect on DNNs trained with SGD. When trained on long sequences with data sparsely re-occurring, the overall accuracy improves, which might be counter-intuitive given the CF phenomenon. We empirically investigate KA in DNNs under various data occurrence frequencies and propose simple and scalable strategies to increase knowledge accumulation in DNNs.

LGAug 30, 2022
Beyond Supervised Continual Learning: a Review

Benedikt Bagus, Alexander Gepperth, Timothée Lesort

Continual Learning (CL, sometimes also termed incremental learning) is a flavor of machine learning where the usual assumption of stationary data distribution is relaxed or omitted. When naively applying, e.g., DNNs in CL problems, changes in the data distribution can cause the so-called catastrophic forgetting (CF) effect: an abrupt loss of previous knowledge. Although many significant contributions to enabling CL have been made in recent years, most works address supervised (classification) problems. This article reviews literature that study CL in other settings, such as learning with reduced supervision, fully unsupervised learning, and reinforcement learning. Besides proposing a simple schema for classifying CL approaches w.r.t. their level of autonomy and supervision, we discuss the specific challenges associated with each setting and the potential contributions to the field of CL in general.

LGMar 2, 2022
Continual Feature Selection: Spurious Features in Continual Learning

Timothée Lesort

Continual Learning (CL) is the research field addressing learning without forgetting when the data distribution is not static. This paper studies spurious features' influence on continual learning algorithms. We show that continual learning algorithms solve tasks by selecting features that are not generalizable. Our experiments highlight that continual learning algorithms face two related problems: (1) spurious features and (2) local spurious features. The first one is due to a covariate shift between training and testing data, while the second is due to the limited access to data at each training step. We study (1) through a consistent set of continual learning experiments varying spurious correlation amount and data distribution support. We show that (2) is a major cause of performance decrease in continual learning along with catastrophic forgetting. This paper presents a different way of understanding performance decrease in continual learning by highlighting the influence of (local) spurious features in algorithms capabilities.

CLNov 2, 2023
Continual Learning Under Language Shift

Evangelia Gogoulou, Timothée Lesort, Magnus Boman et al.

The recent increase in data and model scale for language model pre-training has led to huge training costs. In scenarios where new data become available over time, updating a model instead of fully retraining it would therefore provide significant gains. We study the pros and cons of updating a language model when new data comes from new languages -- the case of continual learning under language shift. Starting from a monolingual English language model, we incrementally add data from Danish, Icelandic, and Norwegian to investigate how forward and backward transfer effects depend on pre-training order and characteristics of languages, for three different model sizes. Our results show that, while forward transfer is largely positive and independent of language order, backward transfer can be positive or negative depending on the order and characteristics of new languages. We explore a number of potentially explanatory factors and find that a combination of language contamination and syntactic similarity best fits our results.

LGAug 2, 2021Code
Sequoia: A Software Framework to Unify Continual Learning Research

Fabrice Normandin, Florian Golemo, Oleksiy Ostapenko et al.

The field of Continual Learning (CL) seeks to develop algorithms that accumulate knowledge and skills over time through interaction with non-stationary environments. In practice, a plethora of evaluation procedures (settings) and algorithmic solutions (methods) exist, each with their own potentially disjoint set of assumptions. This variety makes measuring progress in CL difficult. We propose a taxonomy of settings, where each setting is described as a set of assumptions. A tree-shaped hierarchy emerges from this view, where more general settings become the parents of those with more restrictive assumptions. This makes it possible to use inheritance to share and reuse research, as developing a method for a given setting also makes it directly applicable onto any of its children. We instantiate this idea as a publicly available software framework called Sequoia, which features a wide variety of settings from both the Continual Supervised Learning (CSL) and Continual Reinforcement Learning (CRL) domains. Sequoia also includes a growing suite of methods which are easy to extend and customize, in addition to more specialized methods from external libraries. We hope that this new paradigm and its first implementation can help unify and accelerate research in CL. You can help us grow the tree by visiting www.github.com/lebrice/Sequoia.

LGDec 21, 2018Code
Generative Models from the perspective of Continual Learning

Timothée Lesort, Hugo Caselles-Dupré, Michael Garcia-Ortiz et al.

Which generative model is the most suitable for Continual Learning? This paper aims at evaluating and comparing generative models on disjoint sequential image generation tasks. We investigate how several models learn and forget, considering various strategies: rehearsal, regularization, generative replay and fine-tuning. We used two quantitative metrics to estimate the generation quality and memory ability. We experiment with sequential tasks on three commonly used benchmarks for Continual Learning (MNIST, Fashion MNIST and CIFAR10). We found that among all models, the original GAN performs best and among Continual Learning strategies, generative replay outperforms all other methods. Even if we found satisfactory combinations on MNIST and Fashion MNIST, training generative models sequentially on CIFAR10 is particularly instable, and remains a challenge. Our code is available online \footnote{\url{https://github.com/TLESORT/Generative\_Continual\_Learning}}.

LGMar 13, 2024
Simple and Scalable Strategies to Continually Pre-train Large Language Models

Adam Ibrahim, Benjamin Thérien, Kshitij Gupta et al.

Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available. A much more efficient solution is to continually pre-train these models, saving significant compute compared to re-training. However, the distribution shift induced by new data typically results in degraded performance on previous data or poor adaptation to the new data. In this work, we show that a simple and scalable combination of learning rate (LR) re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch on all available data, as measured by the final loss and the average score on several language model (LM) evaluation benchmarks. Specifically, we show this for a weak but realistic distribution shift between two commonly used LLM pre-training datasets (English$\rightarrow$English) and a stronger distribution shift (English$\rightarrow$German) at the $405$M parameter model scale with large dataset sizes (hundreds of billions of tokens). Selecting the weak but realistic shift for larger-scale experiments, we also find that our continual learning strategies match the re-training baseline for a 10B parameter LLM. Our results demonstrate that LLMs can be successfully updated via simple and scalable continual learning strategies, matching the re-training baseline using only a fraction of the compute. Finally, inspired by previous work, we propose alternatives to the cosine learning rate schedule that help circumvent forgetting induced by LR re-warming and that are not bound to a fixed token budget.

CVJan 9, 2025
Atlas: A Novel Pathology Foundation Model by Mayo Clinic, Charité, and Aignostics

Maximilian Alber, Stephan Tietz, Jonas Dippel et al.

Recent advances in digital pathology have demonstrated the effectiveness of foundation models across diverse applications. In this report, we present Atlas, a novel vision foundation model based on the RudolfV approach. Our model was trained on a dataset comprising 1.2 million histopathology whole slide images, collected from two medical institutions: Mayo Clinic and Charité - Universtätsmedizin Berlin. Comprehensive evaluations show that Atlas achieves state-of-the-art performance across twenty-one public benchmark datasets, even though it is neither the largest model by parameter count nor by training dataset size.

LGJun 3, 2021
Continual Learning in Deep Networks: an Analysis of the Last Layer

Timothée Lesort, Thomas George, Irina Rish

We study how different output layer parameterizations of a deep neural network affects learning and forgetting in continual learning settings. The following three effects can cause catastrophic forgetting in the output layer: (1) weights modifications, (2) interference, and (3) projection drift. In this paper, our goal is to provide more insights into how changing the output layer parameterization may address (1) and (2). Some potential solutions to those issues are proposed and evaluated here in several continual learning scenarios. We show that the best-performing type of output layer depends on the data distribution drifts and/or the amount of data available. In particular, in some cases where a standard linear layer would fail, changing parameterization is sufficient to achieve a significantly better performance, without introducing any continual-learning algorithm but instead by using standard SGD to train a model. Our analysis and results shed light on the dynamics of the output layer in continual learning scenarios and suggest a way of selecting the best type of output layer for a given scenario.

LGApr 4, 2021
Understanding Continual Learning Settings with Data Distribution Drift Analysis

Timothée Lesort, Massimo Caccia, Irina Rish

Classical machine learning algorithms often assume that the data are drawn i.i.d. from a stationary probability distribution. Recently, continual learning emerged as a rapidly growing area of machine learning where this assumption is relaxed, i.e. where the data distribution is non-stationary and changes over time. This paper represents the state of data distribution by a context variable $c$. A drift in $c$ leads to a data distribution drift. A context drift may change the target distribution, the input distribution, or both. Moreover, distribution drifts might be abrupt or gradual. In continual learning, context drifts may interfere with the learning process and erase previously learned knowledge; thus, continual learning algorithms must include specialized mechanisms to deal with such drifts. In this paper, we aim to identify and categorize different types of context drifts and potential assumptions about them, to better characterize various continual-learning scenarios. Moreover, we propose to use the distribution drift framework to provide more precise definitions of several terms commonly used in the continual learning field.

LGFeb 11, 2021
Continuum: Simple Management of Complex Continual Learning Scenarios

Arthur Douillard, Timothée Lesort

Continual learning is a machine learning sub-field specialized in settings with non-iid data. Hence, the training data distribution is not static and drifts through time. Those drifts might cause interferences in the trained model and knowledge learned on previous states of the data distribution might be forgotten. Continual learning's challenge is to create algorithms able to learn an ever-growing amount of knowledge while dealing with data distribution drifts. One implementation difficulty in these field is to create data loaders that simulate non-iid scenarios. Indeed, data loaders are a key component for continual algorithms. They should be carefully designed and reproducible. Small errors in data loaders have a critical impact on algorithm results, e.g. with bad preprocessing, wrong order of data or bad test set. Continuum is a simple and efficient framework with numerous data loaders that avoid researcher to spend time on designing data loader and eliminate time-consuming errors. Using our proposed framework, it is possible to directly focus on the model design by using the multiple scenarios and evaluation metrics implemented. Furthermore the framework is easily extendable to add novel settings for specific needs.

LGJul 1, 2020
Continual Learning: Tackling Catastrophic Forgetting in Deep Neural Networks with Replay Processes

Timothée Lesort

Humans learn all their life long. They accumulate knowledge from a sequence of learning experiences and remember the essential concepts without forgetting what they have learned previously. Artificial neural networks struggle to learn similarly. They often rely on data rigorously preprocessed to learn solutions to specific problems such as classification or regression. In particular, they forget their past learning experiences if trained on new ones. Therefore, artificial neural networks are often inept to deal with real-life settings such as an autonomous-robot that has to learn on-line to adapt to new situations and overcome new problems without forgetting its past learning-experiences. Continual learning (CL) is a branch of machine learning addressing this type of problem. Continual algorithms are designed to accumulate and improve knowledge in a curriculum of learning-experiences without forgetting. In this thesis, we propose to explore continual algorithms with replay processes. Replay processes gather together rehearsal methods and generative replay methods. Generative Replay consists of regenerating past learning experiences with a generative model to remember them. Rehearsal consists of saving a core-set of samples from past learning experiences to rehearse them later. The replay processes make possible a compromise between optimizing the current learning objective and the past ones enabling learning without forgetting in sequences of tasks settings. We show that they are very promising methods for continual learning. Notably, they enable the re-evaluation of past data with new knowledge and the confrontation of data from different learning-experiences. We demonstrate their ability to learn continually through unsupervised learning, supervised learning and reinforcement learning tasks.

LGDec 6, 2019
Regularization Shortcomings for Continual Learning

Timothée Lesort, Andrei Stoian, David Filliat

In most machine learning algorithms, training data is assumed to be independent and identically distributed (iid). When it is not the case, the algorithm's performances are challenged, leading to the famous phenomenon of catastrophic forgetting. Algorithms dealing with it are gathered in the Continual Learning research field. In this paper, we study the regularization based approaches to continual learning and show that those approaches can not learn to discriminate classes from different tasks in an elemental continual benchmark: the class-incremental scenario. We make theoretical reasoning to prove this shortcoming and illustrate it with examples and experiments. Moreover, we show that it can have some important consequences on continual multi-tasks reinforcement learning or in pre-trained models used for continual learning. We believe that highlighting and understanding the shortcomings of regularization strategies will help us to use them more efficiently.

LGJul 11, 2019
DisCoRL: Continual Reinforcement Learning via Policy Distillation

René Traoré, Hugo Caselles-Dupré, Timothée Lesort et al.

In multi-task reinforcement learning there are two main challenges: at training time, the ability to learn different policies with a single model; at test time, inferring which of those policies applying without an external signal. In the case of continual reinforcement learning a third challenge arises: learning tasks sequentially without forgetting the previous ones. In this paper, we tackle these challenges by proposing DisCoRL, an approach combining state representation learning and policy distillation. We experiment on a sequence of three simulated 2D navigation tasks with a 3 wheel omni-directional robot. Moreover, we tested our approach's robustness by transferring the final policy into a real life setting. The policy can solve all tasks and automatically infer which one to run.

LGJun 29, 2019
Continual Learning for Robotics: Definition, Framework, Learning Strategies, Opportunities and Challenges

Timothée Lesort, Vincenzo Lomonaco, Andrei Stoian et al.

Continual learning (CL) is a particular machine learning paradigm where the data distribution and learning objective changes through time, or where all the training data and objective criteria are never available at once. The evolution of the learning process is modeled by a sequence of learning experiences where the goal is to be able to learn new skills all along the sequence without forgetting what has been previously learned. Continual learning also aims at the same time at optimizing the memory, the computation power and the speed during the learning process. An important challenge for machine learning is not necessarily finding solutions that work in the real world but rather finding stable algorithms that can learn in real world. Hence, the ideal approach would be tackling the real world in a embodied platform: an autonomous agent. Continual learning would then be effective in an autonomous agent or robot, which would learn autonomously through time about the external world, and incrementally develop a set of complex skills and knowledge. Robotic agents have to learn to adapt and interact with their environment using a continuous stream of observations. Some recent approaches aim at tackling continual learning for robotics, but most recent papers on continual learning only experiment approaches in simulation or with static datasets. Unfortunately, the evaluation of those algorithms does not provide insights on whether their solutions may help continual learning in the context of robotics. This paper aims at reviewing the existing state of the art of continual learning, summarizing existing benchmarks and metrics, and proposing a framework for presenting and evaluating both robotics and non robotics approaches in a way that makes transfer between both fields easier.

LGJun 11, 2019
Continual Reinforcement Learning deployed in Real-life using Policy Distillation and Sim2Real Transfer

René Traoré, Hugo Caselles-Dupré, Timothée Lesort et al.

We focus on the problem of teaching a robot to solve tasks presented sequentially, i.e., in a continual learning scenario. The robot should be able to solve all tasks it has encountered, without forgetting past tasks. We provide preliminary work on applying Reinforcement Learning to such setting, on 2D navigation tasks for a 3 wheel omni-directional robot. Our approach takes advantage of state representation learning and policy distillation. Policies are trained using learned features as input, rather than raw observations, allowing better sample efficiency. Policy distillation is used to combine multiple policies into a single one that solves all encountered tasks.

LGJan 24, 2019
Decoupling feature extraction from policy learning: assessing benefits of state representation learning in goal based robotics

Antonin Raffin, Ashley Hill, René Traoré et al.

Scaling end-to-end reinforcement learning to control real robots from vision presents a series of challenges, in particular in terms of sample efficiency. Against end-to-end learning, state representation learning can help learn a compact, efficient and relevant representation of states that speeds up policy learning, reducing the number of samples needed, and that is easier to interpret. We evaluate several state representation learning methods on goal based robotics tasks and propose a new unsupervised model that stacks representations and combines strengths of several of these approaches. This method encodes all the relevant features, performs on par or better than end-to-end learning with better sample efficiency, and is robust to hyper-parameters change.

LGOct 29, 2018
Marginal Replay vs Conditional Replay for Continual Learning

Timothée Lesort, Alexander Gepperth, Andrei Stoian et al.

We present a new replay-based method of continual classification learning that we term "conditional replay" which generates samples and labels together by sampling from a distribution conditioned on the class. We compare conditional replay to another replay-based continual learning paradigm (which we term "marginal replay") that generates samples independently of their class and assigns labels in a separate step. The main improvement in conditional replay is that labels for generated samples need not be inferred, which reduces the margin for error in complex continual classification learning tasks. We demonstrate the effectiveness of this approach using novel and standard benchmarks constructed from MNIST and FashionMNIST data, and compare to the regularization-based \textit{elastic weight consolidation} (EWC) method.

LGSep 25, 2018
S-RL Toolbox: Environments, Datasets and Evaluation Metrics for State Representation Learning

Antonin Raffin, Ashley Hill, René Traoré et al.

State representation learning aims at learning compact representations from raw observations in robotics and control applications. Approaches used for this objective are auto-encoders, learning forward models, inverse dynamics or learning using generic priors on the state characteristics. However, the diversity in applications and methods makes the field lack standard evaluation datasets, metrics and tasks. This paper provides a set of environments, data generators, robotic control tasks, metrics and tools to facilitate iterative state representation learning and evaluation in reinforcement learning settings.

LGJun 28, 2018
Training Discriminative Models to Evaluate Generative Ones

Timothée Lesort, Andrei Stoain, Jean-François Goudou et al.

Generative models are known to be difficult to assess. Recent works, especially on generative adversarial networks (GANs), produce good visual samples of varied categories of images. However, the validation of their quality is still difficult to define and there is no existing agreement on the best evaluation process. This paper aims at making a step toward an objective evaluation process for generative models. It presents a new method to assess a trained generative model by evaluating the test accuracy of a classifier trained with generated data. The test set is composed of real images. Therefore, The classifier accuracy is used as a proxy to evaluate if the generative model fit the true data distribution. By comparing results with different generated datasets we are able to classify and compare generative models. The motivation of this approach is also to evaluate if generative models can help discriminative neural networks to learn, i.e., measure if training on generated data is able to make a model successful at testing on real settings. Our experiments compare different generators from the Variational Auto-Encoders (VAE) and Generative Adversarial Network (GAN) frameworks on MNIST and fashion MNIST datasets. Our results show that none of the generative models is able to replace completely true data to train a discriminative model. But they also show that the initial GAN and WGAN are the best choices to generate on MNIST database (Modified National Institute of Standards and Technology database) and fashion MNIST database.

AIFeb 12, 2018
State Representation Learning for Control: An Overview

Timothée Lesort, Natalia Díaz-Rodríguez, Jean-François Goudou et al.

Representation learning algorithms are designed to learn abstract features that characterize data. State representation learning (SRL) focuses on a particular kind of representation learning where learned features are in low dimension, evolve through time, and are influenced by actions of an agent. The representation is learned to capture the variation in the environment generated by the agent's actions; this kind of representation is particularly suitable for robotics and control scenarios. In particular, the low dimension characteristic of the representation helps to overcome the curse of dimensionality, provides easier interpretation and utilization by humans and can help improve performance and speed in policy learning algorithms such as reinforcement learning. This survey aims at covering the state-of-the-art on state representation learning in the most recent years. It reviews different SRL methods that involve interaction with the environment, their implementations and their applications in robotics control tasks (simulated or real). In particular, it highlights how generic learning objectives are differently exploited in the reviewed algorithms. Finally, it discusses evaluation methods to assess the representation learned and summarizes current and future lines of research.

AISep 15, 2017
Unsupervised state representation learning with robotic priors: a robustness benchmark

Timothée Lesort, Mathieu Seurin, Xinrui Li et al.

Our understanding of the world depends highly on our capacity to produce intuitive and simplified representations which can be easily used to solve problems. We reproduce this simplification process using a neural network to build a low dimensional state representation of the world from images acquired by a robot. As in Jonschkowski et al. 2015, we learn in an unsupervised way using prior knowledge about the world as loss functions called robotic priors and extend this approach to high dimension richer images to learn a 3D representation of the hand position of a robot from RGB images. We propose a quantitative evaluation of the learned representation using nearest neighbors in the state space that allows to assess its quality and show both the potential and limitations of robotic priors in realistic environments. We augment image size, add distractors and domain randomization, all crucial components to achieve transfer learning to real robots. Finally, we also contribute a new prior to improve the robustness of the representation. The applications of such low dimensional state representation range from easing reinforcement learning (RL) and knowledge transfer across tasks, to facilitating learning from raw data with more efficient and compact high level representations. The results show that the robotic prior approach is able to extract high level representation as the 3D position of an arm and organize it into a compact and coherent space of states in a challenging dataset.