Towards Robust Evaluations of Continual Learning
This paper addresses the problem of unreliable benchmarks in continual learning research and calls for a community-wide reprioritization of effort.
The paper identifies that current continual learning evaluations are flawed and misleading, and proposes new experiment designs to better assess approaches, demonstrating them with various methods and datasets.
Experiments used in current continual learning research do not faithfully assess the fundamental challenges of learning continually. Instead of assessing performance on challenging and representative experiment designs, recent research has focused on increasing dataset difficulty while still using flawed experiment set-ups. We examine standard evaluations and show why they make some continual learning approaches look better than they are. We introduce desiderata for continual learning evaluations and explain why their absence creates misleading comparisons. Based on our desiderata, we then propose new experiment designs, which we demonstrate with various continual learning approaches and datasets. Our analysis calls for a reprioritization of research effort by the community.