MedCL-Bench: Benchmarking stability-efficiency trade-offs and scaling in biomedical continual learning
MedCL-Bench provides a reproducible framework for auditing model updates in biomedical NLP, an incremental contribution that addresses the need for standardized evaluation in this domain.
The paper addresses the lack of a unified benchmark for continual learning in biomedical NLP by introducing MedCL-Bench, which evaluates 11 continual learning strategies on 10 datasets across 8 task orders. It finds that direct sequential fine-tuning causes catastrophic forgetting and that parameter-isolation offers the best retention per GPU-hour.
Medical language models must be updated as evidence and terminology evolve, yet sequential updating can trigger catastrophic forgetting. Although biomedical NLP has many static benchmarks, no unified, task-diverse benchmark exists for evaluating continual learning with standardized protocols, robustness to task order, and compute-aware reporting. We introduce MedCL-Bench, which streams ten biomedical NLP datasets spanning five task families and evaluates eleven continual learning strategies across eight task orders, reporting retention, transfer, and GPU-hour cost. Across backbones and task orders, direct sequential fine-tuning on incoming tasks induces catastrophic forgetting, that is, performance on prior tasks regresses after each update. Continual learning methods occupy distinct retention-compute frontiers: parameter-isolation provides the best retention per GPU-hour, replay offers strong protection at higher cost, and regularization yields limited benefit. Forgetting is task-dependent, with multi-label topic classification most vulnerable and constrained-output tasks more robust. MedCL-Bench provides a reproducible framework for auditing model updates before deployment.
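To make the retention and transfer quantities concrete, the following is a minimal sketch of how such metrics are commonly computed in the continual learning literature. It is not the benchmark's own code, and the abstract does not state the exact definitions MedCL-Bench uses. The sketch assumes a score matrix R in which R[i][j] is the score on task j after training on the first i+1 tasks of one task order. The function names and the toy numbers are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # R[i][j]: score on task j after training on task i


def average_final_score(R: Matrix) -> float:
    """Mean score over all tasks after the last task has been learned."""
    T = len(R)
    return sum(R[T - 1][j] for j in range(T)) / T


def backward_transfer(R: Matrix) -> float:
    """Mean change on each earlier task between the moment it was learned
    and the end of the stream. Negative values indicate forgetting."""
    T = len(R)
    if T < 2:
        raise ValueError("backward transfer needs at least two tasks")
    return sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)


def average_forgetting(R: Matrix) -> float:
    """Mean gap between the best score ever reached on an earlier task
    (before the final update) and its score at the end of the stream."""
    T = len(R)
    if T < 2:
        raise ValueError("forgetting needs at least two tasks")
    total = 0.0
    for j in range(T - 1):
        best_before_final = max(R[i][j] for i in range(j, T - 1))
        total += best_before_final - R[T - 1][j]
    return total / (T - 1)


# Toy three-task stream (made-up numbers, not results from the paper).
# Entries above the diagonal are unused by these metrics.
R_toy = [
    [0.90, 0.00, 0.00],
    [0.60, 0.80, 0.00],
    [0.50, 0.70, 0.85],
]

print(f"average final score: {average_final_score(R_toy):.3f}")
print(f"backward transfer:   {backward_transfer(R_toy):.3f}")
print(f"average forgetting:  {average_forgetting(R_toy):.3f}")
```

Under these assumed definitions, a benchmark that varies task order would compute one such matrix per order and report the metrics averaged over orders, alongside the GPU-hours each strategy consumed.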