AI · Mar 17

MedCL-Bench: Benchmarking stability-efficiency trade-offs and scaling in biomedical continual learning

Min Zeng, Shuang Zhou, Zaifu Zhan, Rui Zhang

arXiv:2603.16738 · 28.11 citations · h-index: 7

AI Analysis

MedCL-Bench provides a reproducible framework for auditing model updates in biomedical NLP, addressing an incremental need for standardized evaluation in this domain.

The paper tackles the lack of a unified benchmark for continual learning in biomedical NLP by introducing MedCL-Bench, which evaluates 11 strategies across 10 datasets. It finds that direct sequential fine-tuning causes catastrophic forgetting and that parameter-isolation offers the best retention per GPU-hour.

Medical language models must be updated as evidence and terminology evolve, yet sequential updating can trigger catastrophic forgetting. Although biomedical NLP has many static benchmarks, no unified, task-diverse benchmark exists for evaluating continual learning under standardized protocols, robustness to task order and compute-aware reporting. We introduce MedCL-Bench, which streams ten biomedical NLP datasets spanning five task families and evaluates eleven continual learning strategies across eight task orders, reporting retention, transfer, and GPU-hour cost. Across backbones and task orders, direct sequential fine-tuning on incoming tasks induces catastrophic forgetting, causing update-induced performance regressions on prior tasks. Continual learning methods occupy distinct retention-compute frontiers: parameter-isolation provides the best retention per GPU-hour, replay offers strong protection at higher cost, and regularization yields limited benefit. Forgetting is task-dependent, with multi-label topic classification most vulnerable and constrained-output tasks more robust. MedCL-Bench provides a reproducible framework for auditing model updates before deployment.
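The abstract names retention, transfer, and GPU-hour cost but does not give formulas, so the following is only a minimal sketch of how such quantities are commonly computed in the continual learning literature, not the paper's own definitions. It assumes a score matrix `scores[i][j]` holding the score on task `j` after training on the first `i + 1` tasks of a stream. The function names, the toy numbers, and the ratio used for "retention per GPU-hour" are all hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # scores[i][j]: score on task j after training on task i


def final_average(scores: Matrix) -> float:
    """Mean score over all tasks after the last task in the stream."""
    n_tasks = len(scores)
    return sum(scores[n_tasks - 1]) / n_tasks


def backward_transfer(scores: Matrix) -> float:
    """Average change on earlier tasks between 'just learned' and 'end of stream'.

    Negative values indicate forgetting. Standard definition, assumed here.
    """
    n_tasks = len(scores)
    deltas = [scores[n_tasks - 1][j] - scores[j][j] for j in range(n_tasks - 1)]
    return sum(deltas) / (n_tasks - 1)


def average_forgetting(scores: Matrix) -> float:
    """Average drop from each earlier task's best past score to its final score."""
    n_tasks = len(scores)
    drops = []
    for j in range(n_tasks - 1):
        best_past = max(scores[i][j] for i in range(j, n_tasks - 1))
        drops.append(best_past - scores[n_tasks - 1][j])
    return sum(drops) / (n_tasks - 1)


def retention_per_gpu_hour(scores: Matrix, gpu_hours: float) -> float:
    """Hypothetical cost-normalized score: final average divided by GPU-hours."""
    return final_average(scores) / gpu_hours


# Toy three-task stream (made-up numbers, not results from the paper).
toy_scores = [
    [0.80, 0.10, 0.05],
    [0.60, 0.70, 0.10],
    [0.50, 0.65, 0.90],
]

if __name__ == "__main__":
    print("final average:", final_average(toy_scores))
    print("backward transfer:", backward_transfer(toy_scores))
    print("average forgetting:", average_forgetting(toy_scores))
    print("score per GPU-hour:", retention_per_gpu_hour(toy_scores, gpu_hours=4.0))
```

To probe robustness to task order in the spirit of the benchmark, one could build one such matrix per task order and report the mean and spread of each metric across orders.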
