Mechanistic Analysis of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning
This work addresses catastrophic forgetting in continual learning systems. For AI practitioners, its mechanistic insights are incremental, but they lay a foundation for developing targeted mitigation strategies.
The paper analyzes catastrophic forgetting in large language models during continual fine-tuning and identifies three primary mechanisms: gradient interference, representational drift, and loss landscape flattening. It finds that forgetting severity correlates strongly with task similarity (Pearson r = 0.87) and that 15-23% of attention heads are severely disrupted.
Large language models exhibit remarkable performance across diverse tasks through pre-training and fine-tuning paradigms. However, continual fine-tuning on sequential tasks induces catastrophic forgetting, where newly acquired knowledge interferes with previously learned capabilities. Despite widespread observations of this phenomenon, the mechanistic understanding remains limited. Here, we present a comprehensive mechanistic analysis of catastrophic forgetting in transformer-based LLMs during sequential fine-tuning. Through systematic experiments across multiple model scales (109B to 400B total parameters) and task sequences, we identify three primary mechanisms driving forgetting: gradient interference in attention weights, representational drift in intermediate layers, and loss landscape flattening. We demonstrate that forgetting severity correlates strongly with task similarity (Pearson r = 0.87) and gradient alignment metrics. Our analysis reveals that approximately 15 to 23 percent of attention heads undergo severe disruption during fine-tuning, with lower layers showing greater susceptibility. These findings establish mechanistic foundations for developing targeted mitigation strategies in continual learning systems.
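To make the gradient alignment and correlation measures mentioned above concrete, the following is a minimal sketch and not the authors' implementation. It assumes gradient alignment is measured as the cosine similarity between flattened per-task gradient vectors, which the text does not specify. The function names `cosine_alignment` and `pearson_r` are hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def cosine_alignment(grad_a: Sequence[float], grad_b: Sequence[float]) -> float:
    """Cosine similarity between two flattened gradient vectors.

    Assumption: values near -1 indicate conflicting updates (interference),
    values near +1 indicate aligned updates. The source does not define
    its gradient alignment metric; this is one common choice.
    """
    if len(grad_a) != len(grad_b):
        raise ValueError("gradient vectors must have the same length")
    dot = sum(a * b for a, b in zip(grad_a, grad_b))
    norm_a = math.sqrt(sum(a * a for a in grad_a))
    norm_b = math.sqrt(sum(b * b for b in grad_b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero gradient has no direction; treat it as unaligned.
        return 0.0
    return dot / (norm_a * norm_b)


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)
```

Under these assumptions, one would compute a forgetting score and a task-similarity score for each task pair, then pass the two lists to `pearson_r` to obtain a coefficient comparable to the reported r = 0.87. `cosine_alignment` would be applied to the gradients of the old-task and new-task losses with respect to the same parameters.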