CL · Nov 11, 2024

Continual Memorization of Factoids in Language Models

Howard Chen, Jiayi Geng, Adithya Bhaskar, Dan Friedman, Danqi Chen

Princeton

arXiv:2411.07175v2 · 6.67 citations · h-index: 11 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses knowledge obsolescence in language models used in AI applications, but the contribution is incremental because it builds on existing continual learning approaches.

The paper tackles the problem of language models forgetting previously memorized factoids when fine-tuned on new data, showing that mixing random or generic data during training (REMIX) effectively mitigates forgetting and outperforms existing continual learning methods.

As new knowledge rapidly accumulates, language models (LMs) with pretrained knowledge quickly become obsolete. A common approach to updating LMs is fine-tuning them directly on new knowledge. However, recent studies have shown that fine-tuning for memorization may be ineffective in storing knowledge or may exacerbate hallucinations. In this work, we introduce a setting we call continual memorization, where a model must memorize and retain a set of factoids through multiple stages of fine-tuning on subsequent datasets. We characterize the forgetting patterns through extensive experiments and show that LMs widely suffer from forgetting, especially when needing to memorize factoids in the second stage. We posit that forgetting can be alleviated by modifying training dynamics: (1) protecting the memorization process when learning factoids or (2) reducing interference from subsequent training stages. Intriguingly, we find that mixing randomly generated word sequences or generic data sampled from pretraining corpora at different training stages effectively mitigates forgetting (REMIX: Random and Generic Data Mixing). REMIX can recover performance from severe forgetting, outperforming replay methods and other continual learning baselines. We analyze how REMIX influences the learning process and find that robust memorization follows a distinct pattern: the model stores factoids in earlier layers than usual and diversifies the layers that retain them, which results in easier recall and manipulation of the learned factoids.
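
The data-mixing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names (`random_word_sequence`, `mix_in_random_data`), the mixing ratio, the sequence length, the toy vocabulary, and the placeholder factoid strings are all assumptions made for illustration, and only the random-word-sequence variant at a single training stage is shown.

```python
import random


def random_word_sequence(vocab, length, rng):
    """Return `length` words sampled uniformly (with replacement) from `vocab`."""
    return " ".join(rng.choice(vocab) for _ in range(length))


def mix_in_random_data(factoids, vocab, mix_ratio=1.0, seq_len=16, seed=0):
    """Combine factoid examples with random word sequences and shuffle them.

    Hypothetical helper: `mix_ratio` is the number of random sequences per
    factoid (an assumed knob, not a value taken from the paper).
    """
    rng = random.Random(seed)
    n_random = int(len(factoids) * mix_ratio)
    noise = [random_word_sequence(vocab, seq_len, rng) for _ in range(n_random)]
    mixed = list(factoids) + noise
    rng.shuffle(mixed)
    return mixed


# Placeholder data, invented purely for the example.
factoids = [f"key_{i} -> value_{i}" for i in range(4)]
vocab = ["apple", "river", "stone", "cloud", "lamp", "green"]

mixed = mix_in_random_data(factoids, vocab, mix_ratio=0.5, seq_len=8, seed=0)
for example in mixed:
    print(example)
```

In this sketch the mixed list would then be passed to whatever fine-tuning loop is in use. Swapping the random sequences for text sampled from a pretraining corpus would give the generic-data variant mentioned in the abstract.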
