CL, LG · Jun 19, 2025

Reviving Your MNEME: Predicting The Side Effects of LLM Unlearning and Fine-Tuning via Sparse Model Diffing

Aly M. Kassem, Zhuan Shi, Negar Rostamzadeh, Golnoosh Farnadi

arXiv:2507.21084v1 · 4 citations · h-index: 7 · EMNLP

Originality: Incremental advance

AI Analysis

This work addresses a critical issue for LLM developers and users by providing a scalable tool for managing changes in model behavior. Its contribution is incremental, since it builds on existing evaluation methods.

The paper tackles the problem of detecting unintended side effects in large language models after fine-tuning or unlearning, introducing MNEME, a framework that uses sparse model diffing to predict these effects with up to 95% accuracy across various scenarios.

Large language models (LLMs) are frequently fine-tuned or unlearned to adapt to new tasks or eliminate undesirable behaviors. While existing evaluation methods assess performance after such interventions, there remains no general approach for detecting unintended side effects, such as unlearning biology content degrading performance on chemistry tasks, particularly when these effects are unpredictable or emergent. To address this issue, we introduce MNEME, Model diffiNg for Evaluating Mechanistic Effects, a lightweight framework for identifying these side effects using sparse model diffing. MNEME compares base and fine-tuned models on task-agnostic data (for example, The Pile, LMSYS-Chat-1M) without access to fine-tuning data to isolate behavioral shifts. Applied to five LLMs across three scenarios: WMDP knowledge unlearning, emergent misalignment, and benign fine-tuning, MNEME achieves up to 95 percent accuracy in predicting side effects, aligning with known benchmarks and requiring no custom heuristics. Furthermore, we show that retraining on high-activation samples can partially reverse these effects. Our results demonstrate that sparse probing and diffing offer a scalable and automated lens into fine-tuning-induced model changes, providing practical tools for understanding and managing LLM behavior.
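The core idea of diffing a base and a fine-tuned model on task-agnostic text can be illustrated with a toy sketch. This is not the authors' implementation: the two `*_features` functions are hypothetical stand-ins that count keywords, where a real pipeline would read sparse feature activations from the models themselves. The feature names, the sample sentences, and the helpers `diff_features` and `top_shifted_samples` are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Features = Dict[str, float]
FeatureFn = Callable[[str], Features]


def base_features(text: str) -> Features:
    """Hypothetical stand-in for the base model's sparse feature activations."""
    words = text.lower().split()
    return {
        "biology": float(sum(w in {"cell", "virus", "protein"} for w in words)),
        "chemistry": float(sum(w in {"acid", "molecule", "protein"} for w in words)),
        "cooking": float(sum(w in {"bake", "oven"} for w in words)),
    }


def tuned_features(text: str) -> Features:
    """Hypothetical stand-in for a model after unlearning biology.

    Biology is removed as intended; chemistry is dampened as a side effect.
    """
    base = base_features(text)
    return {
        "biology": 0.0,
        "chemistry": 0.5 * base["chemistry"],
        "cooking": base["cooking"],
    }


def diff_features(samples: List[str], base: FeatureFn, tuned: FeatureFn) -> Features:
    """Mean per-feature activation change (tuned minus base) over the samples."""
    totals: Features = {}
    for text in samples:
        b, t = base(text), tuned(text)
        for name, value in b.items():
            totals[name] = totals.get(name, 0.0) + (t.get(name, 0.0) - value)
    return {name: total / len(samples) for name, total in totals.items()}


def top_shifted_samples(
    samples: List[str], base: FeatureFn, tuned: FeatureFn, feature: str, k: int = 2
) -> List[Tuple[float, str]]:
    """Samples with the largest absolute change on one feature, largest first."""
    scored = [
        (abs(tuned(s).get(feature, 0.0) - base(s).get(feature, 0.0)), s)
        for s in samples
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Task-agnostic probe text (toy examples, not drawn from any real corpus).
SAMPLES = [
    "the virus enters the cell",
    "an acid reacts with the molecule",
    "bake the bread in the oven",
    "the protein folds inside the cell",
]

if __name__ == "__main__":
    shifts = diff_features(SAMPLES, base_features, tuned_features)
    for name, delta in sorted(shifts.items(), key=lambda item: item[1]):
        print(f"{name}: {delta:+.3f}")
    print(top_shifted_samples(SAMPLES, base_features, tuned_features, "chemistry", k=1))
```

In this toy setup the diff flags the intended drop on the biology feature and also a smaller, unintended drop on chemistry, while cooking is unchanged. The highest-shift samples for a flagged feature play the role of the "high-activation samples" that the abstract mentions as candidates for retraining.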
