AI · Mar 4, 2025

Memorize or Generalize? Evaluating LLM Code Generation with Code Rewriting

arXiv:2503.02296v2 · 5 citations · h-index: 8
Originality: Incremental advance
AI Analysis

This paper addresses the debate over memorization versus generalization in LLM code generation and provides a method for evaluating harmful memorization. The advance is incremental, but it offers specific insights for AI safety and code generation applications.

The paper tackles the problem of distinguishing harmful memorization from benign code reuse in LLM code generation by introducing a semantic perturbation method and a Memorization Risk Index (MRI). It finds that memorization does not increase with model scale, supervised fine-tuning improves accuracy but introduces memorization, and reinforcement learning with PPO achieves a better trade-off.

Large language models (LLMs) have recently demonstrated exceptional code generation capabilities. However, there is a growing debate over whether LLMs are mostly memorizing (i.e., replicating or reusing large parts of their training data) or generalizing (i.e., going beyond their training data). Existing evaluations largely proxy memorization with surface or structural similarity, thereby conflating benign reuse of repeated code with harmful recall and neglecting task correctness under semantic variation. We define harmful memorization behaviorally as failure at high similarity and introduce a semantic perturbation method, code rewriting, which rewrites a given coding task's answer into a semantically different answer at a similar difficulty level and then reverse-engineers a novel coding task from it. We further propose the Memorization Risk Index (MRI), a normalized score that combines two signals: (i) how similar the model's answer for the rewritten task is to the original ground-truth solution, and (ii) how much performance drops from the original task to its rewritten counterpart. MRI is high only when both conditions hold -- when the model outputs similar code but fails the perturbed task -- thereby capturing harmful memorization rather than benign reuse of repeated code. Empirical evaluations on the code generation benchmarks MBPP+ and BigCodeBench reveal that (1) memorization does not increase with larger models and in many cases is alleviated as they scale; (2) supervised fine-tuning (SFT) improves accuracy while introducing memorization; (3) reinforcement learning with proximal policy optimization (PPO) achieves a more balanced trade-off between memorization and generalization.
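
The way MRI combines its two signals can be illustrated with a small sketch. The abstract does not give the exact formula, so everything below is an assumption made for illustration: the product form, the relative-drop normalization, the function names, and the use of `difflib.SequenceMatcher` as a stand-in similarity measure. It is not the paper's definition, only one simple score that is high only when both signals are high.

```python
from difflib import SequenceMatcher


def similarity(generated: str, original_solution: str) -> float:
    """Stand-in similarity in [0, 1] (assumption: the paper may use another metric)."""
    return SequenceMatcher(None, generated, original_solution).ratio()


def relative_drop(acc_original: float, acc_rewritten: float) -> float:
    """Performance drop from original to rewritten task, normalized to [0, 1].

    Assumption: the drop is taken relative to original accuracy and
    clipped at zero, so an improvement on the rewritten task counts as no drop.
    """
    if acc_original <= 0.0:
        return 0.0
    return max(0.0, (acc_original - acc_rewritten) / acc_original)


def memorization_risk(sim: float, acc_original: float, acc_rewritten: float) -> float:
    """Hypothetical MRI-style score: product of similarity and relative drop.

    A product is near zero whenever either signal is near zero, which
    mirrors the 'high only when both conditions hold' property.
    """
    return sim * relative_drop(acc_original, acc_rewritten)


if __name__ == "__main__":
    # Similar code, no performance drop: benign reuse, low risk.
    print(memorization_risk(0.9, 0.8, 0.8))
    # Dissimilar code, large drop: failure without recall, low risk.
    print(memorization_risk(0.0, 0.8, 0.2))
    # Similar code and large drop: the harmful-memorization case.
    print(memorization_risk(1.0, 0.8, 0.4))
```

Under these assumptions, a model that reproduces the original solution on the rewritten task but still passes it scores zero, which matches the distinction the abstract draws between benign reuse and harmful recall.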
