CL AI LG Mar 6, 2024

The Boy Who Survived: Removing Harry Potter from an LLM is harder than reported

arXiv:2403.12082v14.29 citations h-index: 1

Originality: Synthesis-oriented

AI Analysis

This is an incremental critique highlighting potential overstatements in model editing research, relevant for practitioners evaluating robustness claims.

The paper challenges a prior claim that a method effectively erases Harry Potter content from an LLM, showing through a small experiment that the model still generates such content repeatedly.

Recent work (arXiv:2310.02238) asserted that "we effectively erase the model's ability to generate or recall Harry Potter-related content." This claim is shown to be overbroad. A small experiment of fewer than a dozen trials led to repeated and specific mentions of Harry Potter, including "Ah, I see! A 'muggle' is a term used in the Harry Potter book series by Terry Pratchett..."
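
As a rough illustration of how such a check could be run, the sketch below counts how many model responses in a small batch mention Harry Potter-related terms. It is not the paper's procedure: the keyword list `HP_TERMS`, the helper names `find_mentions` and `count_leaky_trials`, and the second sample response are all hypothetical. Only the first sample response is taken from the abstract above.

```python
import re

# Hypothetical keyword list; the criteria actually used in the paper are not stated here.
HP_TERMS = ["harry potter", "hogwarts", "muggle", "quidditch", "voldemort"]


def find_mentions(response, terms=HP_TERMS):
    """Return the terms that occur in the response as whole words or phrases.

    Matching is case-insensitive. Plural or inflected forms (e.g. "muggles")
    are not matched by this simple check.
    """
    text = response.lower()
    return [t for t in terms if re.search(r"\b" + re.escape(t) + r"\b", text)]


def count_leaky_trials(responses, terms=HP_TERMS):
    """Count the responses that contain at least one listed term."""
    return sum(1 for r in responses if find_mentions(r, terms))


# First response is quoted from the abstract; the second is an invented placeholder.
responses = [
    'Ah, I see! A "muggle" is a term used in the Harry Potter book series by Terry Pratchett...',
    "I'm not sure what you mean.",
]

print(find_mentions(responses[0]))
print(count_leaky_trials(responses), "of", len(responses), "trials mention the topic")
```

A keyword match like this only flags surface mentions. It would not catch paraphrased recall, and it treats a wrong attribution (such as the author named in the quoted response) the same as a correct one, so any count it produces is a lower bound that still needs manual review.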
