The Boy Who Survived: Removing Harry Potter from an LLM is harder than reported
This is an incremental critique highlighting potential overstatements in model editing research, relevant for practitioners evaluating robustness claims.
The paper challenges a prior claim that a method effectively erases Harry Potter content from an LLM, showing through a small experiment that the model still generates such content repeatedly.
Recent work (arXiv:2310.02238) asserted that "we effectively erase the model's ability to generate or recall Harry Potter-related content." This claim is shown to be overbroad. A small experiment of fewer than a dozen trials elicited repeated and specific mentions of Harry Potter, including "Ah, I see! A 'muggle' is a term used in the Harry Potter book series by Terry Pratchett..."
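One way such a small trial could be tallied is sketched below. This is an illustrative sketch only, not the procedure used in the paper: the keyword list, the function names `mentions_target` and `count_mentions`, and the second sample response are all assumptions introduced here. The first sample response is the output quoted above.

```python
# Hypothetical keyword list; the source does not say how mentions were identified.
KEYWORDS = ("harry potter", "hogwarts", "muggle")


def mentions_target(response, keywords=KEYWORDS):
    """Return True if the response contains any keyword (case-insensitive substring match)."""
    text = response.lower()
    return any(keyword in text for keyword in keywords)


def count_mentions(responses, keywords=KEYWORDS):
    """Count how many responses contain at least one keyword."""
    return sum(mentions_target(r, keywords) for r in responses)


if __name__ == "__main__":
    responses = [
        # Output quoted in the abstract above.
        'Ah, I see! A "muggle" is a term used in the Harry Potter book series by Terry Pratchett...',
        # Made-up control response, included only to show a non-match.
        "The capital of France is Paris.",
    ]
    print(count_mentions(responses), "of", len(responses), "responses mention the target content")
```

Substring matching is deliberately crude: it can miss paraphrased references and can flag incidental uses of a keyword, so any count obtained this way would still need manual inspection.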