CL, AI | Apr 11, 2025

Large Language Models Could Be Rote Learners

arXiv:2504.08300v4 | 2 citations | h-index: 4
Originality: Incremental advance
AI Analysis

This paper addresses the reliability of LLM evaluation, a concern for both researchers and practitioners, but its contribution is incremental because it builds on existing concerns about benchmark contamination.

The study tackles benchmark contamination in the evaluation of Large Language Models (LLMs) by reframing it as an inherent aspect of learning. It finds that LLMs perform worse on memorized multiple-choice questions than on non-memorized ones, which indicates that rote memorization coexists with genuine capability learning. The proposed TrinEval framework reformulates the questions to reduce memorization while preserving knowledge assessment, and its evaluation suggests that common LLMs may memorize by rote 20.5% of knowledge points in MMLU on average.

Multiple-choice question (MCQ) benchmarks are widely used for evaluating Large Language Models (LLMs), yet their reliability is undermined by benchmark contamination. In this study, we reframe contamination as an inherent aspect of learning and seek to disentangle genuine capability acquisition from superficial memorization in LLM evaluation. First, by analyzing model performance under different memorization conditions, we uncover a counterintuitive trend: LLMs perform worse on memorized MCQs than on non-memorized ones, indicating the coexistence of two distinct learning phenomena, i.e., rote memorization and genuine capability learning. To disentangle them, we propose TrinEval, a novel evaluation framework reformulating MCQs into an alternative trinity format, reducing memorization while preserving knowledge assessment. Experiments validate TrinEval's effectiveness in reformulation, and its evaluation reveals that common LLMs may memorize by rote 20.5% of knowledge points (in MMLU on average).
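The comparison "under different memorization conditions" can be pictured with a minimal sketch. It assumes each MCQ already carries a memorization flag, which the abstract does not say how to obtain. The names `ItemResult` and `accuracy_by_memorization` are hypothetical, and the data is a toy example, not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemResult:
    """One MCQ outcome. Both fields are hypothetical labels for illustration."""
    memorized: bool  # assumed to come from some external memorization check
    correct: bool    # whether the model chose the right option


def accuracy_by_memorization(results):
    """Return {memorized_flag: accuracy} for every group that has items."""
    totals = {True: 0, False: 0}
    hits = {True: 0, False: 0}
    for r in results:
        totals[r.memorized] += 1
        hits[r.memorized] += int(r.correct)
    return {flag: hits[flag] / totals[flag] for flag in totals if totals[flag] > 0}


# Toy data only: 4 memorized items (1 correct), 4 non-memorized items (3 correct).
toy_results = [
    ItemResult(memorized=True, correct=True),
    ItemResult(memorized=True, correct=False),
    ItemResult(memorized=True, correct=False),
    ItemResult(memorized=True, correct=False),
    ItemResult(memorized=False, correct=True),
    ItemResult(memorized=False, correct=True),
    ItemResult(memorized=False, correct=True),
    ItemResult(memorized=False, correct=False),
]

acc = accuracy_by_memorization(toy_results)
# A positive gap means the non-memorized group scored higher than the memorized one.
gap = acc[False] - acc[True]
```

A positive `gap` on real data would correspond to the counterintuitive trend the abstract reports. The sketch does not reproduce TrinEval's trinity reformulation, which the abstract does not specify.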

