CL AI | Apr 11, 2025

Large Language Models Could Be Rote Learners

Yuyang Xu, Renjun Hu, Haochao Ying, Jian Wu, Xing Shi, Wei Lin

arXiv:2504.08300v4 | 2 citations | h-index: 4

Originality: Incremental advance

AI Analysis

This work addresses the reliability of LLM evaluation, which matters to both researchers and practitioners, but it is incremental because it builds on existing concerns about benchmark contamination.

The study tackles benchmark contamination in the evaluation of Large Language Models (LLMs) by reframing it as an inherent aspect of learning. It finds that LLMs perform worse on memorized multiple-choice questions than on non-memorized ones, which indicates that rote memorization coexists with genuine capability learning. The proposed TrinEval framework reduces memorization while preserving knowledge assessment, and its evaluation suggests that common LLMs may memorize by rote 20.5% of knowledge points on average in MMLU.

Multiple-choice question (MCQ) benchmarks are widely used for evaluating Large Language Models (LLMs), yet their reliability is undermined by benchmark contamination. In this study, we reframe contamination as an inherent aspect of learning and seek to disentangle genuine capability acquisition from superficial memorization in LLM evaluation. First, by analyzing model performance under different memorization conditions, we uncover a counterintuitive trend: LLMs perform worse on memorized MCQs than on non-memorized ones, indicating the coexistence of two distinct learning phenomena, i.e., rote memorization and genuine capability learning. To disentangle them, we propose TrinEval, a novel evaluation framework reformulating MCQs into an alternative trinity format, reducing memorization while preserving knowledge assessment. Experiments validate TrinEval's effectiveness in reformulation, and its evaluation reveals that common LLMs may memorize by rote 20.5% of knowledge points (in MMLU on average).
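
The comparison between memorized and non-memorized MCQs described above can be illustrated with a minimal sketch. It assumes each question has already been labeled with a memorization condition by some separate procedure; the abstract does not say how that labeling is done, so the labels, the function name `accuracy_by_condition`, and the toy records below are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def accuracy_by_condition(records):
    """Return {condition: accuracy} for an iterable of (condition, is_correct) pairs.

    `condition` is a hypothetical label such as "memorized" or "non_memorized".
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for condition, is_correct in records:
        totals[condition] += 1
        hits[condition] += int(is_correct)
    return {c: hits[c] / totals[c] for c in totals}


# Toy, made-up records used only to exercise the function (not data from the paper).
toy_records = [
    ("memorized", True), ("memorized", False),
    ("memorized", False), ("memorized", True),
    ("non_memorized", True), ("non_memorized", True),
    ("non_memorized", False), ("non_memorized", True),
]

result = accuracy_by_condition(toy_records)
# A positive gap means the model did better on the non-memorized group.
gap = result["non_memorized"] - result["memorized"]
```

With real evaluation logs, a positive `gap` would correspond to the counterintuitive trend the abstract reports, although the paper's own analysis may control for factors this sketch ignores.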
