CL AI LG PF
Dec 19, 2024

MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark

Qihao Zhao, Yangyu Huang, Tengchao Lv, Lei Cui, Qinzheng Sun, Shaoguang Mao, Xin Zhang, Ying Xin, Qiufeng Yin, Scarlett Li, Furu Wei

arXiv:2412.15194v114.938 citations | h-index: 16 | Has Code

Originality: Synthesis-oriented

AI Analysis

MMLU-CF addresses the unreliable evaluation results that data leakage causes for LLM researchers and developers, though the contribution is incremental because it builds on existing benchmarks such as MMLU.

The paper tackles benchmark contamination in the evaluation of large language models (LLMs) by proposing MMLU-CF, a contamination-free multi-task language understanding benchmark. Models score lower on it than on contaminated benchmarks: GPT-4o reaches 73.4% in the 5-shot setting and 71.9% in the 0-shot setting on the test set.

Multiple-choice question (MCQ) datasets like Massive Multitask Language Understanding (MMLU) are widely used to evaluate the commonsense, understanding, and problem-solving abilities of large language models (LLMs). However, the open-source nature of these benchmarks and the broad sources of training data for LLMs have inevitably led to benchmark contamination, resulting in unreliable evaluation results. To alleviate this issue, we propose a contamination-free and more challenging MCQ benchmark called MMLU-CF. This benchmark reassesses LLMs' understanding of world knowledge by averting both unintentional and malicious data leakage. To avoid unintentional data leakage, we source data from a broader domain and design three decontamination rules. To prevent malicious data leakage, we divide the benchmark into validation and test sets with similar difficulty and subject distributions. The test set remains closed-source to ensure reliable results, while the validation set is publicly available to promote transparency and facilitate independent verification. Our evaluation of mainstream LLMs reveals that the powerful GPT-4o achieves only a 5-shot score of 73.4% and a 0-shot score of 71.9% on the test set, which indicates the effectiveness of our approach in creating a more rigorous and contamination-free evaluation standard. The GitHub repository is available at https://github.com/microsoft/MMLU-CF and the dataset is available at https://huggingface.co/datasets/microsoft/MMLU-CF.
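
The abstract states that the validation and test sets have similar difficulty and subject distributions but does not say how the split is made. One common way to get that property is a stratified split, sketched below. This is an illustration under assumptions, not the authors' procedure: the function name `stratified_split`, the `subject` and `difficulty` fields, and the `val_fraction` parameter are all hypothetical, and the sketch assumes each question already carries a difficulty label.

```python
import random
from collections import defaultdict


def stratified_split(questions, val_fraction=0.5, seed=0):
    """Split questions into (validation, test) so that every
    (subject, difficulty) group contributes the same share to each set.

    Hypothetical sketch: field names and parameters are assumptions,
    not taken from the MMLU-CF paper or repository.
    """
    rng = random.Random(seed)

    # Group questions by their (subject, difficulty) stratum.
    strata = defaultdict(list)
    for q in questions:
        strata[(q["subject"], q["difficulty"])].append(q)

    validation, test = [], []
    # Iterate in sorted order so the result is reproducible for a given seed.
    for key in sorted(strata):
        bucket = strata[key]
        rng.shuffle(bucket)
        cut = round(len(bucket) * val_fraction)
        validation.extend(bucket[:cut])
        test.extend(bucket[cut:])
    return validation, test


if __name__ == "__main__":
    # Toy data: 2 subjects x 2 difficulty levels x 10 questions each.
    toy = [
        {"id": f"{s}-{d}-{i}", "subject": s, "difficulty": d}
        for s in ("math", "law")
        for d in ("easy", "hard")
        for i in range(10)
    ]
    val, test = stratified_split(toy, val_fraction=0.2, seed=1)
    print(len(val), len(test))  # 8 32
```

Because the cut is applied inside each stratum, both sets end up with the same proportions of subjects and difficulty levels, and the larger set can then be withheld while the smaller one is published.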

