CL · AI · Jul 27, 2023

SuperCLUE: A Comprehensive Chinese Large Language Model Benchmark

arXiv:2307.15020v1 · 75 citations · h-index: 22
Originality: Synthesis-oriented
AI Analysis

This work addresses a gap in assessing how well LLMs match user preferences in Chinese applications, though the contribution is incremental because it builds on existing benchmarks such as CLUE.

The authors tackle the problem of evaluating large language models (LLMs) in real-world scenarios by proposing SuperCLUE, a comprehensive Chinese benchmark that combines real user queries, open-ended dialogues, and closed-ended questions. They show that accuracy on closed-ended questions alone is insufficient to reflect human preferences, and that GPT-4 can reliably evaluate these preferences in Chinese contexts.

Large language models (LLMs) have shown the potential to be integrated into people's daily lives. Therefore, user preference is the most critical criterion for assessing LLMs' performance in real-world scenarios. However, existing benchmarks mainly focus on measuring models' accuracy using multiple-choice questions, which limits the understanding of their capabilities in real applications. We fill this gap by proposing a comprehensive Chinese benchmark, SuperCLUE, named after another popular Chinese LLM benchmark, CLUE. SuperCLUE encompasses three sub-tasks: actual users' queries and ratings derived from an LLM battle platform (CArena), open-ended questions with single- and multi-turn dialogues (OPEN), and closed-ended questions with the same stems as the open-ended single-turn ones (CLOSE). Our study shows that accuracy on closed-ended questions is insufficient to reflect the human preferences observed on open-ended ones. At the same time, the two question types can complement each other to predict actual user preferences. We also demonstrate that GPT-4 is a reliable judge for automatically evaluating human preferences on open-ended questions in a Chinese context. Our benchmark will be released at https://www.CLUEbenchmarks.com
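
As a rough illustration of how ratings from a battle platform such as CArena could be turned into a per-model preference score, the sketch below computes simple win rates from pairwise outcomes. It is not the authors' scoring method: the record format, the model names, the sample battles, and the choice to count a tie as half a win for each side are all assumptions made for this example.

```python
from collections import defaultdict

VALID_OUTCOMES = {"a", "b", "tie"}

def win_rates(battles):
    """Return {model: win rate} from (model_a, model_b, outcome) records.

    outcome is "a" (model_a preferred), "b" (model_b preferred), or "tie".
    Assumption for this sketch: a tie counts as half a win for each model.
    """
    wins = defaultdict(float)
    games = defaultdict(int)
    for model_a, model_b, outcome in battles:
        if outcome not in VALID_OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome!r}")
        games[model_a] += 1
        games[model_b] += 1
        if outcome == "a":
            wins[model_a] += 1.0
        elif outcome == "b":
            wins[model_b] += 1.0
        else:  # tie: split the credit
            wins[model_a] += 0.5
            wins[model_b] += 0.5
    return {model: wins[model] / games[model] for model in games}

# Hypothetical battle records; the model names are placeholders, not real results.
battles = [
    ("model_x", "model_y", "a"),
    ("model_x", "model_z", "tie"),
    ("model_y", "model_z", "b"),
    ("model_x", "model_y", "a"),
]
rates = win_rates(battles)
ranking = sorted(rates, key=rates.get, reverse=True)
```

A ranking produced this way could then be compared with a ranking by CLOSE accuracy to see where the two disagree, which is the kind of mismatch the abstract describes.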

