CLEVA: Chinese Language Models EVAluation Platform
CLEVA addresses the problem of inconsistent and contaminated evaluations faced by researchers and developers working with Chinese LLMs. The contribution is incremental, since it applies existing evaluation concepts to a specific language domain.
The authors address the lack of a comprehensive, standardized evaluation platform for Chinese Large Language Models by developing CLEVA, which combines a standardized workflow, a leaderboard, and contamination mitigation strategies. They validate the platform through experiments with 23 models.
With the continuous emergence of Chinese Large Language Models (LLMs), how to evaluate a model's capabilities has become an increasingly significant issue. The absence of a comprehensive Chinese benchmark that thoroughly assesses a model's performance, the unstandardized and incomparable prompting procedure, and the prevalent risk of contamination pose major challenges in the current evaluation of Chinese LLMs. We present CLEVA, a user-friendly platform crafted to holistically evaluate Chinese LLMs. Our platform employs a standardized workflow to assess LLMs' performance across various dimensions, regularly updating a competitive leaderboard. To alleviate contamination, CLEVA curates a significant proportion of new data and develops a sampling strategy that guarantees a unique subset for each leaderboard round. Empowered by an easy-to-use interface that requires just a few mouse clicks and a model API, users can conduct a thorough evaluation with minimal coding. Large-scale experiments featuring 23 Chinese LLMs have validated CLEVA's efficacy.
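The abstract does not spell out how the sampling strategy produces a unique subset for each leaderboard round. As a minimal sketch of one way such a guarantee could be obtained, the Python below shuffles a pool of example IDs once with a fixed seed and cuts it into non-overlapping slices, one per round. Making the subsets fully disjoint is an assumption made here for illustration, and the function name `round_subsets`, its parameters, and the pool of integer IDs are all hypothetical; none of this is taken from CLEVA itself.

```python
import random


def round_subsets(example_ids, subset_size, num_rounds, seed=0):
    """Split a pool of example IDs into disjoint per-round subsets.

    Hypothetical helper: it illustrates non-overlapping sampling in
    general, not CLEVA's actual strategy.
    """
    if subset_size * num_rounds > len(example_ids):
        raise ValueError("pool too small for the requested disjoint rounds")

    # Shuffle a copy once with a seeded generator so the split is reproducible.
    shuffled = list(example_ids)
    random.Random(seed).shuffle(shuffled)

    # Consecutive slices of one permutation can never share an element.
    return [
        shuffled[r * subset_size:(r + 1) * subset_size]
        for r in range(num_rounds)
    ]


# Assumed toy pool: 1000 example IDs, 5 rounds of 100 examples each.
pool = list(range(1000))
subsets = round_subsets(pool, subset_size=100, num_rounds=5)
```

Under this scheme, a test example released in one round never reappears in a later one, at the cost of the pool bounding the number of rounds: once `subset_size * num_rounds` exceeds the pool size, new data has to be added.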