Measuring Massive Multitask Chinese Understanding
This work addresses the evaluation of Chinese language models, a problem relevant to researchers and developers. Its contribution is incremental, since it introduces a new benchmark rather than a novel method.
The authors address the lack of capability assessments for large-scale Chinese language models by proposing a test that measures multitask accuracy across four domains: medicine, law, psychology, and education. In the zero-shot setting, the best-performing models outperformed the worst-performing models by nearly 18.6 percentage points on average. The highest zero-shot accuracy on any subtask was 0.693, in clinical medicine, whereas the highest zero-shot accuracy in law was only 0.239.
The development of large-scale Chinese language models is flourishing, yet corresponding capability assessments are lacking. We therefore propose a test to measure the multitask accuracy of large Chinese language models. The test covers four major domains (medicine, law, psychology, and education), with 15 subtasks in medicine and 8 subtasks in education. In the zero-shot setting, the best-performing models outperformed the worst-performing models by nearly 18.6 percentage points on average. Across the four major domains, the highest average zero-shot accuracy among all models was 0.512. Among the subdomains, GPT-3.5-turbo achieved a zero-shot accuracy of 0.693 in clinical medicine, the highest accuracy of any model on any subtask. All models performed poorly in the legal domain, where the highest zero-shot accuracy reached only 0.239. By comprehensively evaluating the breadth and depth of knowledge across multiple disciplines, this test can more accurately identify the shortcomings of the models.
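To make the reported quantities concrete, the following is a minimal sketch of how per-subtask accuracy, a domain-level average, and a best-versus-worst gap in percentage points could be computed. It is not the authors' evaluation code. The record layout, the function names, the toy answers, the `pharmacy` subtask label, and the choice to average a domain over its subtasks (rather than over individual questions) are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def subtask_accuracy(records):
    """Return accuracy per (domain, subtask).

    `records` is an iterable of (domain, subtask, predicted, gold) tuples.
    This layout is a hypothetical one chosen for the sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for domain, subtask, predicted, gold in records:
        key = (domain, subtask)
        total[key] += 1
        correct[key] += int(predicted == gold)
    return {key: correct[key] / total[key] for key in total}


def domain_accuracy(per_subtask):
    """Average subtask accuracies within each domain.

    Assumption: each subtask is weighted equally; the source does not
    state which averaging scheme it uses.
    """
    by_domain = defaultdict(list)
    for (domain, _subtask), acc in per_subtask.items():
        by_domain[domain].append(acc)
    return {domain: mean(accs) for domain, accs in by_domain.items()}


def best_worst_gap(model_scores):
    """Gap between the best and worst model scores, in percentage points."""
    scores = list(model_scores.values())
    return (max(scores) - min(scores)) * 100


# Toy, made-up answers (not data from the paper).
records = [
    ("medicine", "clinical_medicine", "A", "A"),
    ("medicine", "clinical_medicine", "B", "C"),
    ("medicine", "pharmacy", "D", "D"),
    ("law", "law", "A", "B"),
    ("law", "law", "C", "C"),
]

per_subtask = subtask_accuracy(records)
per_domain = domain_accuracy(per_subtask)
gap = best_worst_gap({"model_a": 0.5, "model_b": 0.25})
```

Note that a gap stated in percentage points is a difference of accuracies, not a ratio: a model at 0.50 and a model at 0.25 differ by 25 percentage points, even though the first is twice as accurate.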