CL | Sep 20, 2024

JMedBench: A Benchmark for Evaluating Japanese Biomedical Large Language Models

Junfeng Jiang, Jiahao Huang, Akiko Aizawa

arXiv:2409.13317v114.623 citations | h-index: 5 | Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the shortage of evaluation resources for Japanese biomedical LLMs and should facilitate future research in this domain, though it is incremental in that it builds on existing benchmarking approaches.

The authors address the lack of a comprehensive benchmark for evaluating Japanese biomedical large language models by proposing JMedBench, which evaluates eight LLMs across four categories on 20 Japanese biomedical datasets spanning five tasks. They find that models with a better understanding of Japanese and richer biomedical knowledge perform better, that models not primarily designed for the Japanese biomedical domain can still perform unexpectedly well, and that there is still considerable room for improvement on certain tasks.

Recent developments in Japanese large language models (LLMs) primarily focus on general domains, with fewer advancements in Japanese biomedical LLMs. One obstacle is the absence of a comprehensive, large-scale benchmark for comparison. Furthermore, the resources for evaluating Japanese biomedical LLMs are insufficient. To advance this field, we propose a new benchmark including eight LLMs across four categories and 20 Japanese biomedical datasets across five tasks. Experimental results indicate that: (1) LLMs with a better understanding of Japanese and richer biomedical knowledge achieve better performance in Japanese biomedical tasks, (2) LLMs that are not mainly designed for Japanese biomedical domains can still perform unexpectedly well, and (3) there is still much room for improving the existing LLMs in certain Japanese biomedical tasks. Moreover, we offer insights that could further enhance development in this field. Our evaluation tools tailored to our benchmark, as well as the datasets, are publicly available at https://huggingface.co/datasets/Coldog2333/JMedBench to facilitate future research.
