CL · May 22, 2025

A Japanese Language Model and Three New Evaluation Benchmarks for Pharmaceutical NLP

Shinnosuke Ono, Issey Sukeda, Takuro Fujii, Kosei Buma, Shunsuke Sasaki

arXiv:2505.16661v29.64 citations · h-index: 24 · Has Code · IJCNLP-AACL

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of limited NLP resources for Japanese pharmaceutical applications, providing a practical model and benchmarks for researchers and practitioners in healthcare NLP, though it is incremental in adapting existing methods to a specific domain.

The authors address the lack of Japanese language models for pharmaceutical NLP by continually pretraining a domain-specific model on 10 billion tokens (2 billion Japanese pharmaceutical and 8 billion English biomedical). The model outperforms open-source medical LLMs and is competitive with commercial ones on three new benchmarks. On one of them, SogoCheck, even GPT-4o performs poorly, which suggests that cross-sentence consistency reasoning remains an open challenge.

We present a Japanese domain-specific language model for the pharmaceutical field, developed through continual pretraining on 2 billion Japanese pharmaceutical tokens and 8 billion English biomedical tokens. To enable rigorous evaluation, we introduce three new benchmarks: YakugakuQA, based on national pharmacist licensing exams; NayoseQA, which tests cross-lingual synonym and terminology normalization; and SogoCheck, a novel task designed to assess consistency reasoning between paired statements. We evaluate our model against both open-source medical LLMs and commercial models, including GPT-4o. Results show that our domain-specific model outperforms existing open models and achieves competitive performance with commercial ones, particularly on terminology-heavy and knowledge-based tasks. Interestingly, even GPT-4o performs poorly on SogoCheck, suggesting that cross-sentence consistency reasoning remains an open challenge. Our benchmark suite offers a broader diagnostic lens for pharmaceutical NLP, covering factual recall, lexical variation, and logical consistency. This work demonstrates the feasibility of building practical, secure, and cost-effective language models for Japanese domain-specific applications, and provides reusable evaluation resources for future research in pharmaceutical and healthcare NLP. Our model, code, and datasets are released at https://github.com/EQUES-Inc/pharma-LLM-eval.
