RoBiologyDataChoiceQA: A Romanian Dataset for Improving Biology Understanding of Large Language Models
This work addresses the need for better LLM evaluation in low-resource languages and domain-specific applications, though its contribution is incremental, since it focuses on dataset creation and benchmarking.
The study tackles the limited evaluation of large language models (LLMs) in domain-specific and non-English contexts by introducing a Romanian-language dataset of about 14,000 multiple-choice biology questions. Benchmarking on this dataset reveals both strengths and limitations in LLM performance on specialized knowledge tasks.
In recent years, large language models (LLMs) have demonstrated significant potential across various natural language processing (NLP) tasks. However, their performance in domain-specific applications and non-English languages remains less explored. This study introduces a novel Romanian-language dataset of multiple-choice biology questions, carefully curated to assess LLM comprehension and reasoning capabilities in scientific contexts. Containing approximately 14,000 questions, the dataset provides a resource for evaluating and improving LLM performance in biology. We benchmark several popular LLMs, analyzing their accuracy, reasoning patterns, and ability to understand domain-specific terminology and linguistic nuances. We also run experiments to evaluate the impact of prompt engineering, fine-tuning, and other optimization techniques on model performance. Our findings highlight both the strengths and limitations of current LLMs in handling specialized knowledge tasks in low-resource languages, offering insights for future research and development.
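As a rough illustration of what accuracy benchmarking on multiple-choice questions involves, the sketch below formats a question as a prompt and scores predicted letters against an answer key. It is a minimal sketch, not the authors' evaluation code: the `Question` layout, the prompt wording, and the assumption that each question has exactly one correct letter are all hypothetical, since the dataset's actual schema is not described here.

```python
from dataclasses import dataclass


@dataclass
class Question:
    # Hypothetical record layout; the dataset's real schema is not given here.
    text: str
    options: dict[str, str]  # option letter -> option text
    answer: str              # letter of the single correct option (assumed)


def build_prompt(q: Question) -> str:
    """Format a question and its lettered options as a plain-text prompt."""
    lines = [q.text]
    lines += [f"{letter}. {option}" for letter, option in sorted(q.options.items())]
    lines.append("Answer with a single letter.")  # illustrative instruction
    return "\n".join(lines)


def accuracy(questions: list[Question], predictions: list[str]) -> float:
    """Fraction of questions whose predicted letter matches the answer key."""
    if len(questions) != len(predictions):
        raise ValueError("questions and predictions must have the same length")
    if not questions:
        return 0.0
    correct = sum(
        pred.strip().upper() == q.answer.upper()
        for q, pred in zip(questions, predictions)
    )
    return correct / len(questions)
```

In use, each prompt from `build_prompt` would be sent to the model under test, the returned letters collected into `predictions`, and `accuracy` computed per model or per prompt variant so that configurations can be compared.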