YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering
This work addresses the need for robust, transparent, and scalable evaluation of LLMs in scientific contexts, supporting AI alignment and scientific inquiry.
The paper addresses the evaluation of large language models (LLMs) for scientific question answering. It introduces YESciEval, an open-source framework that combines fine-grained rubric-based assessment with reinforcement learning to reduce optimism bias in LLM evaluators, and it releases multidisciplinary datasets, including adversarial variants, with evaluation scores from multiple LLMs.
Large Language Models (LLMs) drive scientific question-answering on modern search engines, yet their evaluation robustness remains underexplored. We introduce YESciEval, an open-source framework that combines fine-grained rubric-based assessment with reinforcement learning to mitigate optimism bias in LLM evaluators. We release multidisciplinary science Q&A datasets, including adversarial variants, with evaluation scores from multiple LLMs. Independent of proprietary models and human feedback, our approach enables scalable, cost-free evaluation. By advancing reliable LLM-as-a-judge models, this work supports AI alignment and fosters robust, transparent evaluation essential for scientific inquiry.