CL AI, Aug 26, 2024

Evaluating ChatGPT on Nuclear Domain-Specific Data

Muhammad Anwar, Mischa de Costa, Issam Hammad, Daniel Lau

arXiv:2409.00090v1, 1 citation, h-index: 9

Originality: Synthesis-oriented

AI Analysis

The paper addresses the problem of LLM hallucinations for users working with nuclear data, but its contribution is incremental: it applies existing methods to a new domain.

This paper evaluated ChatGPT's performance on nuclear domain-specific Q&A tasks, finding that a Retrieval Augmented Generation (RAG) approach improved accuracy and relevance compared to standalone LLM responses.

This paper examines the application of ChatGPT, a large language model (LLM), for question-and-answer (Q&A) tasks in the highly specialized field of nuclear data. The primary focus is on evaluating ChatGPT's performance on a curated test dataset, comparing the outcomes of a standalone LLM with those generated through a Retrieval Augmented Generation (RAG) approach. LLMs, despite their recent advancements, are prone to generating incorrect or 'hallucinated' information, which is a significant limitation in applications requiring high accuracy and reliability. This study explores the potential of utilizing RAG in LLMs, a method that integrates external knowledge bases and sophisticated retrieval techniques to enhance the accuracy and relevance of generated outputs. In this context, the paper evaluates ChatGPT's ability to answer domain-specific questions, employing two methodologies: A) direct response from the LLM, and B) response from the LLM within a RAG framework. The effectiveness of these methods is assessed through a dual mechanism of human and LLM evaluation, scoring the responses for correctness and other metrics. The findings underscore the improvement in performance when incorporating a RAG pipeline in an LLM, particularly in generating more accurate and contextually appropriate responses for nuclear domain-specific queries. Additionally, the paper highlights alternative approaches to further refine and improve the quality of answers in such specialized domains.
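The contrast between the two methodologies (A: direct response, B: response within a RAG framework) can be illustrated with a minimal sketch. This is not the authors' pipeline: the bag-of-words retriever, the prompt wording, the function names (`retrieve`, `direct_prompt`, `rag_prompt`), and the placeholder passages are all assumptions made for illustration. A real system would likely use an embedding-based retriever and would send the resulting prompt to an LLM.

```python
import math
import re
from collections import Counter

# Hypothetical placeholder knowledge base; these are not real nuclear data.
PASSAGES = [
    "Placeholder passage about reactor coolant chemistry.",
    "Placeholder passage about spent fuel storage procedures.",
    "Placeholder passage about radiation dose limits for workers.",
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, passages, k=1):
    """Return the k passages most similar to the question (toy retriever)."""
    q = Counter(tokenize(question))
    ranked = sorted(
        passages,
        key=lambda p: cosine(q, Counter(tokenize(p))),
        reverse=True,
    )
    return ranked[:k]

def direct_prompt(question):
    """Method A (assumed form): the question is sent to the LLM as-is."""
    return f"Question: {question}\nAnswer:"

def rag_prompt(question, passages, k=1):
    """Method B (assumed form): retrieved passages are prepended as context."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages, k))
    return (
        "Use only the context below to answer.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )

if __name__ == "__main__":
    question = "What are the dose limits for workers?"
    print(direct_prompt(question))
    print()
    print(rag_prompt(question, PASSAGES))
```

The only difference between the two prompts is the retrieved context, which is the element the study credits with improving accuracy and relevance.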
